2015-09-09 57 views
2

Performance - searching for strings in a text file - Python

I have a set of dates:

dates1 = {'21/5/2015', '4/4/2015', '15/6/2015', '30/1/2015', '19/3/2015', '25/2/2015', '25/5/2015', '8/2/2015', '6/6/2015', '15/3/2015', '15/1/2015', '30/5/2015'} 

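The same dates appear in a text ("data" from now on). It is a pretty long text. I want to loop over the text, get the number of times each date appears in it, and then print the 5 dates with the most occurrences.

What I have now is this: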

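import heapq

def dates(data, dates1):
    lines = data.split("\n")
    dict_days = {}
    for day in dates1:
        count = 0
        # Scan every line once per date: O(len(dates1) * len(lines)).
        for line in lines:
            if day in line:
                count += 1
        dict_days[day] = count

    # Keep the 5 dates with the highest counts.
    newA = heapq.nlargest(5, dict_days, key=dict_days.get)

    print(newA)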

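I split the text into lines and create a dictionary. For each date in the set, I look for it in every line, and if it is found I increment its count by 1.

This works fine, but the method takes a looong time to run.

So what I am asking is whether anyone knows a more efficient way to do the same thing.

Any help would be much appreciated.

EDIT

I will try every answer and let you know. Thanks in advance.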

+3

Warning: 'if day in line:' is dangerous, because if day == '1/1/2015' it will also match inside the line '21/1/2015'. – DSM

+0

Use a regex instead of 'if day in line', and surround the tokens with '\b' if they are supposed to appear as whole words. – mpcabd

+0

Nice catch @DSM – taesu
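To make the pitfall from the comments concrete, here is a minimal sketch. The helper name contains_date is hypothetical and not from the thread; it assumes dates are delimited by non-word characters, so that '\b' is a suitable boundary.

```python
import re

def contains_date(day, line):
    """Return True if `day` occurs in `line` as a whole token (hypothetical helper)."""
    # re.escape keeps any regex metacharacters literal; \b demands a word
    # boundary on both sides, so a date embedded in a longer run of digits
    # is rejected.
    return re.search(r"\b" + re.escape(day) + r"\b", line) is not None

line = "Invoice dated 21/1/2015"
print("1/1/2015" in line)                # True: plain substring test gives a false positive
print(contains_date("1/1/2015", line))   # False: no boundary between '2' and '1'
print(contains_date("21/1/2015", line))  # True
```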

Answers

7

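Loop over the lines once, extract any date, and check whether it is in the set. If so, increment its count in a Counter dict, and at the end call Counter.most_common to get the 5 most common dates: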

dates1 = {'21/5/2015', '4/4/2015', '15/6/2015', '30/1/2015', '19/3/2015', '25/2/2015', '25/5/2015', '8/2/2015', '6/6/2015', '15/3/2015', '15/1/2015', '30/5/2015'} 


from collections import Counter
import re

def dates(data, dates1):
    lines = data.split("\n")
    dict_days = Counter()
    # Raw string so the backslashes reach the regex engine unchanged.
    r = re.compile(r"\d+/\d+/\d+")
    for line in lines:
        # search() returns only the first date-like match in the line.
        match = r.search(line)
        if match:
            dte = match.group()
            if dte in dates1:
                dict_days[dte] += 1
    return dict_days.most_common(5)

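This does a single pass over the list of lines, rather than one pass for each date in dates1.

For 100k lines, with the date string at the end of a string of 200+ characters: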

In [9]: from random import choice 

In [10]: dates1 = {'21/5/2015', '4/4/2015', '15/6/2015', '30/1/2015', '19/3/2015', '25/2/2015', '25/5/2015', '8/2/2015', '6/6/2015', '15/3/2015', '15/1/2015', '30/5/2015'} 

In [11]: dtes = list(dates1) 

In [12]: s = "the same dates appear in a text ('data' from now on). It's a pretty long text. I want to loop over the text and get the number of times each date appear in the text, then i print the 5 dates with more occurances. " 

In [13]: data = "\n".join([s+ choice(dtes) for _ in range(100000)]) 

In [14]: timeit dates(data,dates1) 
1 loops, best of 3: 662 ms per loop 

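If more than one date can appear per line, you can use findall:

def dates(data, dates1):
    lines = data.split("\n")
    r = re.compile(r"\d+/\d+/\d+")
    # findall() yields every date-like match in the line, not just the first.
    dict_days = Counter(dt for line in lines
                        for dt in r.findall(line) if dt in dates1)
    return dict_days.most_common(5)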

If data is not actually a file-like object and is a single string, just search the string itself:

def dates(data, dates1):
    r = re.compile(r"\d+/\d+/\d+")
    dict_days = Counter(dt for dt in r.findall(data) if dt in dates1)
    return dict_days.most_common(5)
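For the opposite case, where the text really does come from a file-like object, a minimal sketch could stream it line by line instead of loading everything into one string. The name dates_from_file is hypothetical, and io.StringIO stands in for an opened text file here.

```python
import io
import re
from collections import Counter

DATE_RE = re.compile(r"\d+/\d+/\d+")

def dates_from_file(fileobj, wanted):
    """Count wanted dates while reading `fileobj` one line at a time."""
    counts = Counter()
    for line in fileobj:
        # Keep only date-like matches that are in the wanted set.
        counts.update(dt for dt in DATE_RE.findall(line) if dt in wanted)
    return counts.most_common(5)

sample = io.StringIO("seen 4/4/2015 and 21/5/2015\nagain 4/4/2015\nother 1/1/2000\n")
print(dates_from_file(sample, {"21/5/2015", "4/4/2015"}))
# [('4/4/2015', 2), ('21/5/2015', 1)]
```

With a real file, the same function would be called as dates_from_file(f, dates1) inside a with open(...) as f: block.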

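Compiling the dates themselves into a regex seems to be the fastest approach on the test data; splitting each line into substrings performs pretty close to the search implementation:

def dates_split(data, dates1):
    lines = data.split("\n")
    # Whitespace-separated tokens must equal a date exactly to be counted.
    dict_days = Counter(dt for line in lines
                        for dt in line.split() if dt in dates1)
    return dict_days.most_common(5)

def dates_comp_date1(data, dates1):
    lines = data.split("\n")
    # Plain alternation also matches inside longer tokens,
    # e.g. '1/1/2015' inside '21/1/2015'.
    r = re.compile("|".join(dates1))
    dict_days = Counter(dt for line in lines for dt in r.findall(line))
    return dict_days.most_common(5)

Using the functions above: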

In [63]: timeit dates(data, dates1) 
1 loops, best of 3: 640 ms per loop 

In [64]: timeit dates_split(data, dates1) 
1 loops, best of 3: 535 ms per loop 

In [65]: timeit dates_comp_date1(data, dates1) 
1 loops, best of 3: 368 ms per loop 
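The compiled-alternation variant is the fastest in the timings above, but it inherits the substring issue raised in the comments on the question. A possible sketch that adds word boundaries follows; the name dates_comp_bounded is hypothetical, and this variant has not been timed against the others.

```python
import re
from collections import Counter

def dates_comp_bounded(data, dates1):
    """Count whole-token occurrences of each date in `data` (hypothetical variant)."""
    # re.escape keeps any metacharacters literal; the \b on each side rejects
    # a date embedded in a longer run of digits.
    alternatives = "|".join(re.escape(d) for d in dates1)
    r = re.compile(r"\b(?:" + alternatives + r")\b")
    return Counter(r.findall(data)).most_common(5)

data = "a 21/1/2015 b 1/1/2015\nc 21/1/2015 d 11/1/2015"
print(dates_comp_bounded(data, {"1/1/2015", "21/1/2015"}))
# [('21/1/2015', 2), ('1/1/2015', 1)]
```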
+1

The usual excellent answer :P –

+1

Looks good. Let me try this and I will let you know, sir. – NachoMiguel

+0

I'm not used to 're', but 'r = r.search(line)'? Won't that prevent every line but the first from being scanned? –
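On the question in the last comment: in the answer's code as shown, search is called once per line inside the loop, so every line is still scanned. What search limits is the number of matches per line. A minimal sketch of the difference, using made-up sample lines:

```python
import re

r = re.compile(r"\d+/\d+/\d+")
lines = ["first 4/4/2015 then 21/5/2015", "later 15/6/2015"]

# search() runs once per line, so every line is scanned,
# but only the first match in each line is reported.
first_per_line = [m.group() for m in map(r.search, lines) if m]

# findall() reports every non-overlapping match in each line.
all_matches = [dt for line in lines for dt in r.findall(line)]

print(first_per_line)  # ['4/4/2015', '15/6/2015']
print(all_matches)     # ['4/4/2015', '21/5/2015', '15/6/2015']
```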

4
Counter(word for word in my_text if word in my_dates) 

I think this would work pretty fast... and it is O(N) (ish).
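A runnable sketch of that one-liner, under the assumption that my_text is a single string whose tokens are separated by whitespace (iterating over a string directly would yield characters, not words). The sample text and the punctuation stripping are illustrative additions.

```python
from collections import Counter

my_dates = {"21/5/2015", "4/4/2015"}
my_text = "Today is 21/5/2015. Yesterday was 4/4/2015.\nMy birthday is 4/4/2015"

# Assumption: tokens are whitespace-separated; surrounding punctuation is
# stripped so that '21/5/2015.' still matches the set entry.
words = (word.strip(".,;:!?()'\"") for word in my_text.split())
counts = Counter(word for word in words if word in my_dates)

print(counts.most_common(5))  # [('4/4/2015', 2), ('21/5/2015', 1)]
```

Each token is checked against the set in constant time on average, which is where the roughly O(N) behaviour comes from.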

0

Why not just do:

dates = {'21/5/2015':0, '4/4/2015':0, '15/6/2015':0, '30/1/2015':0, '19/3/2015':0, '25/2/2015':0, '25/5/2015':0, '8/2/2015':0, '6/6/2015':0, '15/3/2015':0, '15/1/2015':0, '30/5/2015':0} 

def processDates(data):
    lines = data.split("\n")
    for line in lines:
        # Counts only lines that consist of exactly one date.
        if line in dates:
            dates[line] += 1

Then just sort dates by value.
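The sort-by-value step could look like the sketch below. The counts in the dictionary are made up for illustration; they are not results from the thread.

```python
# Made-up counts standing in for the filled-in `dates` dictionary.
dates = {'21/5/2015': 3, '4/4/2015': 7, '15/6/2015': 0, '30/1/2015': 5}

# Sort the (date, count) pairs by count, highest first, and keep the top 5.
top = sorted(dates.items(), key=lambda item: item[1], reverse=True)[:5]

print(top)
# [('4/4/2015', 7), ('30/1/2015', 5), ('21/5/2015', 3), ('15/6/2015', 0)]
```

For a large dictionary, heapq.nlargest(5, dates, key=dates.get), as used in the question, avoids sorting every entry.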

1

Use a regex to extract the data, and collections.Counter to find the most common:

import re
import collections

def dates(data, dates1):
    # Build one alternation pattern from the dates and count all matches.
    pattern = '|'.join(re.escape(x) for x in dates1)
    found = re.findall(pattern, data)
    counts = collections.Counter(found)
    print(counts.most_common(5))

dates1 = {'21/5/2015', '4/4/2015', '15/6/2015'}
data = 'Today is 21/5/2015. Yesterday is 4/4/2015.\nMy birthday is 4/4/2015'

dates(data, dates1)