Fastest way to compare text file contents

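I have a question that could help simplify my program. I have a file, text.txt, and I want to go through it and compare it against a list of words, words. Each time one of the words is found, it adds 1 to an integer.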

words = ['the', 'or', 'and', 'can', 'help', 'it', 'one', 'two']
ints = []
with open('text.txt') as file:
    for line in file:
        for part in line.split():
            for word in words:
                if word in part:
                    ints.append(1)

I am just wondering whether there is a faster way to do this. The text file may get larger, and the word list will get larger.

Source

2015-06-07 user1985351

Do you want to find the number of matches? – thefourtheye

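You can convert words to a set so that lookups are faster. This should improve the performance of the program, because looking up a value in a list has to walk the list one element at a time (O(n) time complexity), whereas with a set the lookup drops to O(1) on average (constant time), because sets use hashing to find elements.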

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}
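
One caveat worth checking before switching: the set lookup `part in words` matches whole whitespace-separated tokens, whereas the loop in the question uses `word in part`, which is a substring test on strings. The sketch below uses a made-up sample line (not from the question) and two hypothetical helper functions to show how the two counts can differ.

```python
words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}

# Hypothetical sample line, used only for illustration.
line = "there is another word for it"

def substring_count(line, words):
    # Behaviour of the question's loop: 'the' also counts inside 'there'.
    return sum(1 for part in line.split() for word in words if word in part)

def exact_count(line, words):
    # Behaviour of the set lookup: only whole tokens count.
    return sum(part in words for part in line.split())

print(substring_count(line, words))  # 5
print(exact_count(line, words))      # 1
```

If substring matches are actually wanted, the set lookup is not a drop-in replacement for the original loop.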

Then, whenever there is a match, you can count it with the sum function, like this:
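
The snippet itself is missing from this copy of the answer. A likely reconstruction, assuming it mirrors the equivalent loop and the list comprehension shown later in the answer, is sketched here. The function name `count_matches` and the in-memory sample text are additions for illustration only.

```python
import io

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}

def count_matches(lines, words):
    # Each `part in words` is True or False; sum() adds them up as 1 or 0.
    return sum(part in words for line in lines for part in line.split())

# Stand-in for open('text.txt'); a real file object yields lines the same way.
sample = io.StringIO("the cat and the dog\ncan it help\n")
print(count_matches(sample, words))  # 6
```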

Booleans and their integer equivalents

In Python, the results of boolean expressions are equal to 0 and 1 for False and True respectively.

>>> True == 1 
True 
>>> False == 0 
True 
>>> int(True) 
1 
>>> int(False) 
0 
>>> sum([True, True, True]) 
3 
>>> sum([True, False, True]) 
2

So whenever you check part in words, the result is either 0 or 1, and we sum all of those values.

The sum-based code is functionally equivalent to:

result = 0
with open('text.txt') as file:
    for line in file:
        for part in line.split():
            if part in words:
                result += 1

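Note: if you really want a list of 1s, you can simply turn the generator expression passed to sum into a list comprehension, as shown here. The resulting list holds a 1 for every matching token and a 0 for every non-matching one: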

with open('text.txt') as file: 
    print([int(part in words) for line in file for part in line.split()])

Word frequency

If you actually want the frequency of each individual word in words, you can use collections.Counter like this:

from collections import Counter 
with open('text.txt') as file: 
    c = Counter(part for line in file for part in line.split() if part in words)

This counts how many times each word in words occurs in the file.
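
As a rough illustration of how the resulting Counter can be queried, here is a small sketch. The helper name `word_frequencies` and the sample text are assumptions made for the example; the Counter expression is the one from the answer.

```python
from collections import Counter
import io

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}

def word_frequencies(lines, words):
    # Count only the tokens that appear in `words`.
    return Counter(part for line in lines for part in line.split() if part in words)

# Stand-in for open('text.txt'), used only for illustration.
sample = io.StringIO("the cat and the dog\ncan it help the cat\n")
c = word_frequencies(sample, words)

print(c['the'])          # 3
print(c['two'])          # 0, a word that never occurred counts as zero
print(c.most_common(1))  # [('the', 3)]
```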

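As per the comment, you can keep a dictionary in which positive words are stored with a positive score and negative words with a negative score, and sum the scores like this: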

words = {'happy': 1, 'good': 1, 'great': 1, 'no': -1, 'hate': -1} 
with open('text.txt') as file: 
    print(sum(words.get(part, 0) for line in file for part in line.split()))

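Here we use the dictionary's words.get method to fetch the value stored for each word; if the word is not found in the dictionary (it is neither a good word nor a bad word), the default value 0 is returned.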
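
Because line.split() splits on whitespace only, a token such as `Great,` or `hate.` would not match the dictionary keys. Here is a minimal sketch of one way to handle that, assuming that lowercasing and stripping surrounding punctuation is acceptable for the text in question. The helper names `score` and `tone`, the 'neutral' label, and the sample text are all hypothetical.

```python
import io
import string

words = {'happy': 1, 'good': 1, 'great': 1, 'no': -1, 'hate': -1}

def score(lines, words):
    total = 0
    for line in lines:
        for part in line.split():
            # Normalise the token: lowercase, drop leading/trailing punctuation.
            token = part.lower().strip(string.punctuation)
            total += words.get(token, 0)
    return total

def tone(total):
    # Assumed convention: the sign of the total decides the tone.
    if total > 0:
        return 'positive'
    if total < 0:
        return 'negative'
    return 'neutral'

# Stand-in for open('text.txt'), used only for illustration.
sample = io.StringIO("Great day, I am happy.\nI hate rain.\n")
total = score(sample, words)
print(total, tone(total))  # 1 positive
```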

Source

2015-06-07 14:58:26 thefourtheye

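Thanks, everyone. I ran 'timeit' on all the functions listed here, and yours was the fastest. As for why I am adding '1': I am checking whether an article is positive or negative. If there is a positive word it adds '1', and if it is negative, '-1'. Then it sums them up and shows whether the article has a positive or negative tone. Thanks again! – user1985351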

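@user1985351 OK, I have added a way to solve the actual problem you are trying to solve. Let me know whether it helps, otherwise I will remove it. Also, please include all of this information in the question itself. It will help future readers. – thefourtheye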

You can use set.intersection to find the intersection between a set and a list, so a more efficient approach is to put your words in a set and do:

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}
ints = []
with open('text.txt') as f:
    for line in f:
        for _ in range(len(words.intersection(line.split()))):
            ints.append(1)

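Note that the preceding solution is based on your code, which appends 1 to a list. If you want the final count, you can use a generator expression inside sum: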

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}
with open('text.txt') as f:
    # keep the result instead of discarding it
    total = sum(len(words.intersection(line.split())) for line in f)
print(total)
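
One thing to keep in mind with this approach: set.intersection returns each matching word at most once per line, so a word repeated within a line is counted once for that line. Whether that matters depends on what the count is for. The sketch below, using made-up sample lines, compares it with a per-token count.

```python
words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}

# Hypothetical sample data, used only for illustration.
lines = ["the cat and the dog", "the end"]

# Distinct matching words per line ('the' is counted once in the first line).
distinct_per_line = sum(len(words.intersection(line.split())) for line in lines)

# Every matching token, repeats included.
every_occurrence = sum(part in words for line in lines for part in line.split())

print(distinct_per_line)  # 3
print(every_occurrence)   # 4
```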

Source

2015-06-07 14:57:45 Kasramvd
