Searching a large file for a list of words in Python

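I am new to Python. I have a list of words and a very large file. I want to delete the lines in the file that contain a word from the list.

The word list is given in sorted order and can be loaded during initialization. I am trying to find the best approach to this problem. Right now I am doing a linear search, and it takes too much time.

Any suggestions?

Source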

2012-07-13 user1524206

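The lines and words in the large file would need to be sorted in some way, in which case you could do a binary search. It does not look like they are, so the best you can do is a linear search: check whether each word in the list appears in a given line.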
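
Since the question says the word list itself is sorted, a binary search over that list (not over the file) is still possible for each token of a line. The following is only a minimal sketch under that assumption; `contains_word`, `keep_line`, and the sample data are hypothetical names introduced for illustration, and tokens are assumed to be separated by whitespace.

```python
from bisect import bisect_left


def contains_word(sorted_words, token):
    """Return True if token occurs in sorted_words (a sorted list), using binary search."""
    i = bisect_left(sorted_words, token)
    return i < len(sorted_words) and sorted_words[i] == token


def keep_line(sorted_words, line):
    """A line is kept only if none of its whitespace-separated tokens is in the list."""
    return not any(contains_word(sorted_words, tok) for tok in line.split())


# Hypothetical sample data; the word list must already be sorted.
sorted_words = ["apple", "kiwi", "pear"]
lines = ["I like pear\n", "nothing here\n"]
kept = [ln for ln in lines if keep_line(sorted_words, ln)]
```

Each lookup costs O(log n) in the size of the word list. A `set` gives average O(1) lookups and is usually the simpler choice when the list fits in memory.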

Source

2012-07-13 18:01:20 user1413793

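You can use intersection from set theory to check whether the words in a line and the word list have anything in common.

list_of_words = []  # fill in the words to filter out
sett = set(list_of_words)
with open(inputfile) as f1, open(outputfile, 'w') as f2:
    for line in f1:
        # skip the line if it shares at least one word with the set
        if len(set(line.split()).intersection(sett)) >= 1:
            pass
        else:
            f2.write(line)

Source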

2012-07-13 18:03:51

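This should be 'open(outputfile, "w")'. Also, the condition is missing 'len' to count the number of members; even shorter is 'set(line.split()) & sett'. – MRAB 2012-07-13 18:47:17

@MRAB Thanks a lot! I completely forgot to write those. I prefer 'intersection()' over '&' because I always forget those symbols. :) – 2012-07-13 18:57:37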

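import re
contents = file.read()
words = sorted(the_list, key=len, reverse=True)  # list.sort() returns None
pattern = r'^.*(%s).*\n' % '|'.join(map(re.escape, words))
stripped_contents = re.sub(pattern, '', contents, flags=re.M)  # re has no replace()

Something like that should work... I don't know whether it will be faster than going line by line.

[Edit] This is untested code and may need some slight tweaks.

Source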

2012-07-13 18:03:53

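You cannot delete the lines in place; you need to write a second file. You can overwrite the old one afterwards (see shutil.copy).

The rest reads like pseudocode: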

forbidden_words = set("these words shall not occur".split())

with open(inputfile) as infile, open(outputfile, 'w+') as outfile:
    # write only the lines in which no whitespace-separated token is forbidden
    outfile.writelines(line for line in infile
                       if not any(word in forbidden_words for word in line.split()))
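
To end up with the filtered content under the original file name, one option is to write to a temporary file and then swap it in. The sketch below uses `os.replace` (Python 3.3+) instead of the `shutil.copy` mentioned above, which is a substitution on my part; `remove_lines_with_words` is a hypothetical helper name, and the file is assumed to be readable with the default text encoding.

```python
import os
import tempfile


def remove_lines_with_words(path, forbidden_words):
    """Rewrite the file at `path` without the lines that contain a forbidden word."""
    forbidden = set(forbidden_words)
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the original so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    try:
        with os.fdopen(fd, "w") as outfile, open(path) as infile:
            for line in infile:
                # Keep the line only if no whitespace-separated token is forbidden.
                if forbidden.isdisjoint(line.split()):
                    outfile.write(line)
        # Swap the filtered copy into place under the original name.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap.
        os.remove(tmp_path)
        raise
```

The original file is read line by line, so memory use does not grow with the file size, and the old content stays intact until the final swap.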

For ways to get rid of false negatives caused by punctuation, see this question.

Source

2012-07-13 19:01:31 moooeeeep

If the source file contains only words separated by whitespace, you can use sets:

words = set(your_words_list)
for line in infile:
    # keep the line only if it shares no token with the word set
    if words.isdisjoint(line.split()):
        outfile.write(line)

Note that this does not handle punctuation: given words = ['foo', 'bar'], a line like foo, bar,stuff will not be removed. To handle that, you need regular expressions:

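import re

# escape each word so that regex metacharacters in it are matched literally
rr = re.compile(r'\b(%s)\b' % '|'.join(map(re.escape, your_words_list)))
for line in infile:
    if not rr.search(line):
        outfile.write(line)

Source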

2012-07-13 19:24:02 georg

Assuming the file is very large, will the regex search cause performance problems? The set operation is good, but punctuation will not be handled in that case. Let me know your thoughts on this. – user1524206 2012-07-14 16:56:41
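
One possible way to keep the set lookups and still cope with punctuation is to tokenize each line with a regular expression instead of `split()`. This is only a sketch under stated assumptions: a "word" is taken to be a run of `\w` characters, matching is case-sensitive, and `filter_lines` and `TOKEN` are hypothetical names not taken from the answers above.

```python
import re

# Assumption: a "word" is a run of letters, digits, or underscores.
TOKEN = re.compile(r"\w+")


def filter_lines(lines, words):
    """Yield only the lines that contain none of `words` as a whole token."""
    forbidden = set(words)
    for line in lines:
        # findall strips punctuation, so "foo," and "bar,stuff" yield foo, bar, stuff.
        if forbidden.isdisjoint(TOKEN.findall(line)):
            yield line


# Hypothetical sample data; in practice `lines` would be an open file object.
sample = ["foo, bar,stuff\n", "food is fine\n", "nothing here\n"]
result = list(filter_lines(sample, ["foo", "bar"]))
```

Because the generator consumes lines lazily, it can be fed an open file directly, and the work per line depends on the number of tokens in the line, not on the length of the word list.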
