Removing words from a file

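I am trying to remove from a text file the words listed in a separate file (the stop words), where the words to be removed are separated by newlines ("\n").

Right now I convert both files to lists so that the elements of each list can be compared. I got this function to run, but it does not remove all of the words I specified in the stop-words file. Any help is greatly appreciated.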

def elimstops(file_str): #takes as input a string for the stopwords file location
    stop_f = open(file_str, 'r')
    stopw = stop_f.read()
    stopw = stopw.split('\n')
    text_file = open('sample.txt') #Opens the file whose stop words will be eliminated
    prime = text_file.read()
    prime = prime.split(' ') #Splits the string into a list separated by a space
    tot_str = "" #total string
    i = 0
    while i < (len(stopw)):
        if stopw[i] in prime:
            prime.remove(stopw[i]) #removes the stopword from the text
        else:
            pass
        i += 1
    # Creates a new string from the compilation of list elements
    # with the stop words removed
    for v in prime:
        tot_str = tot_str + str(v) + " "
    return tot_str

Source

2012-10-22 user1765792

Here is an alternative solution using a generator expression.

tot_str = ' '.join(word for word in prime if word not in stopw)

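To make this more efficient, convert stopw to a set with stopw = set(stopw), so that each membership test no longer scans the whole list.

You may also run into a problem with your current approach if sample.txt is not just a space-separated file. For example, if it contains normal sentences with punctuation, splitting on spaces leaves the punctuation attached to the words. To fix this, you can use the re module to split your string on whitespace or punctuation: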
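A minimal sketch combining the generator expression with the set conversion. The function name `remove_stopwords` and the sample data are made up for illustration; `prime` and `stopw` mirror the names used in the question.

```python
def remove_stopwords(words, stopwords):
    """Return the words joined by spaces, skipping any found in stopwords."""
    stop_set = set(stopwords)  # set membership tests are O(1) on average
    return ' '.join(word for word in words if word not in stop_set)


prime = 'the cat sat on the mat'.split(' ')
stopw = ['the', 'on']
print(remove_stopwords(prime, stopw))  # cat sat mat
```

Unlike the loop in the question, this drops every occurrence of each stop word, not just the first one.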

import re 
prime = re.split(r'\W+', text_file.read())
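To illustrate the difference between the two ways of splitting, here is a small sketch. The helper name `tokenize` is hypothetical. Note that `re.split(r'\W+', ...)` can return empty strings when the text starts or ends with punctuation, so the sketch filters those out.

```python
import re


def tokenize(text):
    """Split text on runs of non-word characters, dropping empty pieces."""
    return [word for word in re.split(r'\W+', text) if word]


text = 'The cat, the dog. The end!'
print(text.split(' '))  # ['The', 'cat,', 'the', 'dog.', 'The', 'end!']
print(tokenize(text))   # ['The', 'cat', 'the', 'dog', 'The', 'end']
```

Two caveats: `\W+` also splits on apostrophes, so "don't" becomes "don" and "t", and matching is still case-sensitive, so "The" will not match a stop word written as "the" unless both sides are lowercased first.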

Source

2012-10-22 16:49:14

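I don't think that's necessary - he is iterating over 'stopw' and removing elements from 'prime'.

@SamMussmann Thanks, I just noticed that. I edited my answer with the theory that punctuation is causing the OP's problem.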

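I don't know Python, but here is a general way to do it in O(n) + O(m) time, i.e. linear.

1: Add all the words from the stopwords file to a map.
2: Read your regular text file and add its words to a list. While doing step 2, check whether the word just read is in the map; if it is, skip it, otherwise add it to the list.

In the end, the list should contain all the words you need, that is, everything except the words you wanted removed.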
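The two steps above could look roughly like this in Python, using a set in place of the map. The function name `filter_words` is hypothetical, and `io.StringIO` objects stand in for the two opened files so the sketch runs on its own.

```python
import io


def filter_words(stopword_lines, text_lines):
    """Keep every word from text_lines that is not listed in stopword_lines."""
    # Step 1: load the stop words (one per line) into a hash-based set.
    stop_set = {line.strip() for line in stopword_lines if line.strip()}
    # Step 2: read the text once, keeping only words that are not in the set.
    kept = []
    for line in text_lines:
        for word in line.split():
            if word not in stop_set:
                kept.append(word)
    return kept


# io.StringIO stands in for the stopwords file and the text file.
stop_file = io.StringIO('the\non\n')
text_file = io.StringIO('the cat sat\non the mat\n')
print(filter_words(stop_file, text_file))  # ['cat', 'sat', 'mat']
```

Each stop word is hashed once and each text word is looked up once, which is where the O(n) + O(m) bound comes from, assuming average-case constant-time set lookups.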

Source

2012-10-22 16:57:07 Adrian

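I think your problem is that this line:

prime.remove(stopw[i]) #removes the stopword from the text

will only remove the first occurrence of stopw[i] from prime. To fix this, you should do this:

while stopw[i] in prime:
    prime.remove(stopw[i]) #removes the stopword from the text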

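However, this is going to run very slowly, because both the in prime test and the prime.remove call have to traverse prime. That means you will end up with a running time that is quadratic in the length of the text. If you use a generator expression as F.J. suggests (with stopw converted to a set), your running time will be linear, which is much better.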
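A short demonstration of the first-occurrence behaviour of `list.remove`, using made-up sample data:

```python
prime = ['the', 'cat', 'and', 'the', 'dog']

once = list(prime)
if 'the' in once:
    once.remove('the')    # removes only the first match
print(once)               # ['cat', 'and', 'the', 'dog']

every = list(prime)
while 'the' in every:
    every.remove('the')   # repeats until no match is left
print(every)              # ['cat', 'and', 'dog']
```

The `while` version gives the right result, but every pass rescans the list from the start, which is the source of the quadratic cost.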

Source

2012-10-22 16:58:40
