Removing a list of stopwords from a Counter in Python

I have a function in NLTK that generates a concordance list, which looks like this:

concordanceList = ["this is a concordance string something", 
       "this is another concordance string blah"]

I have another function that returns a Counter dictionary with the count of each word in concordanceList:

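from collections import Counter

def mostCommonWords(concordanceList):
    finalCount = Counter()
    for line in concordanceList:
        # Count the words on this line, then merge them into the running total.
        words = line.split(" ")
        currentCount = Counter(words)
        finalCount.update(currentCount)
    return finalCount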

My question is how best to remove stopwords from the resulting Counter, so that when I call

mostCommonWords(concordanceList).most_common(10)

the result isn't just {"the": 100, "is": 78, "that": 57}.

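I think preprocessing the text to remove stopwords is out, because I still need the concordance strings to be instances of grammatical language. Basically, I'm asking whether there is a simpler way to do this than creating a Counter of stopwords with the values set low, and then making another Counter, like so: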

stopWordCounter = Counter(the=1, that=1, so=1, and=1) 
processedWordCounter = mostCommonWords(concordanceList) & stopWordCounter

This should set the count of every stopword to 1, but it seems hacky.
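As a quick check of what that intersection actually returns, here is a minimal sketch. The name word_counts is a made-up stand-in for the result of mostCommonWords, not output from a real corpus, and the stopword list is shortened. With Counter, & keeps the minimum of the two counts for each key, so words that are absent from the stopword Counter drop out entirely:

```python
from collections import Counter

# Hypothetical counts standing in for mostCommonWords(concordanceList).
word_counts = Counter({"the": 100, "is": 78, "concordance": 2, "string": 2})

# Shortened stopword Counter; each word gets a count of 1.
stop_word_counter = Counter(["the", "that", "so"])

# Intersection: min(count_a, count_b) per key, keeping only positive results.
intersection = word_counts & stop_word_counter

# Dropping the stopword keys instead keeps every other word with its count.
filtered = Counter({w: c for w, c in word_counts.items()
                    if w not in stop_word_counter})
```

In this sketch the intersection contains only the stopwords that occur in the text, each capped at 1, while filtered keeps the remaining words.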

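Edit: Also, I'm actually having trouble making stopWordCounter, because I get an invalid syntax error if I want to include reserved words like "and". Counters have easy-to-use union and intersection methods, which would make the task fairly simple; are there equivalent methods for dictionaries?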

Source

2013-12-21 TuringTested

RE: your edit about the invalid syntax error. and is reserved, but "and" is a string. You should use Counter(["and"]) to create a Counter containing the string "and". – ChrisP
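To illustrate the comment above, a small sketch of two ways to build the stopword Counter without keyword arguments. The variable names are only for illustration:

```python
from collections import Counter

# Keyword arguments must be valid identifiers, so Counter(and=1) is a
# syntax error. Passing the words as strings avoids the problem.
from_list = Counter(["the", "that", "so", "and"])
from_dict = Counter({"the": 1, "that": 1, "so": 1, "and": 1})
```

Both forms produce the same Counter, with each word mapped to 1.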

You can remove the stopwords during the tokenization step...

from collections import Counter

stop_words = frozenset(['the', 'a', 'is'])

def mostCommonWords(concordanceList):
    finalCount = Counter()
    for line in concordanceList:
        # Drop stopwords while tokenizing each line.
        words = [w for w in line.split(" ") if w not in stop_words]
        finalCount.update(words)  # update final count using the words list
    return finalCount

Source

2013-12-21 20:23:32 ChrisP

Great, thanks. That's exactly what I was after. – TuringTested

First, you don't need to create all those new Counters inside your function; you can do this:

for line in concordanceList: 
    finalCount.update(line.split(" "))

instead.

Second, a Counter is a kind of dictionary, so you can delete items directly:

for sword in stopwords: 
    del yourCounter[sword]

It doesn't matter whether sword is in the Counter; this won't raise an exception either way.
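A minimal sketch of that behavior, using one of the question's sample strings and a hypothetical stopword list. It also shows the contrast with a plain dict, where del raises KeyError for a missing key and pop with a default is the usual workaround:

```python
from collections import Counter

counts = Counter("this is a concordance string something".split())
stopwords = ["this", "is", "a", "the"]  # "the" never occurs in the text

for sword in stopwords:
    del counts[sword]  # Counter ignores keys that are not present

# A plain dict is stricter: del raises KeyError for a missing key.
plain = dict(counts)
try:
    del plain["the"]
    raised = False
except KeyError:
    raised = True

# dict.pop with a default removes the key if present and is silent otherwise.
plain.pop("the", None)
```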

Source

2013-12-21 20:24:32

You have a few options.

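One, don't count the stopwords when you update your Counter. You can do this more concisely, since a Counter's update accepts an iterable as well as another mapping: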

def mostCommonWords(concordanceList):
    finalCount = Counter()
    stopwords = frozenset(['the', 'that', 'so'])
    for line in concordanceList:
        words = line.strip().split(' ')
        finalCount.update([word for word in words if word not in stopwords])
    return finalCount

Alternatively, once you're done, you can use del to actually remove them from the Counter.

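I also added a strip on line before the split. If you use split() with its default behavior of splitting on all whitespace, this isn't necessary, but split(' ') doesn't treat a newline as something to split on, so the last word of each line would have a trailing \n and would be counted as different from any other occurrence. strip gets rid of that.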
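The difference is easy to see on a single made-up line that ends in a newline and contains a double space (the sample string is an assumption for illustration):

```python
line = "this is a  concordance string\n"

# split(" ") splits on single spaces only: the double space yields an
# empty string, and the newline stays attached to the last word.
by_space = line.split(" ")

# split() with no argument splits on runs of any whitespace.
by_whitespace = line.split()

# strip() removes the trailing newline, but the empty string remains.
stripped_then_space = line.strip().split(" ")
```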

Source

2013-12-21 20:24:51

Awesome. I'm coming from Javascript, so I didn't know an empty split() could be so useful... – TuringTested

I'd flatten the items into words, ignore any stopwords, and feed the result to a single Counter instead:

from collections import Counter 
from itertools import chain 

lines = [ 
    "this is a concordance string something", 
    "this is another concordance string blah" 
] 

stops = {'this', 'that', 'a', 'is'}  
words = chain.from_iterable(line.split() for line in lines) 
count = Counter(word for word in words if word not in stops)

Alternatively, the last bit can be done as:

from itertools import ifilterfalse 
count = Counter(ifilterfalse(stops.__contains__, words))
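Note that ifilterfalse exists only in Python 2. On Python 3 the same function is named itertools.filterfalse, so a sketch of the equivalent there, reusing the lines and stops from above, might look like this:

```python
from collections import Counter
from itertools import chain, filterfalse

lines = [
    "this is a concordance string something",
    "this is another concordance string blah",
]

stops = {'this', 'that', 'a', 'is'}

# words is a one-shot iterator, so it can be consumed only once.
words = chain.from_iterable(line.split() for line in lines)

# filterfalse keeps the items for which the predicate is false,
# i.e. the words that are not in stops.
count = Counter(filterfalse(stops.__contains__, words))
```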

Source

2013-12-21 20:28:02

How about:

if 'the' in counter: 
    del counter['the']

Source

2013-12-21 20:30:22 dstromberg

This works, but the stopword list will be 100 words or so, so I can't write out a condition for every word I want to remove/ignore. Thank you. – TuringTested

Personally, I think @JonClements' answer is the most elegant. By the way, NLTK already has a stopwords list, in case the OP doesn't know; see NLTK stopword removal issue

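from collections import Counter
from itertools import chain
from nltk.corpus import stopwords

lines = [
    "this is a concordance string something",
    "this is another concordance string blah"
]

# Requires the stopwords corpus (nltk.download('stopwords')).
# A set makes the membership tests fast.
stops = set(stopwords.words('english'))
words = chain.from_iterable(line.split() for line in lines)
count = Counter(word for word in words if word not in stops)

# Equivalent alternative to the line above. words is a one-shot iterator,
# so rebuild it first if you use this instead:
# from itertools import ifilterfalse  # filterfalse on Python 3
# count = Counter(ifilterfalse(stops.__contains__, words))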

Also, the FreqDist class in NLTK has more NLP-related functionality than collections.Counter. http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html

Source

2013-12-30 15:22:22 alvas
