2017-09-04

CountVectorizer(): 'StreamBackedCorpusView' object has no attribute 'lower'

I am trying to instantiate CountVectorizer() and run it on the NLTK movie reviews corpus, using the code below:

>>> import nltk
>>> import nltk.corpus
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from nltk.corpus import movie_reviews
>>> neg_rev = movie_reviews.fileids('neg')
>>> pos_rev = movie_reviews.fileids('pos')
>>> rev_list = []  # Empty list
>>> for rev in neg_rev:
...     rev_list.append(nltk.corpus.movie_reviews.words(rev))
...
>>> for rev_pos in pos_rev:
...     rev_list.append(nltk.corpus.movie_reviews.words(rev_pos))
...
>>> count_vect = CountVectorizer()
>>> X_count_vect = count_vect.fit_transform(rev_list)

I get the following error:

AttributeError       Traceback (most recent call last) 
<ipython-input-37-00e9047daa67> in <module>() 
----> 1 X_count_vect = count_vect.fit_transform(rev_list) 

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y) 
    837 
    838   vocabulary, X = self._count_vocab(raw_documents, 
--> 839           self.fixed_vocabulary_) 
    840 
    841   if self.binary: 

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab) 
    760   for doc in raw_documents: 
    761    feature_counter = {} 
--> 762    for feature in analyze(doc): 
    763     try: 
    764      feature_idx = vocabulary[feature] 

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(doc) 
    239 
    240    return lambda doc: self._word_ngrams(
--> 241     tokenize(preprocess(self.decode(doc))), stop_words) 
    242 
    243   else: 

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(x) 
    205 
    206   if self.lowercase: 
--> 207    return lambda x: strip_accents(x.lower()) 
    208   else: 
    209    return strip_accents 

AttributeError: 'StreamBackedCorpusView' object has no attribute 'lower' 

nltk.corpus.movie_reviews.words(rev_pos) returns the review already tokenized, for example:

['films', 'adapted', 'from', 'comic', 'books', 'have', ...] 

Can anyone please tell me what I am doing wrong? I assume I am making a mistake in how I build the list of movie reviews (rev_list).

TIA

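You should check the type of `nltk.corpus.movie_reviews.words(rev_pos)`, which is what you are appending to the list. For CountVectorizer to process it, it should be a string, and I don't think it currently is.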
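
A quick way to run the check suggested in this comment is sketched below. The helper name `non_string_documents` is hypothetical and the sample data is made up; the only assumption carried over from the thread is that `CountVectorizer`, with its default settings, expects each document to be a `str`.

```python
def non_string_documents(documents):
    """Return (index, type name) for every document that is not a str."""
    return [
        (index, type(doc).__name__)
        for index, doc in enumerate(documents)
        if not isinstance(doc, str)
    ]


# Made-up sample: one raw string and one pre-tokenized document.
sample = ["films adapted from comic books", ["films", "adapted", "from"]]
print(non_string_documents(sample))  # [(1, 'list')]
```

Run on `rev_list` from the question, this would report every entry, since each one is a corpus view rather than a string.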

Answers

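It looks like your .words() call does not actually give you back a list of tokens, but a StreamBackedCorpusView object, so rev_list ends up as a list of such views. This class lets you retrieve the tokens, but it is not itself a full representation of them, and in particular it is not a string, which is why the call to .lower() fails.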

However, you can retrieve the tokens from the view. For more details on using StreamBackedCorpusView, see the following link.

http://nltk.sourceforge.net/corpusview/corpusview.StreamBackedCorpusView-class.html
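
A minimal sketch of two ways to apply this, assuming scikit-learn is installed and the `movie_reviews` corpus has been downloaded. The helper names `tokens_to_text`, `build_documents`, `vectorize_strings`, and `vectorize_pretokenized` are hypothetical, introduced only for illustration. The first option joins each view's tokens back into a single string, which is what `CountVectorizer` expects by default. The second keeps the tokens and passes a callable `analyzer`, so the default lowercasing and tokenizing steps are not applied to the document.

```python
def tokens_to_text(tokens):
    """Join an iterable of tokens (e.g. a corpus view) into one string."""
    return " ".join(tokens)


def build_documents(corpus, categories=("neg", "pos")):
    """Return one string per review, in category order.

    `corpus` is assumed to behave like nltk.corpus.movie_reviews,
    i.e. to provide fileids(category) and words(fileid).
    """
    documents = []
    for category in categories:
        for fileid in corpus.fileids(category):
            documents.append(tokens_to_text(corpus.words(fileid)))
    return documents


def vectorize_strings(documents):
    """Option 1: fit CountVectorizer on plain strings."""
    from sklearn.feature_extraction.text import CountVectorizer

    count_vect = CountVectorizer()
    return count_vect, count_vect.fit_transform(documents)


def vectorize_pretokenized(token_lists):
    """Option 2: keep the tokens and skip preprocessing and tokenizing."""
    from sklearn.feature_extraction.text import CountVectorizer

    # A callable analyzer receives each document as-is and returns its
    # features, so .lower() is never called on the document itself.
    count_vect = CountVectorizer(analyzer=lambda tokens: tokens)
    return count_vect, count_vect.fit_transform(token_lists)
```

With NLTK, `vectorize_strings(build_documents(movie_reviews))` would stand in for the two loops and the `fit_transform` call in the question. If the original untokenized text is preferred, `movie_reviews.raw(fileid)` returns each review as a string directly.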