
I have this code for computing text similarity with tf-idf in Python:

from sklearn.feature_extraction.text import TfidfVectorizer 

documents = [doc1, doc2] 
tfidf = TfidfVectorizer().fit_transform(documents) 
# rows of the tf-idf matrix are l2-normalized, so this product is the pairwise cosine similarity 
pairwise_similarity = tfidf * tfidf.T 
print pairwise_similarity.A 

The problem is that this code takes plain strings as input, and I want to prepare the documents by removing stop words, stemming, and tokenizing them. So the input would be a list of tokens per document. When I call documents = [doc1, doc2] with the tokenized documents, the error is:

Traceback (most recent call last): 
    File "C:\Users\tasos\Desktop\my thesis\beta\similarity.py", line 18, in <module> 
    tfidf = TfidfVectorizer().fit_transform(documents) 
    File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 1219, in fit_transform 
    X = super(TfidfVectorizer, self).fit_transform(raw_documents) 
    File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 780, in fit_transform 
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary) 
    File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 715, in _count_vocab 
    for feature in analyze(doc): 
    File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 229, in <lambda> 
    tokenize(preprocess(self.decode(doc))), stop_words) 
    File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 195, in <lambda> 
    return lambda x: strip_accents(x.lower()) 
AttributeError: 'unicode' object has no attribute 'apply_freq_filter' 

Is there a way to modify the code so that it accepts lists, or do I have to turn the tokenized documents back into strings?
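
For reference, the preprocessing I mean looks roughly like this (a minimal sketch using NLTK; the stop-word list, stemmer, and sample sentences here are just illustrative, not my actual code):

from nltk.corpus import stopwords 
from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 

stop = set(stopwords.words('english')) 
stemmer = PorterStemmer() 

def preprocess(text): 
    # tokenize, drop stop words, stem: the result is a list of tokens, not a string 
    return [stemmer.stem(t) for t in word_tokenize(text.lower()) if t not in stop] 

doc1 = "The cat sat on the mat." 
doc2 = "A dog sat on a log." 
documents = [preprocess(doc1), preprocess(doc2)]  # a list of token lists 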


It looks like you left out the actual error message (you've included the traceback, but not the error that was raised). –


Oops. I've edited it. – Tasos


@Tasos Did my answer work, or do you still have a problem? If my solution doesn't work, could you give a simple example of 'doc1'/'doc2'? – chlunde

Answers


Try skipping the lowercasing preprocessing step and supplying your own no-op tokenizer:

# the identity tokenizer passes the pre-tokenized lists through unchanged 
tfidf = TfidfVectorizer(tokenizer=lambda doc: doc, lowercase=False).fit_transform(documents) 

You should also look at the other parameters, such as stop_words, to avoid duplicating work from your preprocessing.
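
For completeness, here is a minimal end-to-end sketch of this approach; the sample token lists are illustrative placeholders, not data from the question:

from sklearn.feature_extraction.text import TfidfVectorizer 

# hypothetical pre-tokenized documents, e.g. after stop-word removal and stemming 
documents = [ 
    ['cat', 'sat', 'mat'], 
    ['dog', 'sat', 'log'], 
] 

# identity tokenizer plus lowercase=False lets the vectorizer consume token lists as-is 
tfidf = TfidfVectorizer(tokenizer=lambda doc: doc, lowercase=False).fit_transform(documents) 

# rows are l2-normalized, so the product is the pairwise cosine similarity matrix 
pairwise_similarity = tfidf * tfidf.T 
print(pairwise_similarity.A) 

The diagonal entries should be 1.0, and the off-diagonal entries give the similarity between the two documents.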