I know the OP wanted to create a TDM in NLTK, but the textmining package (`pip install textmining`) makes it straightforward:
import textmining

def termdocumentmatrix_example():
    # Create some very short sample documents
    doc1 = 'John and Bob are brothers.'
    doc2 = 'John went to the store. The store was closed.'
    doc3 = 'Bob went to the store too.'
    # Initialize class to create term-document matrix
    tdm = textmining.TermDocumentMatrix()
    # Add the documents
    tdm.add_doc(doc1)
    tdm.add_doc(doc2)
    tdm.add_doc(doc3)
    # Write out the matrix to a csv file. Note that setting cutoff=1 means
    # that words which appear in 1 or more documents will be included in
    # the output (i.e. every word will appear in the output). The default
    # for cutoff is 2, since we usually aren't interested in words which
    # appear in a single document. For this example we want to see all
    # words however, hence cutoff=1.
    tdm.write_csv('matrix.csv', cutoff=1)
    # Instead of writing out the matrix you can also access its rows directly.
    # The first row is the list of terms; each following row is one document.
    for row in tdm.rows(cutoff=1):
        print(row)

termdocumentmatrix_example()
Output:
['and', 'the', 'brothers', 'to', 'are', 'closed', 'bob', 'john', 'was', 'went', 'store', 'too']
[1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0]
[0, 2, 0, 1, 0, 1, 0, 1, 1, 1, 2, 0]
[0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]
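If installing an extra package is not an option, the same kind of count matrix can be sketched with the standard library alone. The function name `build_term_document_matrix` and the regex tokenizer are assumptions made for illustration; they are not how textmining itself tokenizes, and the columns come out in alphabetical order rather than the order shown above.

```python
import re
from collections import Counter

def build_term_document_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in docs[i]."""
    # Lowercase and keep alphabetic runs only. This is a simplifying
    # assumption, not necessarily the tokenizer textmining uses.
    counts = [Counter(re.findall(r"[a-z]+", doc.lower())) for doc in docs]
    # The vocabulary is every term seen in any document, sorted alphabetically.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for missing keys, so absent terms become zeros.
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

docs = [
    'John and Bob are brothers.',
    'John went to the store. The store was closed.',
    'Bob went to the store too.',
]
vocabulary, rows = build_term_document_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```

This keeps one row per document, like the output above, so transposing it gives the terms-as-rows layout if that is what you need.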
Alternatively, you can use pandas and sklearn [source]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
docs = ['why hello there', 'omg hello pony', 'she went there? omg']
vec = CountVectorizer()
X = vec.fit_transform(docs)
# On scikit-learn older than 1.0, use vec.get_feature_names() instead.
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(df)
Output:
   hello  omg  pony  she  there  went  why
0      1    0     0    0      1     0    1
1      1    1     1    0      0     0    0
2      0    1     0    1      1     1    0
Have you tried `gensim` or a similar library that already has optimized tf-idf code? http://radimrehurek.com/gensim/ – alvas 2013-04-09 14:43:07
4000 documents is a small corpus. You need a [sparse](https://en.wikipedia.org/wiki/Sparse_matrix) representation. pandas has one, as do Gensim and scikit-learn. – 2013-04-09 15:03:53
I thought `pd.get_dummies(df_column)` could do the job. Maybe I'm missing something about document-term matrices. – 2015-11-06 05:00:04
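On the `pd.get_dummies` question, a small sketch of how its output differs from a document-term matrix may help. The sample documents are reused from the answer, and the variable names `per_document` and `per_token` are mine:

```python
import pandas as pd

docs = pd.Series(['John went to the store. The store was closed.',
                  'Bob went to the store too.'])

# pd.get_dummies on a column of raw text treats each whole string as one
# category, so it yields one column per distinct document, not per term.
per_document = pd.get_dummies(docs)

# Series.str.get_dummies splits on a separator first, but it only records
# presence (0/1), so a word repeated within a document is not counted twice.
per_token = docs.str.lower().str.get_dummies(sep=' ')

print(per_document.shape)
print(per_token['the'].tolist())
```

The first document contains "the" twice, yet its indicator is 1, whereas a count-based matrix would record 2. Splitting on spaces also leaves punctuation attached to tokens such as "store.", which a proper tokenizer would strip.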