2014-12-02 45 views


from sklearn.feature_extraction.text import CountVectorizer 

corpus =['''Computer science is the scientific and 
practical approach to computation and its applications.''' 
#this is another opinion 
'''It is the systematic study of the feasibility, structure, 
expression, and mechanization of the methodical 
procedures that underlie the acquisition, 
representation, processing, storage, communication of, 
and access to information, whether such information is encoded 
as bits in a computer memory or transcribed in genes and 
protein structures in a biological cell.''' 
'''A computer scientist specializes in the theory of 
computation and the design of computational systems'''] 

vectorizer = CountVectorizer(analyzer='word') 

X = vectorizer.fit_transform(corpus) 

print(X)


(0, 12) 3 
    (0, 33) 1 
    (0, 20) 3 
    (0, 45) 7 
    (0, 34) 1 
    (0, 2) 6 
    (0, 28) 1 
    (0, 4) 1 
    (0, 47) 2 
    (0, 10) 2 
    (0, 22) 1 
    (0, 3) 1 
    (0, 21) 1 
    (0, 42) 1 
    (0, 40) 1 
    (0, 26) 5 
    (0, 16) 1 
    (0, 38) 1 
    (0, 15) 1 
    (0, 23) 1 
    (0, 25) 1 
    (0, 29) 1 
    (0, 44) 1 
    (0, 49) 1 
    (0, 1) 1 
    : : 
    (0, 30) 1 
    (0, 37) 1 
    (0, 9) 1 
    (0, 0) 1 
    (0, 19) 2 
    (0, 50) 1 
    (0, 41) 1 
    (0, 14) 1 
    (0, 5) 1 
    (0, 7) 1 
    (0, 18) 4 
    (0, 24) 1 
    (0, 27) 1 
    (0, 48) 1 
    (0, 17) 1 
    (0, 31) 1 
    (0, 39) 1 
    (0, 6) 1 
    (0, 8) 1 
    (0, 35) 1 
    (0, 36) 1 
    (0, 46) 1 
    (0, 13) 1 
    (0, 11) 1 
    (0, 43) 1 


print(X.toarray())


[[1 1 6 1 1 1 1 1 1 1 2 1 3 1 1 1 1 1 4 2 3 1 1 1 1 1 5 1 1 1 1 1 1 1 1 1 1 
    1 1 1 1 1 1 1 1 7 1 2 1 1 1]] 

You may want to read about the vector space model in the Manning & Schuetze book: http://nlp.stanford.edu/IR-book/pdf/06vect.pdf – mbatchkarov 2014-12-02 13:51:05




from sklearn.feature_extraction.text import CountVectorizer 

corpus =['''computer hardware''', 
'''computer data and software data'''] 

vectorizer = CountVectorizer(analyzer='word') 

X = vectorizer.fit_transform(corpus) 

print(X)

print(X.toarray())


      | and | computer | data | hardware | software
doc 0 |     |    1     |      |    1     |
doc 1 |  1  |    1     |  2   |          |    1


(1, 0) 1 
    (0, 1) 1 
    (1, 1) 1 
    (1, 2) 2 
    (0, 3) 1 
    (1, 4) 1 
[[0 1 0 1 0] 
[1 1 2 0 1]] 
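
To make the link between the table and the two printed forms concrete, here is a minimal plain-Python sketch that rebuilds the same counts by hand. It is not how scikit-learn is implemented: the helper name `count_matrix` is hypothetical, and the tokenization is an assumption meant to mimic the default word analyzer (lowercase, tokens of two or more word characters, columns in sorted order).

```python
import re
from collections import Counter


def count_matrix(docs):
    """Hypothetical helper: build a dense document-term count matrix by hand."""
    # Assumed tokenization: lowercase, keep tokens of 2+ word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # One column per unique token, in sorted order
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # a missing token counts as 0
        rows.append([counts[term] for term in vocab])
    return vocab, rows


corpus = ["computer hardware", "computer data and software data"]
vocab, dense = count_matrix(corpus)
print(vocab)   # column labels
print(dense)   # one row per document, like X.toarray()

# Sparse view: keep only the non-zero cells as (row, column, count)
sparse = [(i, j, c) for i, row in enumerate(dense) for j, c in enumerate(row) if c]
print(sparse)
```

Each `(row, column, count)` triple corresponds to one line of the `print X` output: the row is the document index, the column is the position of the word in the sorted vocabulary.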



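Thanks for the feedback. What about the other vectorization tools in scikit-learn (e.g. FeatureHasher, Tf-idf, etc.)? Do these vectorizers also return a document matrix, or does the returned matrix depend on the vectorizer chosen? – tumbleweed 2014-12-03 05:44:57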


@ml_guy Yes, it depends on the vectorizer and its parameters. Take a look at the [feature extraction page](http://scikit-learn.org/stable/modules/feature_extraction.html). – 2014-12-03 23:53:38



>>> vectorizer.get_feature_names() 
[u'access', u'acquisition', u'and', u'applications', u'approach', u'as', u'biological', u'bits', u'cell', u'communication', u'computation', u'computational', u'computer', u'design', u'encoded', u'expression', u'feasibility', u'genes', u'in', u'information', u'is', u'it', u'its', u'mechanization', u'memory', u'methodical', u'of', u'or', u'practical', u'procedures', u'processing', u'protein', u'representation', u'science', u'scientific', u'scientist', u'specializes', u'storage', u'structure', u'structures', u'study', u'such', u'systematic', u'systems', u'that', u'the', u'theory', u'to', u'transcribed', u'underlie', u'whether'] 
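
The column index in each `(row, column) count` line is a position in this feature-name list. As a small sketch of how to read the output, the snippet below looks up a few entries copied from the printed sparse matrix; the variable names `entries` and `decoded` are only illustrative.

```python
# Feature names as printed by get_feature_names(); column index = list position
feature_names = [
    'access', 'acquisition', 'and', 'applications', 'approach', 'as',
    'biological', 'bits', 'cell', 'communication', 'computation',
    'computational', 'computer', 'design', 'encoded', 'expression',
    'feasibility', 'genes', 'in', 'information', 'is', 'it', 'its',
    'mechanization', 'memory', 'methodical', 'of', 'or', 'practical',
    'procedures', 'processing', 'protein', 'representation', 'science',
    'scientific', 'scientist', 'specializes', 'storage', 'structure',
    'structures', 'study', 'such', 'systematic', 'systems', 'that', 'the',
    'theory', 'to', 'transcribed', 'underlie', 'whether',
]

# A few (row, column, count) entries copied from the printed sparse matrix
entries = [(0, 12, 3), (0, 45, 7), (0, 2, 6), (0, 26, 5)]

# Replace each column index with the word it stands for
decoded = [(feature_names[col], count) for _row, col, count in entries]
print(decoded)  # [('computer', 3), ('the', 7), ('and', 6), ('of', 5)]
```

Every entry has row 0 because the whole text was treated as a single document.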


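By the way, one source of confusion may be the missing comma near #anotherone (the "#this is another opinion" comment in the question's code): without it the adjacent strings are concatenated, so corpus is just a list containing a single string.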
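
A minimal sketch of that behaviour, using placeholder strings rather than the original text: Python joins adjacent string literals when no comma separates them, and a comment between them does not change that.

```python
# No commas: the three literals are concatenated into ONE list element
broken = ['''first document'''
          # a comment here does not separate the literals
          '''second document'''
          '''third document''']

# With commas: three separate list elements, i.e. three documents
fixed = ['''first document''',
         '''second document''',
         '''third document''']

print(len(broken))  # 1
print(len(fixed))   # 3
print(broken[0])    # first documentsecond documentthird document
```

With the commas in place, the vectorizer would see three documents and return a matrix with three rows instead of one.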