首先让我们提取每学期的TF-IDF得分每份文件:解读跨文档的单词的TF-IDF得分的总和
from gensim import corpora, models, similarities
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
打印出来:
for doc in corpus_tfidf:
print doc
[OUT]:
[(0, 0.4301019571350565), (1, 0.4301019571350565), (2, 0.4301019571350565), (3, 0.4301019571350565), (4, 0.2944198962221451), (5, 0.2944198962221451), (6, 0.2944198962221451)]
[(4, 0.3726494271826947), (7, 0.27219160459794917), (8, 0.3726494271826947), (9, 0.27219160459794917), (10, 0.3726494271826947), (11, 0.5443832091958983), (12, 0.3726494271826947)]
[(6, 0.438482464916089), (7, 0.32027755044706185), (9, 0.32027755044706185), (13, 0.6405551008941237), (14, 0.438482464916089)]
[(5, 0.3449874408519962), (7, 0.5039733231394895), (14, 0.3449874408519962), (15, 0.5039733231394895), (16, 0.5039733231394895)]
[(9, 0.21953536176370683), (10, 0.30055933182961736), (12, 0.30055933182961736), (17, 0.43907072352741366), (18, 0.43907072352741366), (19, 0.43907072352741366), (20, 0.43907072352741366)]
[(21, 0.48507125007266594), (22, 0.48507125007266594), (23, 0.48507125007266594), (24, 0.48507125007266594), (25, 0.24253562503633297)]
[(25, 0.31622776601683794), (26, 0.31622776601683794), (27, 0.6324555320336759), (28, 0.6324555320336759)]
[(25, 0.20466057569885868), (26, 0.20466057569885868), (29, 0.2801947048062438), (30, 0.40932115139771735), (31, 0.40932115139771735), (32, 0.40932115139771735), (33, 0.40932115139771735), (34, 0.40932115139771735)]
[(8, 0.6282580468670046), (26, 0.45889394536615247), (29, 0.6282580468670046)]
如果我们要找到“显着性”或者这个躯体内的话,“重要性”,CAN我们简单地将所有文档中tf-idf分数的总和除以文档数量? I.e.
>>> tfidf_saliency = Counter()
>>> for doc in corpus_tfidf:
... for word, score in doc:
... tfidf_saliency[word] += score/len(corpus_tfidf)
...
>>> tfidf_saliency
Counter({7: 0.12182694202050007, 8: 0.11121194156107769, 26: 0.10886469856464989, 29: 0.10093919463036093, 9: 0.09022272408985754, 14: 0.08705221175200946, 25: 0.08482488519466996, 6: 0.08143359568202602, 10: 0.07480097322359022, 12: 0.07480097322359022, 4: 0.07411881371164887, 13: 0.07117278898823597, 5: 0.07104525967490458, 27: 0.07027283689263066, 28: 0.07027283689263066, 11: 0.060487023243988705, 15: 0.055997035904387725, 16: 0.055997035904387725, 21: 0.05389680556362955, 22: 0.05389680556362955, 23: 0.05389680556362955, 24: 0.05389680556362955, 17: 0.048785635947490406, 18: 0.048785635947490406, 19: 0.048785635947490406, 20: 0.048785635947490406, 0: 0.04778910634833961, 1: 0.04778910634833961, 2: 0.04778910634833961, 3: 0.04778910634833961, 30: 0.045480127933079706, 31: 0.045480127933079706, 32: 0.045480127933079706, 33: 0.045480127933079706, 34: 0.045480127933079706})
望着输出,我们可以假设,在语料库中最“显眼”的字是:
>>> dictionary[7]
u'system'
>>> dictionary[8]
u'survey'
>>> dictionary[26]
u'graph'
如果是这样,什么是TF-IDF的总和的数学解释文档中的分数是多少?
你可以请你的'字典'的输出附加到你的问题。我想和我的'词典'做比较,这样我就可以在我的答案中更新我的输出表。 – stovfl
哎呀,对不起,我没有保存它。字典会有所不同,因为我使用的是Python3,而且字典不一样,如果我重新运行它。但是单词的排名应该是确定性的,因为它基于静态计数,重新运行gensim代码,您应该得到与前3相同的“系统,调查,图表”。 – alvas
对不起,不能使用gensim。 – stovfl