我正在使用gensim
库通过doc2vec
模型在Python中构建NLP聊天应用程序。我有硬编码的文档并给出了一组训练示例,我通过抛出用户问题来测试模型,然后找到大多数类似的文档作为第一步。在这种情况下,我的测试问题是来自培训示例的文档的精确副本。如何提高doc2vec模型中两个文档(句子)的余弦相似度?
import gensim
from gensim import models
sentence = models.doc2vec.LabeledSentence(words=[u'sampling',u'what',u'is',u'tell',u'me',u'about'],tags=["SENT_0"])
sentence1 = models.doc2vec.LabeledSentence(words=[u'eligibility',u'what',u'is',u'my',u'limit',u'how',u'much',u'can',u'I',u'claim'],tags=["SENT_1"])
sentence2 = models.doc2vec.LabeledSentence(words=[u'eligibility',u'I',u'am',u'retiring',u'how',u'much',u'can',u'claim',u'have', u'resigned'],tags=["SENT_2"])
sentence3 = models.doc2vec.LabeledSentence(words=[u'what',u'is',u'my',u'eligibility',u'post',u'my',u'promotion'],tags=["SENT_3"])
sentence4 = models.doc2vec.LabeledSentence(words=[u'what',u'is', u'my',u'eligibility' u'post',u'my',u'promotion'], tags=["SENT_4"])
sentences = [sentence, sentence1, sentence2, sentence3, sentence4]
class LabeledLineSentence(object):
def __init__(self, filename):
self.filename = filename
def __iter__(self):
for uid, line in enumerate(open(filename)):
yield LabeledSentence(words=line.split(), labels=['SENT_%s' % uid])
model = models.Doc2Vec(alpha=0.03, min_alpha=.025, min_count=2)
model.build_vocab(sentences)
for epoch in range(30):
model.train(sentences, total_examples=model.corpus_count, epochs = model.iter)
model.alpha -= 0.002 # decrease the learning rate`
model.min_alpha = model.alpha # fix the learning rate, no decay
model.save("my_model.doc2vec")
model_loaded = models.Doc2Vec.load('my_model.doc2vec')
print (model_loaded.docvecs.most_similar(["SENT_4"]))
结果:
[('SENT_1', 0.043695494532585144), ('SENT_2', 0.0017897281795740128), ('SENT_0', -0.018954679369926453), ('SENT_3', -0.08253869414329529)]
的SENT_4
和SENT_3
相似度只有-0.08253869414329529
当它应该是1,因为它们是完全一样的。我应该如何提高这个准确度?有没有培训文件的具体方式,我错过了什么?
所有SENT_4和SENT_3首先是不完全一样的。看看单词“promotion” –
纠正后:[('SENT_1',0.043695494532585144),('SENT_2',0.0017897281795740128),('SENT_0',-0.018954679369926453),('SENT_3',-0.08253869414329529)] – Kshitiz