Gensim Doc2Vec model only generates a limited number of vectors

I am using the gensim Doc2Vec model to generate my feature vectors. Here is the code I am using (I have explained my problem in the code comments):

import multiprocessing
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

cores = multiprocessing.cpu_count()

# creating a list of tagged documents 
training_docs = [] 

# all_docs: a list of 53 strings which are my documents and are very long (not just a couple of sentences) 
for index, doc in enumerate(all_docs): 
    # 'doc' is in unicode format and I have already preprocessed it 
    training_docs.append(TaggedDocument(doc.split(), str(index+1))) 

# at this point, I have 53 strings in my 'training_docs' list 

model = Doc2Vec(training_docs, size=400, window=8, min_count=1, workers=cores) 

# now that I print the vectors, I only have 10 vectors while I should have 53 vectors for the 53 documents that I have in my training_docs list. 
print(len(model.docvecs)) 
# output: 10

I am wondering whether I made a mistake, or whether there is another parameter I should set?

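Update: I experimented with the tags parameter of TaggedDocument. When I changed it to a mixture of text and numbers, such as Doc1, Doc2, ..., I got a different number of generated vectors, but still not the number of feature vectors I expected.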

Source

2017-08-02 Pedram

Look at the actual tags it has discovered in your corpus:

print(model.docvecs.offset2doctag)

Do you see a pattern?

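The tags property of each document should be a list of tags, not a single tag. If you supply a plain string of an integer, it is treated as a list of digits, so the model only learns the tags '0', '1', ..., '9'.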

You could replace str(index+1) with [str(index+1)] and get the behavior you expected.
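The effect can be reproduced in plain Python, without gensim: iterating over a string yields its characters, so a bare string tag is read as one tag per character. This is only a minimal sketch; the variable names are illustrative, and the count of 53 is taken from the question.

```python
# Tags as the question built them: one bare string per document.
string_tags = [str(index + 1) for index in range(53)]

# Iterating over a string yields its characters, so every digit is seen as a tag.
seen_as_tags = sorted({tag for tags in string_tags for tag in tags})
print(seen_as_tags)       # ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
print(len(seen_as_tags))  # 10

# Wrapping each string in a list keeps it whole: one tag per document.
list_tags = [[str(index + 1)] for index in range(53)]
unique_tags = {tag for tags in list_tags for tag in tags}
print(len(unique_tags))   # 53
```

The ten distinct characters match the ten vectors reported in the question.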

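However, since your document IDs are just ascending integers, you can also use plain Python ints as your doctags. This saves some memory by avoiding the creation of a lookup dict from string tag to array slot (int). To do this, replace str(index+1) with [index]. (This starts the doc IDs from 0, which is a teensy bit more Pythonic, and also avoids wasting an unused 0 position in the raw array that holds the trained vectors.)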

Source

2017-08-03 01:29:10 gojomo
