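I was reading up on similarity measures and suddenly my whole world collapsed. I have implemented a search engine using clustering techniques. For clustering I use K-means with Euclidean distance as the distance measure. I also use cosine similarity to display the results, and I am getting amazingly accurate results. But now I read this: what I am doing is normalizing the document vectors and computing the Euclidean distance between two vectors, so I am not taking magnitude into account anywhere. Euclidean distance or cosine similarity?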

What am I doing wrong?

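Although I think a higher term frequency already shows up as a higher tf-idf value and a higher normalized tf-idf value, so such documents should still be ranked appropriately higher.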

Results (non-normalized vectors, Euclidean distance)

61.79689257425985 222Proposed Research Details.doc 
144.15451315901478 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc 
72.61392308146608 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc 
72.96125277156261 done_Management strategies for impriing rabi (SKN Math).doc 
65.51734241367222 done_RPFIII_dr.dogra.doc 
66.72042766100921 Evaluation of crops and their varieties (SKN Math).doc 
418.8868087170988 P. VIJAYA KUMAR (DSS).doc 
140.3914521621597 RPF - I PIMS-ICAR project proposal for IASRI.doc 
72.95414421468679 RPF-III__Indo-US_project.doc 
82.25126123574397 220Introduction and objectives.doc 

Results (normalized vectors, Euclidean distance)

1.3435369899385359 222Proposed Research Details.doc 
1.1277471087250086 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc 
1.2741267093494966 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc 
1.264154265747389 done_Management strategies for impriing rabi (SKN Math).doc 
1.2902191708899362 done_RPFIII_dr.dogra.doc 
1.3128744973475515 Evaluation of crops and their varieties (SKN Math).doc 
0.4924243033927417 P. VIJAYA KUMAR (DSS).doc 
1.1747048933792805 RPF - I PIMS-ICAR project proposal for IASRI.doc 
1.29150899172647 RPF-III__Indo-US_project.doc 
1.318016051789028 220Introduction and objectives.doc 

Results (cosine similarity)

0.09745417833344654 222Proposed Research Details.doc 
0.36409322938119104 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc 
0.1883005642611103 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc 
0.2009569961963377 done_Management strategies for impriing rabi (SKN Math).doc 
0.16766724553404047 done_RPFIII_dr.dogra.doc 
0.13818027710720598 Evaluation of crops and their varieties (SKN Math).doc 
0.8787591527140649 P. VIJAYA KUMAR (DSS).doc 
0.3100342067353838 RPF - I PIMS-ICAR project proposal for IASRI.doc 
0.16600226214483405 RPF-III__Indo-US_project.doc 
0.13141684361322944 220Introduction and objectives.doc 

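Result lists 1 and 2 disagree, whereas 2 and 3 agree strongly: the higher the similarity, the smaller the distance. Each value is measured between the cluster centroid vector and the document vector of the listed document.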

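In fact, the strangest result is the document with a Euclidean distance of 418, which also has the highest similarity, 0.87. After normalization its distance becomes 0.49, which agrees with the similarity.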

On stats: http://stats.stackexchange.com/questions/35076/euclidean-distance-euclidean-distance-between-unit-vectors-or-cosine-similarity –

This question has been cross-posted on [Cross Validated](http://stats.stackexchange.com/questions/35076/euclidean-distance-bt-unit-vectors-or-cosine-similarity-where-vectors-are-docum), where it is a better fit. – BoltClock

Answer

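As far as I remember from my information retrieval lectures, once both vectors are normalized, Euclidean distance and cosine similarity produce the same ranking in reverse order: sorting by ascending distance gives the same result as sorting by descending similarity.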
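
This can be checked against the numbers in the question. For unit vectors a and b, ‖a − b‖² = 2 − 2·cos(a, b), so the distance is a decreasing function of the similarity. The sketch below is only an illustration: it copies the ten cosine similarities and the ten normalized-vector distances from the question, and the variable and function names (`cosine_sims`, `normalized_dists`, `unit_distance_from_cosine`) are made up for this example.

```python
import math

# Cosine similarities, copied from the third result list in the question.
cosine_sims = [
    0.09745417833344654, 0.36409322938119104, 0.1883005642611103,
    0.2009569961963377, 0.16766724553404047, 0.13818027710720598,
    0.8787591527140649, 0.3100342067353838, 0.16600226214483405,
    0.13141684361322944,
]

# Euclidean distances on normalized vectors, copied from the second list.
normalized_dists = [
    1.3435369899385359, 1.1277471087250086, 1.2741267093494966,
    1.264154265747389, 1.2902191708899362, 1.3128744973475515,
    0.4924243033927417, 1.1747048933792805, 1.29150899172647,
    1.318016051789028,
]


def unit_distance_from_cosine(cos_sim):
    """Euclidean distance between two unit vectors with the given cosine."""
    return math.sqrt(2.0 * (1.0 - cos_sim))


# Distance predicted from each similarity, and the worst mismatch
# against the distances reported in the question.
predicted = [unit_distance_from_cosine(c) for c in cosine_sims]
max_error = max(abs(p - d) for p, d in zip(predicted, normalized_dists))

# Document indices ranked by ascending distance and by descending similarity.
by_distance = sorted(range(len(normalized_dists)),
                     key=lambda i: normalized_dists[i])
by_similarity = sorted(range(len(cosine_sims)),
                       key=lambda i: cosine_sims[i], reverse=True)
same_ranking = by_distance == by_similarity

print("largest mismatch:", max_error)
print("same ranking:", same_ranking)
```

If the two lists really are related by that identity, the mismatch should be negligible and both rankings should put the same document first. For non-normalized vectors the squared distance is ‖a‖² + ‖b‖² − 2‖a‖‖b‖·cos(a, b), so the vector lengths enter as well, which is why the first result list can disagree with the other two.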