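Impact of the "_id" field on search in Elasticsearch?

I'm having some trouble with Elasticsearch. I managed to build a reproducible example on my machine; the code is at the end of the post.

I simply create 6 users, "Roger Sand", "Roger Gilbert", "Cindy Sand", "Cindy Gilbert", "Jean-Roger Sands" and "Sand Roger", and index them by their full name.

Then I run a query that matches "Roger Sand" and print the associated scores.

Below are two runs of the same script with two different sets of IDs: 84046 to 84051 and 84047 to 84052 (just shifted by 1).

The results do not come back in the same order and do not have the same scores:

Run with 84046..84051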

Sand Roger => 0.8838835 
Roger Sand => 0.2712221 
Cindy Sand => 0.22097087 
Jean-Roger Sands => 0.17677669 
Roger Gilbert => 0.028130025

Run with 84047..84052

Roger Sand => 0.2712221 
Sand Roger => 0.2712221 
Cindy Sand => 0.22097087 
Jean-Roger Sands => 0.17677669 
Roger Gilbert => 0.15891947

My question is: why does the "_id" have any effect on a search by "full_name"?

Here is the complete, reproducible script in Ruby.

require 'elasticsearch'

first_id = 84046 # or 84047

client = Elasticsearch::Client.new(:log => true)
client.transport.reload_connections!

# Start from a fresh "test" index (default settings)
client.indices.delete({ :index => 'test' })
client.indices.create({ :index => 'test' })
client.perform_request('POST', 'test/_refresh')

# Index the six users with consecutive IDs starting at first_id
["Roger Sand", "Roger Gilbert", "Cindy Sand", "Cindy Gilbert", "Jean-Roger Sands", "Sand Roger"].each_with_index do |name, i|
  client.create({
    :index => 'test', :type => 'user',
    :id    => first_id + i,
    :body  => { :full_name => name }
  })
end

query_options = {
  :index => 'test', :type => 'user',
  :body  => {
    :query => { :match => { :full_name => "Roger Sand" } }
  }
}

# Make the newly indexed documents visible to search
client.perform_request('POST', 'test/_refresh')

# Print each hit with its score
client.search(query_options)["hits"]["hits"].each do |hit|
  $stderr.puts "#{hit["_source"]["full_name"]} => #{hit["_score"]}"
end

Here is the command-line equivalent:

curl -XDELETE 'http://localhost:9200/test' 
curl -XPUT 'http://localhost:9200/test' 
curl -XPOST 'http://localhost:9200/test/_refresh' 
curl -XPUT 'http://localhost:9200/test/user/84047?op_type=create' -d '{"full_name":"Roger Sand"}' 
curl -XPUT 'http://localhost:9200/test/user/84048?op_type=create' -d '{"full_name":"Roger Gilbert"}' 
curl -XPUT 'http://localhost:9200/test/user/84049?op_type=create' -d '{"full_name":"Cindy Sand"}' 
curl -XPUT 'http://localhost:9200/test/user/84050?op_type=create' -d '{"full_name":"Cindy Gilbert"}' 
curl -XPUT 'http://localhost:9200/test/user/84051?op_type=create' -d '{"full_name":"Jean-Roger Sands"}' 
curl -XPUT 'http://localhost:9200/test/user/84052?op_type=create' -d '{"full_name":"Sand Roger"}' 
curl -XPOST 'http://localhost:9200/test/_refresh' 
curl -XPOST 'http://localhost:9200/test/user/_search?pretty' -d '{"query":{"match":{"full_name":"Roger Sand"}}}' 


curl -XDELETE 'http://localhost:9200/test' 
curl -XPUT 'http://localhost:9200/test' 
curl -XPOST 'http://localhost:9200/test/_refresh' 
curl -XPUT 'http://localhost:9200/test/user/84046?op_type=create' -d '{"full_name":"Roger Sand"}' 
curl -XPUT 'http://localhost:9200/test/user/84047?op_type=create' -d '{"full_name":"Roger Gilbert"}' 
curl -XPUT 'http://localhost:9200/test/user/84048?op_type=create' -d '{"full_name":"Cindy Sand"}' 
curl -XPUT 'http://localhost:9200/test/user/84049?op_type=create' -d '{"full_name":"Cindy Gilbert"}' 
curl -XPUT 'http://localhost:9200/test/user/84050?op_type=create' -d '{"full_name":"Jean-Roger Sands"}' 
curl -XPUT 'http://localhost:9200/test/user/84051?op_type=create' -d '{"full_name":"Sand Roger"}' 
curl -XPOST 'http://localhost:9200/test/_refresh' 
curl -XPOST 'http://localhost:9200/test/user/_search?pretty' -d '{"query":{"match":{"full_name":"Roger Sand"}}}'


2014-01-29 pierallard

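The problem lies in distributed score calculation.

You created a new index with the default settings, i.e. 5 shards. Each shard is its own Lucene index. When you index your data, Elasticsearch has to decide which shard each document goes to, and it does so by hashing the _id (in the absence of a routing parameter).

So, by shifting the IDs, you end up distributing the documents across the shards differently. As said above, each shard is its own Lucene index, so each shard scores its hits using only its own local term statistics. When you search across multiple shards, the scores from the individual shards have to be combined, and because the routing differs between your two runs, the individual scores differ as well.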
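
As a rough illustration of the routing step described above, the sketch below assigns each document to a shard as hash(_id) modulo the number of shards. The hash is an assumption: CRC32 is used only because it ships with Ruby, and the function Elasticsearch really uses is different, so the shard numbers printed here will not match a real cluster. The names shard_for and placement are hypothetical helpers. The point is only that shifting every ID by 1 typically regroups the documents.

```ruby
require 'zlib'

NUMBER_OF_SHARDS = 5 # the default mentioned in this answer

NAMES = ["Roger Sand", "Roger Gilbert", "Cindy Sand", "Cindy Gilbert",
         "Jean-Roger Sands", "Sand Roger"]

# Hypothetical stand-in for the routing hash (NOT Elasticsearch's real one).
def shard_for(id, shards = NUMBER_OF_SHARDS)
  Zlib.crc32(id.to_s) % shards
end

# Group the six names by the shard their ID would be routed to.
def placement(first_id)
  NAMES.each_with_index
       .group_by { |_name, i| shard_for(first_id + i) }
       .transform_values { |pairs| pairs.map(&:first) }
end

[84046, 84047].each do |first_id|
  puts "first_id = #{first_id}"
  placement(first_id).sort.each do |shard, names|
    puts "  shard #{shard}: #{names.join(', ')}"
  end
end
```

Documents that share a shard also share that shard's document count and term frequencies, which is what feeds into the score.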

You can verify this by adding explain to your query. For Sand Roger, the idf is computed as idf(docFreq=1, maxDocs=1) = 0.30685282 in one run and as idf(docFreq=1, maxDocs=2) = 1 in the other, which leads to different scores.
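
The two idf values from the explain output can be reproduced by hand. The sketch below assumes the classic Lucene TF-IDF similarity, where idf = 1 + ln(maxDocs / (docFreq + 1)); the method name idf is just an illustrative helper.

```ruby
# Classic Lucene TF-IDF inverse document frequency (assumed similarity):
#   idf = 1 + ln(maxDocs / (docFreq + 1))
def idf(doc_freq, max_docs)
  1.0 + Math.log(max_docs.to_f / (doc_freq + 1))
end

# Term found in 1 document on a shard that holds 1 document
puts idf(1, 1).round(8) # about 0.30685282

# Term found in 1 document on a shard that holds 2 documents
puts idf(1, 2)          # 1.0
```

The term and its document frequency are the same in both cases; only the number of documents on the shard changes, and that alone moves the idf from roughly 0.31 to 1.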

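You can either set the number of shards to 1 or change the search type to one of the dfs types. Searching against http://localhost:9200/test/user/_search?pretty&search_type=dfs_query_and_fetch will give you the correct scores, since it performs an

initial scatter phase which goes and computes the distributed term frequencies for more accurate scoring

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-search-type.html#dfs-query-and-fetch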
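
For the Ruby script from the question, the two options could look roughly like the request hashes below. This is a sketch, not a tested setup: the constant names are made up for illustration, the calls that need a running node are left as comments, and the search type is the one named in this answer (the same documentation page also lists dfs_query_then_fetch).

```ruby
require 'json'

# Option 1: create the "test" index with a single shard, so every document
# is scored against the same term statistics.
SINGLE_SHARD_INDEX = {
  :index => 'test',
  :body  => { :settings => { :index => { :number_of_shards => 1 } } }
}

# Option 2: keep the default shards, but request a dfs search type so term
# frequencies are gathered from all shards before scoring.
DFS_SEARCH = {
  :index => 'test', :type => 'user',
  :search_type => 'dfs_query_and_fetch',
  :body => { :query => { :match => { :full_name => "Roger Sand" } } }
}

puts JSON.pretty_generate(SINGLE_SHARD_INDEX[:body])
puts JSON.pretty_generate(DFS_SEARCH[:body])

# With a running node and the client from the question:
#   client.indices.create(SINGLE_SHARD_INDEX)
#   client.search(DFS_SEARCH)
```

A single shard is the simpler choice for a small test like this one; the dfs search type keeps the index layout unchanged at the cost of an extra round trip to the shards.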


2014-01-29 10:04:44 knutwalker

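The query type solved my problem. Thanks! – pierallard

Scoring will always be unreliable with a small data set and the default Elasticsearch index setting of 5 shards.

For tests like this, either use an index with a single shard or use a much larger data set, so that the term distribution across shards is more balanced.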


2014-01-29 09:59:56 karmi
