2012-11-04 29 views
2

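elasticsearch - return term frequency of a single field

I have been trying to use a facet to get the term frequencies of a field. My query returns only one hit, so I would like the facet to return the terms with the highest frequency within a particular field.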

My mapping:

{ 
"mappings":{ 
    "document":{ 
     "properties":{ 
      "tags":{ 
       "type":"object", 
       "properties":{ 
        "title":{ 
         "fields":{ 
          "partial":{ 
           "search_analyzer":"main", 
           "index_analyzer":"partial", 
           "type":"string", 
           "index" : "analyzed" 
          }, 
          "title":{ 
           "type":"string", 
           "analyzer":"main", 
           "index" : "analyzed" 
          } 
         }, 
         "type":"multi_field" 
        } 
       } 
      } 
     } 
    } 
}, 

"settings":{ 
    "analysis":{ 
     "filter":{ 
      "name_ngrams":{ 
       "side":"front", 
       "max_gram":50, 
       "min_gram":2, 
       "type":"edgeNGram" 
      } 
     }, 

     "analyzer":{ 
      "main":{ 
       "filter": ["standard", "lowercase", "asciifolding"], 
       "type": "custom", 
       "tokenizer": "standard" 
      }, 
      "partial":{ 
       "filter":["standard","lowercase","asciifolding","name_ngrams"], 
       "type": "custom", 
       "tokenizer": "standard" 
      } 
     } 
    } 
} 

} 

Test data:

curl -XPOST localhost:9200/testindex/document -d '{"tags": {"title": "people also kill people"}}' 

Query:

curl -XGET 'localhost:9200/testindex/document/_search?pretty=1' -d ' 
{ 
    "query": 
    { 
     "term": { "tags.title": "people" } 
    }, 
    "facets": { 
     "popular_tags": { "terms": {"field": "tags.title"}} 
    } 
}' 

This gives the following result:

"hits" : { 
    "total" : 1, 
    "max_score" : 0.99381393, 
    "hits" : [ { 
    "_index" : "testindex", 
    "_type" : "document", 
    "_id" : "uI5k0wggR9KAvG9o7S7L2g", 
    "_score" : 0.99381393, "_source" : {"tags": {"title": "people also kill people"}} 
} ] 
}, 
"facets" : { 
    "popular_tags" : { 
    "_type" : "terms", 
    "missing" : 0, 
    "total" : 3, 
    "other" : 0, 
    "terms" : [ { 
    "term" : "people", 
    "count" : 1   // I expect this to be 2 
    }, { 
    "term" : "kill", 
    "count" : 1 
    }, { 
    "term" : "also", 
    "count" : 1 
    } ] 
} 

}

The result above is not what I want. I want the frequency count for "people" to be 2:

"hits" : { 
    "total" : 1, 
    "max_score" : 0.99381393, 
    "hits" : [ { 
    "_index" : "testindex", 
    "_type" : "document", 
    "_id" : "uI5k0wggR9KAvG9o7S7L2g", 
    "_score" : 0.99381393, "_source" : {"tags": {"title": "people also kill people"}} 
} ] 
}, 
"facets" : { 
"popular_tags" : { 
    "_type" : "terms", 
    "missing" : 0, 
    "total" : 3, 
    "other" : 0, 
    "terms" : [ { 
    "term" : "people", 
    "count" : 2    
    }, { 
    "term" : "kill", 
    "count" : 1 
    }, { 
    "term" : "also", 
    "count" : 1 
    } ] 
} 
} 

How can I do this? Are facets the wrong way to go?

+0

May I know whether my answer was helpful? – javanna

+0

Yes, it was really helpful – Kennedy

Answers

6

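A facet counts documents, not the terms that belong to them. You get 1 because only one document contains that term; how many times the term occurs in it does not matter. I am not aware of an out-of-the-box way to return term frequencies, but a facet is not a good choice for this.
That information can be stored in the index if you enable term vectors, but for now there is no way to read term vectors from elasticsearch.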
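
To make the difference concrete, here is a minimal Python sketch that runs outside elasticsearch. It assumes a crude tokenizer (lowercase plus whitespace split) as a stand-in for the "main" analyzer, and the helper names `tokenize`, `document_counts` and `term_frequencies` are made up for illustration. It contrasts the per-document count that a terms facet reports with the number of occurrences inside one document.

```python
from collections import Counter


def tokenize(text):
    # Rough stand-in for the "main" analyzer (assumption): lowercase, split on whitespace.
    return text.lower().split()


def document_counts(docs):
    # Facet-style count: each document contributes at most 1 per term.
    counts = Counter()
    for text in docs:
        counts.update(set(tokenize(text)))
    return counts


def term_frequencies(text):
    # Occurrences of each term inside a single document.
    return Counter(tokenize(text))


docs = ["people also kill people"]
facet_like = document_counts(docs)
per_doc = term_frequencies(docs[0])

print(facet_like["people"])  # number of documents containing "people"
print(per_doc["people"])     # occurrences of "people" in the first document
```

Since the query in the question returns a single hit, computing the frequency client-side from `_source` along these lines may be the simplest workaround, with the caveat that the tokens will only match the index if the tokenization mirrors the analyzer.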

+0

Is there a way to do this without using facets? – brycemcd

+3

Term vectors are exposed as of 1.0 (beta2 is available), but you do need to store term_vectors: http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-termvectors.html – javanna
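
As a rough sketch of what that could look like on the client side, the Python below reads per-document frequencies out of a term vectors response. Several things here are assumptions rather than something confirmed in this thread: the `sample_response` is hand-written to follow the documented response shape (`term_vectors` → field → `terms` → `term_freq`) and is not real server output, the field key `tags.title` may differ for a multi_field, and the helper name `term_freqs` is invented. The exact endpoint name also differs between elasticsearch versions, so check the linked documentation for yours.

```python
# Mapping fragment (assumption): storing term vectors for the field.
title_mapping = {
    "type": "string",
    "analyzer": "main",
    "term_vector": "yes",
}

# Hand-written example shaped like a term vectors response; not real output.
sample_response = {
    "term_vectors": {
        "tags.title": {
            "terms": {
                "people": {"term_freq": 2},
                "also": {"term_freq": 1},
                "kill": {"term_freq": 1},
            }
        }
    }
}


def term_freqs(response, field):
    # Return {term: term_freq} for one field, or {} if the field is absent.
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {term: info["term_freq"] for term, info in terms.items()}


freqs = term_freqs(sample_response, "tags.title")
# Highest frequency first; ties broken alphabetically.
ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
```

The ranked list has the same shape as the facet output the question asks for, with "people" counted twice.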

0

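Unfortunately, term frequencies for a field are not available in Elastic. The GitHub project Index TermList uses Lucene's terms and computes the total number of occurrences across all documents; you can check it out and adapt it to your needs.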