5

I need to autocomplete phrases. For example, when I search for "dementia in alz", I'd like to get "dementia in alzheimer's", i.e. Edge NGram combined with phrase matching.

For this I configured an Edge NGram tokenizer. In the query body I tried both edge_ngram_analyzer and standard as the analyzer. Nevertheless, when I try to match a phrase, I get no results.

What am I doing wrong?

My query:

{ 
    "query":{ 
    "multi_match":{ 
     "query":"dementia in alz", 
     "type":"phrase", 
     "analyzer":"edge_ngram_analyzer", 
     "fields":["_all"] 
    } 
    } 
} 

My mapping:

... 
"type" : { 
    "_all" : { 
    "analyzer" : "edge_ngram_analyzer", 
    "search_analyzer" : "standard" 
    }, 
    "properties" : { 
    "field" : { 
     "type" : "string", 
     "analyzer" : "edge_ngram_analyzer", 
     "search_analyzer" : "standard" 
    }, 
... 
"settings" : { 
    ... 
    "analysis" : { 
    "filter" : { 
     "stem_possessive_filter" : { 
     "name" : "possessive_english", 
     "type" : "stemmer" 
     } 
    }, 
    "analyzer" : { 
     "edge_ngram_analyzer" : { 
     "filter" : [ "lowercase" ], 
     "tokenizer" : "edge_ngram_tokenizer" 
     } 
    }, 
    "tokenizer" : { 
     "edge_ngram_tokenizer" : { 
     "token_chars" : [ "letter", "digit", "whitespace" ], 
     "min_gram" : "2", 
     "type" : "edgeNGram", 
     "max_gram" : "25" 
     } 
    } 
    } 
    ... 

My documents:

{ 
    "_score": 1.1152233, 
    "_type": "Diagnosis", 
    "_id": "AVZLfHfBE5CzEm8aJ3Xp", 
    "_source": { 
    "@timestamp": "2016-08-02T13:40:48.665Z", 
    "type": "Diagnosis", 
    "Document_ID": "Diagnosis_1400541", 
    "Diagnosis": "F00.0 - Dementia in Alzheimer's disease with early onset", 
    "@version": "1", 
    }, 
    "_index": "carenotes" 
}, 
{ 
    "_score": 1.1152233, 
    "_type": "Diagnosis", 
    "_id": "AVZLfICrE5CzEm8aJ4Dc", 
    "_source": { 
    "@timestamp": "2016-08-02T13:40:51.240Z", 
    "type": "Diagnosis", 
    "Document_ID": "Diagnosis_1424351", 
    "Diagnosis": "F00.1 - Dementia in Alzheimer's disease with late onset", 
    "@version": "1", 
    }, 
    "_index": "carenotes" 
} 

The analyzed phrase "dementia in alzheimer":
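
A minimal sketch of how this output can be reproduced with the _analyze API (the curl form and the carenotes index name are assumptions based on the hits above; the exact request was not part of the original post):

    # assumed: Elasticsearch 2.x-style _analyze request against the carenotes index
    curl -XGET 'localhost:9200/carenotes/_analyze?analyzer=edge_ngram_analyzer&text=dementia+in+alzheimer&pretty'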

{ 
    "tokens": [ 
    { 
     "end_offset": 2, 
     "token": "de", 
     "type": "word", 
     "start_offset": 0, 
     "position": 0 
    }, 
    { 
     "end_offset": 3, 
     "token": "dem", 
     "type": "word", 
     "start_offset": 0, 
     "position": 1 
    }, 
    { 
     "end_offset": 4, 
     "token": "deme", 
     "type": "word", 
     "start_offset": 0, 
     "position": 2 
    }, 
    { 
     "end_offset": 5, 
     "token": "demen", 
     "type": "word", 
     "start_offset": 0, 
     "position": 3 
    }, 
    { 
     "end_offset": 6, 
     "token": "dement", 
     "type": "word", 
     "start_offset": 0, 
     "position": 4 
    }, 
    { 
     "end_offset": 7, 
     "token": "dementi", 
     "type": "word", 
     "start_offset": 0, 
     "position": 5 
    }, 
    { 
     "end_offset": 8, 
     "token": "dementia", 
     "type": "word", 
     "start_offset": 0, 
     "position": 6 
    }, 
    { 
     "end_offset": 9, 
     "token": "dementia ", 
     "type": "word", 
     "start_offset": 0, 
     "position": 7 
    }, 
    { 
     "end_offset": 10, 
     "token": "dementia i", 
     "type": "word", 
     "start_offset": 0, 
     "position": 8 
    }, 
    { 
     "end_offset": 11, 
     "token": "dementia in", 
     "type": "word", 
     "start_offset": 0, 
     "position": 9 
    }, 
    { 
     "end_offset": 12, 
     "token": "dementia in ", 
     "type": "word", 
     "start_offset": 0, 
     "position": 10 
    }, 
    { 
     "end_offset": 13, 
     "token": "dementia in a", 
     "type": "word", 
     "start_offset": 0, 
     "position": 11 
    }, 
    { 
     "end_offset": 14, 
     "token": "dementia in al", 
     "type": "word", 
     "start_offset": 0, 
     "position": 12 
    }, 
    { 
     "end_offset": 15, 
     "token": "dementia in alz", 
     "type": "word", 
     "start_offset": 0, 
     "position": 13 
    }, 
    { 
     "end_offset": 16, 
     "token": "dementia in alzh", 
     "type": "word", 
     "start_offset": 0, 
     "position": 14 
    }, 
    { 
     "end_offset": 17, 
     "token": "dementia in alzhe", 
     "type": "word", 
     "start_offset": 0, 
     "position": 15 
    }, 
    { 
     "end_offset": 18, 
     "token": "dementia in alzhei", 
     "type": "word", 
     "start_offset": 0, 
     "position": 16 
    }, 
    { 
     "end_offset": 19, 
     "token": "dementia in alzheim", 
     "type": "word", 
     "start_offset": 0, 
     "position": 17 
    }, 
    { 
     "end_offset": 20, 
     "token": "dementia in alzheime", 
     "type": "word", 
     "start_offset": 0, 
     "position": 18 
    }, 
    { 
     "end_offset": 21, 
     "token": "dementia in alzheimer", 
     "type": "word", 
     "start_offset": 0, 
     "position": 19 
    } 
    ] 
} 
+0

Have you tried using query_string instead of multi_match? Please let me know whether it solves your problem. –

+0

'query_string' searches the '_all' field by default, so it is the same as what I did with 'multi_match' and '"fields": ["_all"]'. Nevertheless, I tried it without success. I used the following query: '{'query': {'query_string': {'query': 'dementia in alz', 'phrase_slop': 0}}}' – trex

Answers

8

Many thanks to rendel who helped me find the right solution!

The solution of Andrei Stefan is not optimal.

Why? First, the absence of the lowercase filter in the search analyzer makes search inconvenient; the case would have to match exactly. A custom analyzer with the lowercase filter is needed instead of "analyzer": "keyword".

Second, the analysis part is wrong! During index time the string "F00.0 - Dementia in Alzheimer's disease with early onset" is analyzed by edge_ngram_analyzer. With that analyzer we get the following array of tokens for the analyzed string:

{ 
    "tokens": [ 
    { 
     "end_offset": 2, 
     "token": "f0", 
     "type": "word", 
     "start_offset": 0, 
     "position": 0 
    }, 
    { 
     "end_offset": 3, 
     "token": "f00", 
     "type": "word", 
     "start_offset": 0, 
     "position": 1 
    }, 
    { 
     "end_offset": 6, 
     "token": "0 ", 
     "type": "word", 
     "start_offset": 4, 
     "position": 2 
    }, 
    { 
     "end_offset": 9, 
     "token": " ", 
     "type": "word", 
     "start_offset": 7, 
     "position": 3 
    }, 
    { 
     "end_offset": 10, 
     "token": " d", 
     "type": "word", 
     "start_offset": 7, 
     "position": 4 
    }, 
    { 
     "end_offset": 11, 
     "token": " de", 
     "type": "word", 
     "start_offset": 7, 
     "position": 5 
    }, 
    { 
     "end_offset": 12, 
     "token": " dem", 
     "type": "word", 
     "start_offset": 7, 
     "position": 6 
    }, 
    { 
     "end_offset": 13, 
     "token": " deme", 
     "type": "word", 
     "start_offset": 7, 
     "position": 7 
    }, 
    { 
     "end_offset": 14, 
     "token": " demen", 
     "type": "word", 
     "start_offset": 7, 
     "position": 8 
    }, 
    { 
     "end_offset": 15, 
     "token": " dement", 
     "type": "word", 
     "start_offset": 7, 
     "position": 9 
    }, 
    { 
     "end_offset": 16, 
     "token": " dementi", 
     "type": "word", 
     "start_offset": 7, 
     "position": 10 
    }, 
    { 
     "end_offset": 17, 
     "token": " dementia", 
     "type": "word", 
     "start_offset": 7, 
     "position": 11 
    }, 
    { 
     "end_offset": 18, 
     "token": " dementia ", 
     "type": "word", 
     "start_offset": 7, 
     "position": 12 
    }, 
    { 
     "end_offset": 19, 
     "token": " dementia i", 
     "type": "word", 
     "start_offset": 7, 
     "position": 13 
    }, 
    { 
     "end_offset": 20, 
     "token": " dementia in", 
     "type": "word", 
     "start_offset": 7, 
     "position": 14 
    }, 
    { 
     "end_offset": 21, 
     "token": " dementia in ", 
     "type": "word", 
     "start_offset": 7, 
     "position": 15 
    }, 
    { 
     "end_offset": 22, 
     "token": " dementia in a", 
     "type": "word", 
     "start_offset": 7, 
     "position": 16 
    }, 
    { 
     "end_offset": 23, 
     "token": " dementia in al", 
     "type": "word", 
     "start_offset": 7, 
     "position": 17 
    }, 
    { 
     "end_offset": 24, 
     "token": " dementia in alz", 
     "type": "word", 
     "start_offset": 7, 
     "position": 18 
    }, 
    { 
     "end_offset": 25, 
     "token": " dementia in alzh", 
     "type": "word", 
     "start_offset": 7, 
     "position": 19 
    }, 
    { 
     "end_offset": 26, 
     "token": " dementia in alzhe", 
     "type": "word", 
     "start_offset": 7, 
     "position": 20 
    }, 
    { 
     "end_offset": 27, 
     "token": " dementia in alzhei", 
     "type": "word", 
     "start_offset": 7, 
     "position": 21 
    }, 
    { 
     "end_offset": 28, 
     "token": " dementia in alzheim", 
     "type": "word", 
     "start_offset": 7, 
     "position": 22 
    }, 
    { 
     "end_offset": 29, 
     "token": " dementia in alzheime", 
     "type": "word", 
     "start_offset": 7, 
     "position": 23 
    }, 
    { 
     "end_offset": 30, 
     "token": " dementia in alzheimer", 
     "type": "word", 
     "start_offset": 7, 
     "position": 24 
    }, 
    { 
     "end_offset": 33, 
     "token": "s ", 
     "type": "word", 
     "start_offset": 31, 
     "position": 25 
    }, 
    { 
     "end_offset": 34, 
     "token": "s d", 
     "type": "word", 
     "start_offset": 31, 
     "position": 26 
    }, 
    { 
     "end_offset": 35, 
     "token": "s di", 
     "type": "word", 
     "start_offset": 31, 
     "position": 27 
    }, 
    { 
     "end_offset": 36, 
     "token": "s dis", 
     "type": "word", 
     "start_offset": 31, 
     "position": 28 
    }, 
    { 
     "end_offset": 37, 
     "token": "s dise", 
     "type": "word", 
     "start_offset": 31, 
     "position": 29 
    }, 
    { 
     "end_offset": 38, 
     "token": "s disea", 
     "type": "word", 
     "start_offset": 31, 
     "position": 30 
    }, 
    { 
     "end_offset": 39, 
     "token": "s diseas", 
     "type": "word", 
     "start_offset": 31, 
     "position": 31 
    }, 
    { 
     "end_offset": 40, 
     "token": "s disease", 
     "type": "word", 
     "start_offset": 31, 
     "position": 32 
    }, 
    { 
     "end_offset": 41, 
     "token": "s disease ", 
     "type": "word", 
     "start_offset": 31, 
     "position": 33 
    }, 
    { 
     "end_offset": 42, 
     "token": "s disease w", 
     "type": "word", 
     "start_offset": 31, 
     "position": 34 
    }, 
    { 
     "end_offset": 43, 
     "token": "s disease wi", 
     "type": "word", 
     "start_offset": 31, 
     "position": 35 
    }, 
    { 
     "end_offset": 44, 
     "token": "s disease wit", 
     "type": "word", 
     "start_offset": 31, 
     "position": 36 
    }, 
    { 
     "end_offset": 45, 
     "token": "s disease with", 
     "type": "word", 
     "start_offset": 31, 
     "position": 37 
    }, 
    { 
     "end_offset": 46, 
     "token": "s disease with ", 
     "type": "word", 
     "start_offset": 31, 
     "position": 38 
    }, 
    { 
     "end_offset": 47, 
     "token": "s disease with e", 
     "type": "word", 
     "start_offset": 31, 
     "position": 39 
    }, 
    { 
     "end_offset": 48, 
     "token": "s disease with ea", 
     "type": "word", 
     "start_offset": 31, 
     "position": 40 
    }, 
    { 
     "end_offset": 49, 
     "token": "s disease with ear", 
     "type": "word", 
     "start_offset": 31, 
     "position": 41 
    }, 
    { 
     "end_offset": 50, 
     "token": "s disease with earl", 
     "type": "word", 
     "start_offset": 31, 
     "position": 42 
    }, 
    { 
     "end_offset": 51, 
     "token": "s disease with early", 
     "type": "word", 
     "start_offset": 31, 
     "position": 43 
    }, 
    { 
     "end_offset": 52, 
     "token": "s disease with early ", 
     "type": "word", 
     "start_offset": 31, 
     "position": 44 
    }, 
    { 
     "end_offset": 53, 
     "token": "s disease with early o", 
     "type": "word", 
     "start_offset": 31, 
     "position": 45 
    }, 
    { 
     "end_offset": 54, 
     "token": "s disease with early on", 
     "type": "word", 
     "start_offset": 31, 
     "position": 46 
    }, 
    { 
     "end_offset": 55, 
     "token": "s disease with early ons", 
     "type": "word", 
     "start_offset": 31, 
     "position": 47 
    }, 
    { 
     "end_offset": 56, 
     "token": "s disease with early onse", 
     "type": "word", 
     "start_offset": 31, 
     "position": 48 
    } 
    ] 
} 

As you can see, the whole string is tokenized into tokens from 2 to 25 characters in size. The string is tokenized in a linear way, together with all the whitespace, and the position is incremented by one for every new token.

There are several problems with it:

  1. The edge_ngram_analyzer produces useless tokens which will never be searched for, e.g. "0 ", " ", " d", "s d", "s disease w", etc.
  2. Moreover, it does not produce many useful tokens that could actually be used, e.g. "disease", "early onset", etc. If you try to search for any of those words, you will get 0 results.
  3. Notice that the last token is "s disease with early onse". Where is the final "t"? Because of "max_gram" : "25" we have "lost" some text in all fields. You can no longer search for that text, since there are no tokens for it.
  4. The trim filter only obfuscates the problem by filtering out extra whitespace, which could be done by the tokenizer instead.
  5. The edge_ngram_analyzer increments the position of each token, which is problematic for positional queries such as phrase queries. One should use the edge_ngram_filter instead, which preserves the position of the tokens when generating the ngrams.

The optimal solution.

The mappings to use:

... 
"mappings": { 
    "Type": { 
     "_all":{ 
      "analyzer": "edge_ngram_analyzer", 
      "search_analyzer": "keyword_analyzer" 
     }, 
     "properties": { 
      "Field": { 
      "search_analyzer": "keyword_analyzer", 
      "type": "string", 
      "analyzer": "edge_ngram_analyzer" 
      }, 
... 
... 
"settings": { 
    "analysis": { 
     "filter": { 
     "english_poss_stemmer": { 
      "type": "stemmer", 
      "name": "possessive_english" 
     }, 
     "edge_ngram": { 
      "type": "edgeNGram", 
      "min_gram": "2", 
      "max_gram": "25", 
      "token_chars": ["letter", "digit"] 
     } 
     }, 
     "analyzer": { 
     "edge_ngram_analyzer": { 
      "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"], 
      "tokenizer": "standard" 
     }, 
     "keyword_analyzer": { 
      "filter": ["lowercase", "english_poss_stemmer"], 
      "tokenizer": "standard" 
     } 
     } 
    } 
} 
... 
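
For reference, a sketch of how the settings and mappings fragments above might be combined into one index-creation request. The index name carenotes and the type/field name Diagnosis are assumptions taken from the documents in the question, and the ES 2.x-era string/_all mapping style is assumed:

    # index, type and field names are assumptions; adjust to your own index
    curl -XPUT 'localhost:9200/carenotes' -d '
    {
      "settings": {
        "analysis": {
          "filter": {
            "english_poss_stemmer": { "type": "stemmer", "name": "possessive_english" },
            "edge_ngram": { "type": "edgeNGram", "min_gram": "2", "max_gram": "25", "token_chars": ["letter", "digit"] }
          },
          "analyzer": {
            "edge_ngram_analyzer": { "tokenizer": "standard", "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"] },
            "keyword_analyzer": { "tokenizer": "standard", "filter": ["lowercase", "english_poss_stemmer"] }
          }
        }
      },
      "mappings": {
        "Diagnosis": {
          "_all": { "analyzer": "edge_ngram_analyzer", "search_analyzer": "keyword_analyzer" },
          "properties": {
            "Diagnosis": { "type": "string", "analyzer": "edge_ngram_analyzer", "search_analyzer": "keyword_analyzer" }
          }
        }
      }
    }'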

Have a look at the analysis:

{ 
    "tokens": [ 
    { 
     "end_offset": 5, 
     "token": "f0", 
     "type": "word", 
     "start_offset": 0, 
     "position": 0 
    }, 
    { 
     "end_offset": 5, 
     "token": "f00", 
     "type": "word", 
     "start_offset": 0, 
     "position": 0 
    }, 
    { 
     "end_offset": 5, 
     "token": "f00.", 
     "type": "word", 
     "start_offset": 0, 
     "position": 0 
    }, 
    { 
     "end_offset": 5, 
     "token": "f00.0", 
     "type": "word", 
     "start_offset": 0, 
     "position": 0 
    }, 
    { 
     "end_offset": 17, 
     "token": "de", 
     "type": "word", 
     "start_offset": 9, 
     "position": 2 
    }, 
    { 
     "end_offset": 17, 
     "token": "dem", 
     "type": "word", 
     "start_offset": 9, 
     "position": 2 
    }, 
    { 
     "end_offset": 17, 
     "token": "deme", 
     "type": "word", 
     "start_offset": 9, 
     "position": 2 
    }, 
    { 
     "end_offset": 17, 
     "token": "demen", 
     "type": "word", 
     "start_offset": 9, 
     "position": 2 
    }, 
    { 
     "end_offset": 17, 
     "token": "dement", 
     "type": "word", 
     "start_offset": 9, 
     "position": 2 
    }, 
    { 
     "end_offset": 17, 
     "token": "dementi", 
     "type": "word", 
     "start_offset": 9, 
     "position": 2 
    }, 
    { 
     "end_offset": 17, 
     "token": "dementia", 
     "type": "word", 
     "start_offset": 9, 
     "position": 2 
    }, 
    { 
     "end_offset": 20, 
     "token": "in", 
     "type": "word", 
     "start_offset": 18, 
     "position": 3 
    }, 
    { 
     "end_offset": 32, 
     "token": "al", 
     "type": "word", 
     "start_offset": 21, 
     "position": 4 
    }, 
    { 
     "end_offset": 32, 
     "token": "alz", 
     "type": "word", 
     "start_offset": 21, 
     "position": 4 
    }, 
    { 
     "end_offset": 32, 
     "token": "alzh", 
     "type": "word", 
     "start_offset": 21, 
     "position": 4 
    }, 
    { 
     "end_offset": 32, 
     "token": "alzhe", 
     "type": "word", 
     "start_offset": 21, 
     "position": 4 
    }, 
    { 
     "end_offset": 32, 
     "token": "alzhei", 
     "type": "word", 
     "start_offset": 21, 
     "position": 4 
    }, 
    { 
     "end_offset": 32, 
     "token": "alzheim", 
     "type": "word", 
     "start_offset": 21, 
     "position": 4 
    }, 
    { 
     "end_offset": 32, 
     "token": "alzheime", 
     "type": "word", 
     "start_offset": 21, 
     "position": 4 
    }, 
    { 
     "end_offset": 32, 
     "token": "alzheimer", 
     "type": "word", 
     "start_offset": 21, 
     "position": 4 
    }, 
    { 
     "end_offset": 40, 
     "token": "di", 
     "type": "word", 
     "start_offset": 33, 
     "position": 5 
    }, 
    { 
     "end_offset": 40, 
     "token": "dis", 
     "type": "word", 
     "start_offset": 33, 
     "position": 5 
    }, 
    { 
     "end_offset": 40, 
     "token": "dise", 
     "type": "word", 
     "start_offset": 33, 
     "position": 5 
    }, 
    { 
     "end_offset": 40, 
     "token": "disea", 
     "type": "word", 
     "start_offset": 33, 
     "position": 5 
    }, 
    { 
     "end_offset": 40, 
     "token": "diseas", 
     "type": "word", 
     "start_offset": 33, 
     "position": 5 
    }, 
    { 
     "end_offset": 40, 
     "token": "disease", 
     "type": "word", 
     "start_offset": 33, 
     "position": 5 
    }, 
    { 
     "end_offset": 45, 
     "token": "wi", 
     "type": "word", 
     "start_offset": 41, 
     "position": 6 
    }, 
    { 
     "end_offset": 45, 
     "token": "wit", 
     "type": "word", 
     "start_offset": 41, 
     "position": 6 
    }, 
    { 
     "end_offset": 45, 
     "token": "with", 
     "type": "word", 
     "start_offset": 41, 
     "position": 6 
    }, 
    { 
     "end_offset": 51, 
     "token": "ea", 
     "type": "word", 
     "start_offset": 46, 
     "position": 7 
    }, 
    { 
     "end_offset": 51, 
     "token": "ear", 
     "type": "word", 
     "start_offset": 46, 
     "position": 7 
    }, 
    { 
     "end_offset": 51, 
     "token": "earl", 
     "type": "word", 
     "start_offset": 46, 
     "position": 7 
    }, 
    { 
     "end_offset": 51, 
     "token": "early", 
     "type": "word", 
     "start_offset": 46, 
     "position": 7 
    }, 
    { 
     "end_offset": 57, 
     "token": "on", 
     "type": "word", 
     "start_offset": 52, 
     "position": 8 
    }, 
    { 
     "end_offset": 57, 
     "token": "ons", 
     "type": "word", 
     "start_offset": 52, 
     "position": 8 
    }, 
    { 
     "end_offset": 57, 
     "token": "onse", 
     "type": "word", 
     "start_offset": 52, 
     "position": 8 
    }, 
    { 
     "end_offset": 57, 
     "token": "onset", 
     "type": "word", 
     "start_offset": 52, 
     "position": 8 
    } 
    ] 
} 

At index time the text is tokenized by the standard tokenizer, then the separate words are filtered by the lowercase, possessive_english and edge_ngram filters; tokens are produced only from words. At search time the text is tokenized by the standard tokenizer, then the separate words are filtered by lowercase and possessive_english. The searched words are matched against the tokens that were created during index time.
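
A quick way to check this (a sketch, again assuming the carenotes index and the 2.x _analyze syntax) is to run the search string through keyword_analyzer; it should come out as plain word tokens such as dementia, in, alz, which then match the indexed ngram tokens at consecutive positions:

    # assumed index name; keyword_analyzer is the search analyzer defined in the settings above
    curl -XGET 'localhost:9200/carenotes/_analyze?analyzer=keyword_analyzer&text=dementia+in+alz&pretty'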

Thus we make incremental search possible!

Now, because we do the ngrams on separate words, we can even execute queries like

{ 
    'query': { 
    'multi_match': { 
     'query': 'dem in alzh', 
     'type': 'phrase', 
     'fields': ['_all'] 
    } 
    } 
} 

and get correct results.

No text is "lost", everything is searchable, and there is no need to handle whitespace with the trim filter.
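
For instance, a phrase that had no usable tokens under the old approach, such as "early onset" from problem 2 above, can now be matched with a query of the same shape (a sketch, not part of the original answer; the index name is assumed):

    # "early onset" is just an example phrase taken from the problem list above
    curl -XGET 'localhost:9200/carenotes/_search?pretty' -d '
    {
      "query": {
        "multi_match": {
          "query": "early onset",
          "type": "phrase",
          "fields": ["_all"]
        }
      }
    }'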

+0

I wouldn't have had the time to provide such an elaborate solution, but I appreciate you taking the time to report your findings. At least I was able to help you spot an initial problem. Cheers! –

+0

Many thanks @trex, I had the same requirement and just set this up. –

+0

There is a syntax issue with the curly braces in the mapping; could that be why the solution doesn't work for us? – tina

6

I believe your query is wrong: while you need nGrams at indexing time, you don't need them at search time. At search time you need the text to be as "fixed" as possible. Try this query instead:

{ 
    "query": { 
    "multi_match": { 
     "query": " dementia in alz", 
     "analyzer": "keyword", 
     "fields": [ 
     "_all" 
     ] 
    } 
    } 
} 

Notice the two whitespaces found in front of dementia. Those are accounted for by your analyzer from the text. To get rid of them you need the trim token filter:

"edge_ngram_analyzer": { 
     "filter": [ 
     "lowercase","trim" 
     ], 
     "tokenizer": "edge_ngram_tokenizer" 
    } 

And then this query will work (no whitespaces in front of dementia):

{ 
    "query": { 
    "multi_match": { 
     "query": "dementia in alz", 
     "analyzer": "keyword", 
     "fields": [ 
     "_all" 
     ] 
    } 
    } 
} 
+0

I just tried it, 0 results. – trex

+0

Also, I repeated the test with the 'lowercase' filter, adding the following analyzer to the existing mapping: '"keyword_analyzer": {"filter": ["lowercase"], "tokenizer": "keyword"}'. The query was '{'query': {'multi_match': {'query': 'dementia in alz', 'analyzer': 'keyword_analyzer', 'fields': ['_all']}}}'. No success :-( – trex

+0

I would need to see a complete document in order to test the **complete** mapping of the index. Please use github –