5

I need to autocomplete phrases. For example, when I search for "dementia in alz", I'd like to get "dementia in alzheimer's", i.e. Edge NGram combined with phrase matching.

For this I configured an Edge NGram tokenizer. In the query body I tried both edge_ngram_analyzer and standard as the analyzer. Nevertheless, when I try to match a phrase, I get no results.

What am I doing wrong?

My query:

{ 
    "query":{ 
    "multi_match":{ 
     "query":"dementia in alz", 
     "type":"phrase", 
     "analyzer":"edge_ngram_analyzer", 
     "fields":["_all"] 
    } 
    } 
} 

My mapping:

... 
"type" : { 
    "_all" : { 
    "analyzer" : "edge_ngram_analyzer", 
    "search_analyzer" : "standard" 
    }, 
    "properties" : { 
    "field" : { 
     "type" : "string", 
     "analyzer" : "edge_ngram_analyzer", 
     "search_analyzer" : "standard" 
    }, 
... 
"settings" : { 
    ... 
    "analysis" : { 
    "filter" : { 
     "stem_possessive_filter" : { 
     "name" : "possessive_english", 
     "type" : "stemmer" 
     } 
    }, 
    "analyzer" : { 
     "edge_ngram_analyzer" : { 
     "filter" : [ "lowercase" ], 
     "tokenizer" : "edge_ngram_tokenizer" 
     } 
    }, 
    "tokenizer" : { 
     "edge_ngram_tokenizer" : { 
     "token_chars" : [ "letter", "digit", "whitespace" ], 
     "min_gram" : "2", 
     "type" : "edgeNGram", 
     "max_gram" : "25" 
     } 
    } 
    } 
    ... 

My documents:

{ 
    "_score": 1.1152233, 
    "_type": "Diagnosis", 
    "_id": "AVZLfHfBE5CzEm8aJ3Xp", 
    "_source": { 
    "@timestamp": "2016-08-02T13:40:48.665Z", 
    "type": "Diagnosis", 
    "Document_ID": "Diagnosis_1400541", 
    "Diagnosis": "F00.0 - Dementia in Alzheimer's disease with early onset", 
    "@version": "1", 
    }, 
    "_index": "carenotes" 
}, 
{ 
    "_score": 1.1152233, 
    "_type": "Diagnosis", 
    "_id": "AVZLfICrE5CzEm8aJ4Dc", 
    "_source": { 
    "@timestamp": "2016-08-02T13:40:51.240Z", 
    "type": "Diagnosis", 
    "Document_ID": "Diagnosis_1424351", 
    "Diagnosis": "F00.1 - Dementia in Alzheimer's disease with late onset", 
    "@version": "1", 
    }, 
    "_index": "carenotes" 
} 

The analyzed phrase "dementia in alzheimer":
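
A minimal sketch of how this output can be reproduced with the _analyze API (the curl form and the carenotes index name are assumptions based on the hits above; the exact request was not part of the original post):

    # assumed: Elasticsearch 2.x-style _analyze request against the carenotes index
    curl -XGET 'localhost:9200/carenotes/_analyze?analyzer=edge_ngram_analyzer&text=dementia+in+alzheimer&pretty'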

{ 
    "tokens": [ 
    { 
     "end_offset": 2, 
     "token": "de", 
     "type": "word", 
     "start_offset": 0, 
     "position": 0 
    }, 
    { 
     "end_offset": 3, 
     "token": "dem", 
     "type": "word", 
     "start_offset": 0, 
     "position": 1 
    }, 
    { 
     "end_offset": 4, 
     "token": "deme", 
     "type": "word", 
     "start_offset": 0, 
     "position": 2 
    }, 
    { 
     "end_offset": 5, 
     "token": "demen", 
     "type": "word", 
     "start_offset": 0, 
     "position": 3 
    }, 
    { 
     "end_offset": 6, 
     "token": "dement", 
     "type": "word", 
     "start_offset": 0, 
     "position": 4 
    }, 
    { 
     "end_offset": 7, 
     "token": "dementi", 
     "type": "word", 
     "start_offset": 0, 
     "position": 5 
    }, 
    { 
     "end_offset": 8, 
     "token": "dementia", 
     "type": "word", 
     "start_offset": 0, 
     "position": 6 
    }, 
    { 
     "end_offset": 9, 
     "token": "dementia ", 
     "type": "word", 
     "start_offset": 0, 
     "position": 7 
    }, 
    { 
     "end_offset": 10, 
     "token": "dementia i", 
     "type": "word", 
     "start_offset": 0, 
     "position": 8 
    }, 
    { 
     "end_offset": 11, 
     "token": "dementia in", 
     "type": "word", 
     "start_offset": 0, 
     "position": 9 
    }, 
    { 
     "end_offset": 12, 
     "token": "dementia in ", 
     "type": "word", 
     "start_offset": 0, 
     "position": 10 
    }, 
    { 
     "end_offset": 13, 
     "token": "dementia in a", 
     "type": "word", 
     "start_offset": 0, 
     "position": 11 
    }, 
    { 
     "end_offset": 14, 
     "token": "dementia in al", 
     "type": "word", 
     "start_offset": 0, 
     "position": 12 
    }, 
    { 
     "end_offset": 15, 
     "token": "dementia in alz", 
     "type": "word", 
     "start_offset": 0, 
     "position": 13 
    }, 
    { 
     "end_offset": 16, 
     "token": "dementia in alzh", 
     "type": "word", 
     "start_offset": 0, 
     "position": 14 
    }, 
    { 
     "end_offset": 17, 
     "token": "dementia in alzhe", 
     "type": "word", 
     "start_offset": 0, 
     "position": 15 
    }, 
    { 
     "end_offset": 18, 
     "token": "dementia in alzhei", 
     "type": "word", 
     "start_offset": 0, 
     "position": 16 
    }, 
    { 
     "end_offset": 19, 
     "token": "dementia in alzheim", 
     "type": "word", 
     "start_offset": 0, 
     "position": 17 
    }, 
    { 
     "end_offset": 20, 
     "token": "dementia in alzheime", 
     "type": "word", 
     "start_offset": 0, 
     "position": 18 
    }, 
    { 
     "end_offset": 21, 
     "token": "dementia in alzheimer", 
     "type": "word", 
     "start_offset": 0, 
     "position": 19 
    } 
    ] 
} 
+0

Have you tried using query_string instead of multi_match? Please let me know whether it solves your problem. –

+0

'query_string' searches the '_all' field by default, so it is the same as what I did with 'multi_match' and '"fields": ["_all"]'. Nevertheless, I tried it without success. I used the following query: '{'query': {'query_string': {'query': 'dementia in alz', 'phrase_slop': 0}}}' – trex

Answers

8

Many thanks to rendel who helped me find the right solution!

The solution of Andrei Stefan is not optimal.

Why? First, the absence of the lowercase filter in the search analyzer makes search inconvenient; the case would have to match exactly. A custom analyzer with the lowercase filter is needed instead of "analyzer": "keyword".

Second, the analysis part is wrong! During index time the string "F00.0 - Dementia in Alzheimer's disease with early onset" is analyzed by edge_ngram_analyzer. With that analyzer we get the following array of tokens for the analyzed string:

{ 
    "tokens": [ 
    { 
     "end_offset": 2, 
     "token": "f0", 
     "type": "word", 
     "start_offset": 0, 
     "position": 0 
    }, 
    { 
     "end_offset": 3, 
     "token": "f00", 
     "type": "word", 
     "start_offset": 0, 
     "position": 1 
    }, 
    { 
     "end_offset": 6, 
     "token": "0 ", 
     "type": "word", 
     "start_offset": 4, 
     "position": 2 
    }, 
    { 
     "end_offset": 9, 
     "token": " ", 
     "type": "word", 
     "start_offset": 7, 
     "position": 3 
    }, 
    { 
     "end_offset": 10, 
     "token": " d", 
     "type": "word", 
     "start_offset": 7, 
     "position": 4 
    }, 
    { 
     "end_offset": 11, 
     "token": " de", 
     "type": "word", 
     "start_offset": 7, 
     "position": 5 
    }, 
    { 
     "end_offset": 12, 
     "token": " dem", 
     "type": "word", 
     "start_offset": 7, 
     "position": 6 
    }, 
    { 
     "end_offset": 13, 
     "token": " deme", 
     "type": "word", 
     "start_offset": 7, 
     "position": 7 
    }, 
    { 
     "end_offset": 14, 
     "token": " demen", 
     "type": "word", 
     "start_offset": 7, 
     "position": 8 
    }, 
    { 
     "end_offset": 15, 
     "token": " dement", 
     "type": "word", 
     "start_offset": 7, 
     "position": 9 
    }, 
    { 
     "end_offset": 16, 
     "token": " dementi", 
     "type": "word", 
     "start_offset": 7, 
     "position": 10 
    }, 
    { 
     "end_offset": 17, 
     "token": " dementia", 
     "type": "word", 
     "start_offset": 7, 
     "position": 11 
    }, 
    { 
     "end_offset": 18, 
     "token": " dementia ", 
     "type": "word", 
     "start_offset": 7, 
     "position": 12 
    }, 
    { 
     "end_offset": 19, 
     "token": " dementia i", 
     "type": "word", 
     "start_offset": 7, 
     "position": 13 
    }, 
    { 
     "end_offset": 20, 
     "token": " dementia in", 
     "type": "word", 
     "start_offset": 7, 
     "position": 14 
    }, 
    { 
     "end_offset": 21, 
     "token": " dementia in ", 
     "type": "word", 
     "start_offset": 7, 
     "position": 15 
    }, 
    { 
     "end_offset": 22, 
     "token": " dementia in a", 
     "type": "word", 
     "start_offset": 7, 
     "position": 16 
    }, 
    { 
     "end_offset": 23, 
     "token": " dementia in al", 
     "type": "word", 
     "start_offset": 7, 
     "position": 17 
    }, 
    { 
     "end_offset": 24, 
     "token": " dementia in alz", 
     "type": "word", 
     "start_offset": 7, 
     "position": 18 
    }, 
    { 
     "end_offset": 25, 
     "token": " dementia in alzh", 
     "type": "word", 
     "start_offset": 7, 
     "position": 19 
    }, 
    { 
     "end_offset": 26, 
     "token": " dementia in alzhe", 
     "type": "word", 
     "start_offset": 7, 
     "position": 20 
    }, 
    { 
     "end_offset": 27, 
     "token": " dementia in alzhei", 
     "type": "word", 
     "start_offset": 7, 
     "position": 21 
    }, 
    { 
     "end_offset": 28, 
     "token": " dementia in alzheim", 
     "type": "word", 
     "start_offset": 7, 
     "position": 22 
    }, 
    { 
     "end_offset": 29, 
     "token": " dementia in alzheime", 
     "type": "word", 
     "start_offset": 7, 
     "position": 23 
    }, 
    { 
     "end_offset": 30, 
     "token": " dementia in alzheimer", 
     "type": "word", 
     "start_offset": 7, 
     "position": 24 
    }, 
    { 
     "end_offset": 33, 
     "token": "s ", 
     "type": "word", 
     "start_offset": 31, 
     "position": 25 
    }, 
    { 
     "end_offset": 34, 
     "token": "s d", 
     "type": "word", 
     "start_offset": 31, 
     "position": 26 
    }, 
    { 
     "end_offset": 35, 
     "token": "s di", 
     "type": "word", 
     "start_offset": 31, 
     "position": 27 
    }, 
    { 
     "end_offset": 36, 
     "token": "s dis", 
     "type": "word", 
     "start_offset": 31, 
     "position": 28 
    }, 
    { 
     "end_offset": 37, 
     "token": "s dise", 
     "type": "word", 
     "start_offset": 31, 
     "position": 29 
    }, 
    { 
     "end_offset": 38, 
     "token": "s disea", 
     "type": "word", 
     "start_offset": 31, 
     "position": 30 
    }, 
    { 
     "end_offset": 39, 
     "token": "s diseas", 
     "type": "word", 
     "start_offset": 31, 
     "position": 31 
    }, 
    { 
     "end_offset": 40, 
     "token": "s disease", 
     "type": "word", 
     "start_offset": 31, 
     "position": 32 
    }, 
    { 
     "end_offset": 41, 
     "token": "s disease ", 
     "type": "word", 
     "start_offset": 31, 
     "position": 33 
    }, 
    { 
     "end_offset": 42, 
     "token": "s disease w", 
     "type": "word", 
     "start_offset": 31, 
     "position": 34 
    }, 
    { 
     "end_offset": 43, 
     "token": "s disease wi", 
     "type": "word", 
     "start_offset": 31, 
     "position": 35 
    }, 
    { 
     "end_offset": 44, 
     "token": "s disease wit", 
     "type": "word", 
     "start_offset": 31, 
     "position": 36 
    }, 
    { 
     "end_offset": 45, 
     "token": "s disease with", 
     "type": "word", 
     "start_offset": 31, 
     "position": 37 
    }, 
    { 
     "end_offset": 46, 
     "token": "s disease with ", 
     "type": "word", 
     "start_offset": 31, 
     "position": 38 
    }, 
    { 
     "end_offset": 47, 
     "token": "s disease with e", 
     "type": "word", 
     "start_offset": 31, 
     "position": 39 
    }, 
    { 
     "end_offset": 48, 
     "token": "s disease with ea", 
     "type": "word", 
     "start_offset": 31, 
     "position": 40 
    }, 
    { 
     "end_offset": 49, 
     "token": "s disease with ear", 
     "type": "word", 
     "start_offset": 31, 
     "position": 41 
    }, 
    { 
     "end_offset": 50, 
     "token": "s disease with earl", 
     "type": "word", 
     "start_offset": 31, 
     "position": 42 
    }, 
    { 
     "end_offset": 51, 
     "token": "s disease with early", 
     "type": "word", 
     "start_offset": 31, 
     "position": 43 
    }, 
    { 
     "end_offset": 52, 
     "token": "s disease with early ", 
     "type": "word", 
     "start_offset": 31, 
     "position": 44 
    }, 
    { 
     "end_offset": 53, 
     "token": "s disease with early o", 
     "type": "word", 
     "start_offset": 31, 
     "position": 45 
    }, 
    { 
     "end_offset": 54, 
     "token": "s disease with early on", 
     "type": "word", 
     "start_offset": 31, 
     "position": 46 
    }, 
    { 
     "end_offset": 55, 
     "token": "s disease with early ons", 
     "type": "word", 
     "start_offset": 31, 
     "position": 47 
    }, 
    { 
     "end_offset": 56, 
     "token": "s disease with early onse", 
     "type": "word", 
     "start_offset": 31, 
     "position": 48 
    } 
    ] 
} 

As you can see, the whole string is tokenized into tokens from 2 to 25 characters in size. The string is tokenized in a linear way, together with all the whitespace, and the position is incremented by one for every new token.

There are several problems with it:

  1. The edge_ngram_analyzer produces useless tokens which will never be searched for, e.g. "0 ", " ", " d", "s d", "s disease w", etc.
  2. Moreover, it does not produce many useful tokens that could actually be used, e.g. "disease", "early onset", etc. If you try to search for any of those words, you will get 0 results.
  3. Notice that the last token is "s disease with early onse". Where is the final "t"? Because of "max_gram" : "25" we have "lost" some text in all fields. You can no longer search for that text, since there are no tokens for it.
  4. The trim filter only obfuscates the problem by filtering out extra whitespace, which could be done by the tokenizer instead.
  5. The edge_ngram_analyzer increments the position of each token, which is problematic for positional queries such as phrase queries. One should use the edge_ngram_filter instead, which preserves the position of the tokens when generating the ngrams.

The optimal solution.

The mappings to use:

... 
"mappings": { 
    "Type": { 
     "_all":{ 
      "analyzer": "edge_ngram_analyzer", 
      "search_analyzer": "keyword_analyzer" 
     }, 
     "properties": { 
      "Field": { 
      "search_analyzer": "keyword_analyzer", 
      "type": "string", 
      "analyzer": "edge_ngram_analyzer" 
      }, 
... 
... 
"settings": { 
    "analysis": { 
     "filter": { 
     "english_poss_stemmer": { 
      "type": "stemmer", 
      "name": "possessive_english" 
     }, 
     "edge_ngram": { 
      "type": "edgeNGram", 
      "min_gram": "2", 
      "max_gram": "25", 
      "token_chars": ["letter", "digit"] 
     } 
     }, 
     "analyzer": { 
     "edge_ngram_analyzer": { 
      "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"], 
      "tokenizer": "standard" 
     }, 
     "keyword_analyzer": { 
      "filter": ["lowercase", "english_poss_stemmer"], 
      "tokenizer": "standard" 
     } 
     } 
    } 
} 
... 
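
For reference, a sketch of how the settings and mappings fragments above might be combined into one index-creation request. The index name carenotes and the type/field name Diagnosis are assumptions taken from the documents in the question, and the ES 2.x-era string/_all mapping style is assumed:

    # index, type and field names are assumptions; adjust to your own index
    curl -XPUT 'localhost:9200/carenotes' -d '
    {
      "settings": {
        "analysis": {
          "filter": {
            "english_poss_stemmer": { "type": "stemmer", "name": "possessive_english" },
            "edge_ngram": { "type": "edgeNGram", "min_gram": "2", "max_gram": "25", "token_chars": ["letter", "digit"] }
          },
          "analyzer": {
            "edge_ngram_analyzer": { "tokenizer": "standard", "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"] },
            "keyword_analyzer": { "tokenizer": "standard", "filter": ["lowercase", "english_poss_stemmer"] }
          }
        }
      },
      "mappings": {
        "Diagnosis": {
          "_all": { "analyzer": "edge_ngram_analyzer", "search_analyzer": "keyword_analyzer" },
          "properties": {
            "Diagnosis": { "type": "string", "analyzer": "edge_ngram_analyzer", "search_analyzer": "keyword_analyzer" }
          }
        }
      }
    }'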

Have a look at the analysis:

{ 
    "tokens": [ 
    { 
     "end_offset": 5, 
     "token": "f0", 
     "type": "word", 
     "start_offset": 0, 
     "position": 0 
    }, 
    { 
     "end_offset": 5, 
     "token": "f00", 
     "type": "word", 
     "start_offset": 0, 
     "position": 0 
    }, 
    { 
     "end_offset": 5, 
     "token": "f00.", 
     "type": "word", 
     "start_offset": 0, 
     "position": 0 
    }, 
    { 
     "end_offset": 5, 
     "token": "f00.0", 
     "type": "word", 
     "start_offset": 0, 
     "position": 0 
    }, 
    { 
     "end_offset": 17, 
     "token": "de", 
     "type": "word", 
     "start_offset": 9, 
     "position": 2 
    }, 
    { 
     "end_offset": 17, 
     "token": "dem", 
     "type": "word", 
     "start_offset": 9, 
     "position": 2 
    }, 
    { 
     "end_offset": 17, 
     "token": "deme", 
     "type": "word", 
     "start_offset": 9, 
     "position": 2 
    }, 
    { 
     "end_offset": 17, 
     "token": "demen", 
     "type": "word", 
     "start_offset": 9, 
     "position": 2 
    }, 
    { 
     "end_offset": 17, 
     "token": "dement", 
     "type": "word", 
     "start_offset": 9, 
     "position": 2 
    }, 
    { 
     "end_offset": 17, 
     "token": "dementi", 
     "type": "word", 
     "start_offset": 9, 
     "position": 2 
    }, 
    { 
     "end_offset": 17, 
     "token": "dementia", 
     "type": "word", 
     "start_offset": 9, 
     "position": 2 
    }, 
    { 
     "end_offset": 20, 
     "token": "in", 
     "type": "word", 
     "start_offset": 18, 
     "position": 3 
    }, 
    { 
     "end_offset": 32, 
     "token": "al", 
     "type": "word", 
     "start_offset": 21, 
     "position": 4 
    }, 
    { 
     "end_offset": 32, 
     "token": "alz", 
     "type": "word", 
     "start_offset": 21, 
     "position": 4 
    }, 
    { 
     "end_offset": 32, 
     "token": "alzh", 
     "type": "word", 
     "start_offset": 21, 
     "position": 4 
    }, 
    { 
     "end_offset": 32, 
     "token": "alzhe", 
     "type": "word", 
     "start_offset": 21, 
     "position": 4 
    }, 
    { 
     "end_offset": 32, 
     "token": "alzhei", 
     "type": "word", 
     "start_offset": 21, 
     "position": 4 
    }, 
    { 
     "end_offset": 32, 
     "token": "alzheim", 
     "type": "word", 
     "start_offset": 21, 
     "position": 4 
    }, 
    { 
     "end_offset": 32, 
     "token": "alzheime", 
     "type": "word", 
     "start_offset": 21, 
     "position": 4 
    }, 
    { 
     "end_offset": 32, 
     "token": "alzheimer", 
     "type": "word", 
     "start_offset": 21, 
     "position": 4 
    }, 
    { 
     "end_offset": 40, 
     "token": "di", 
     "type": "word", 
     "start_offset": 33, 
     "position": 5 
    }, 
    { 
     "end_offset": 40, 
     "token": "dis", 
     "type": "word", 
     "start_offset": 33, 
     "position": 5 
    }, 
    { 
     "end_offset": 40, 
     "token": "dise", 
     "type": "word", 
     "start_offset": 33, 
     "position": 5 
    }, 
    { 
     "end_offset": 40, 
     "token": "disea", 
     "type": "word", 
     "start_offset": 33, 
     "position": 5 
    }, 
    { 
     "end_offset": 40, 
     "token": "diseas", 
     "type": "word", 
     "start_offset": 33, 
     "position": 5 
    }, 
    { 
     "end_offset": 40, 
     "token": "disease", 
     "type": "word", 
     "start_offset": 33, 
     "position": 5 
    }, 
    { 
     "end_offset": 45, 
     "token": "wi", 
     "type": "word", 
     "start_offset": 41, 
     "position": 6 
    }, 
    { 
     "end_offset": 45, 
     "token": "wit", 
     "type": "word", 
     "start_offset": 41, 
     "position": 6 
    }, 
    { 
     "end_offset": 45, 
     "token": "with", 
     "type": "word", 
     "start_offset": 41, 
     "position": 6 
    }, 
    { 
     "end_offset": 51, 
     "token": "ea", 
     "type": "word", 
     "start_offset": 46, 
     "position": 7 
    }, 
    { 
     "end_offset": 51, 
     "token": "ear", 
     "type": "word", 
     "start_offset": 46, 
     "position": 7 
    }, 
    { 
     "end_offset": 51, 
     "token": "earl", 
     "type": "word", 
     "start_offset": 46, 
     "position": 7 
    }, 
    { 
     "end_offset": 51, 
     "token": "early", 
     "type": "word", 
     "start_offset": 46, 
     "position": 7 
    }, 
    { 
     "end_offset": 57, 
     "token": "on", 
     "type": "word", 
     "start_offset": 52, 
     "position": 8 
    }, 
    { 
     "end_offset": 57, 
     "token": "ons", 
     "type": "word", 
     "start_offset": 52, 
     "position": 8 
    }, 
    { 
     "end_offset": 57, 
     "token": "onse", 
     "type": "word", 
     "start_offset": 52, 
     "position": 8 
    }, 
    { 
     "end_offset": 57, 
     "token": "onset", 
     "type": "word", 
     "start_offset": 52, 
     "position": 8 
    } 
    ] 
} 

At index time the text is tokenized by the standard tokenizer, then the separate words are filtered by the lowercase, possessive_english and edge_ngram filters; tokens are produced only from words. At search time the text is tokenized by the standard tokenizer, then the separate words are filtered by lowercase and possessive_english. The searched words are matched against the tokens that were created during index time.
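
A quick way to check this (a sketch, again assuming the carenotes index and the 2.x _analyze syntax) is to run the search string through keyword_analyzer; it should come out as plain word tokens such as dementia, in, alz, which then match the indexed ngram tokens at consecutive positions:

    # assumed index name; keyword_analyzer is the search analyzer defined in the settings above
    curl -XGET 'localhost:9200/carenotes/_analyze?analyzer=keyword_analyzer&text=dementia+in+alz&pretty'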

Thus we make incremental search possible!

Now, because we do the ngrams on separate words, we can even execute queries like

{ 
    'query': { 
    'multi_match': { 
     'query': 'dem in alzh', 
     'type': 'phrase', 
     'fields': ['_all'] 
    } 
    } 
} 

and get correct results.

No text is "lost", everything is searchable, and there is no need to handle whitespace with the trim filter.
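
For instance, a phrase that had no usable tokens under the old approach, such as "early onset" from problem 2 above, can now be matched with a query of the same shape (a sketch, not part of the original answer; the index name is assumed):

    # "early onset" is just an example phrase taken from the problem list above
    curl -XGET 'localhost:9200/carenotes/_search?pretty' -d '
    {
      "query": {
        "multi_match": {
          "query": "early onset",
          "type": "phrase",
          "fields": ["_all"]
        }
      }
    }'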

+0

I wouldn't have had the time to provide such an elaborate solution, but I appreciate you taking the time to report your findings. At least I was able to help you spot an initial problem. Cheers! –

+0

Many thanks @trex, I had the same requirement and just set this up. –

+0

There is a syntax issue with the curly braces in the mapping; could that be why the solution doesn't work for us? – tina

6

I believe your query is wrong: while you need nGrams at indexing time, you don't need them at search time. At search time you need the text to be as "fixed" as possible. Try this query instead:

{ 
    "query": { 
    "multi_match": { 
     "query": " dementia in alz", 
     "analyzer": "keyword", 
     "fields": [ 
     "_all" 
     ] 
    } 
    } 
} 

Notice the two whitespaces found in front of dementia. Those are accounted for by your analyzer from the text. To get rid of them you need the trim token filter:

"edge_ngram_analyzer": { 
     "filter": [ 
     "lowercase","trim" 
     ], 
     "tokenizer": "edge_ngram_tokenizer" 
    } 

And then this query will work (no whitespaces in front of dementia):

{ 
    "query": { 
    "multi_match": { 
     "query": "dementia in alz", 
     "analyzer": "keyword", 
     "fields": [ 
     "_all" 
     ] 
    } 
    } 
} 
+0

I just tried it, 0 results. – trex

+0

Also, I repeated the test with the 'lowercase' filter, adding the following analyzer to the existing mapping: '"keyword_analyzer": {"filter": ["lowercase"], "tokenizer": "keyword"}'. The query was '{'query': {'multi_match': {'query': 'dementia in alz', 'analyzer': 'keyword_analyzer', 'fields': ['_all']}}}'. No success :-( – trex

+0

I would need to see a complete document in order to test the **complete** mapping of the index. Please use github –