Shingles in Elasticsearch that respect punctuation

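I have built an address matching engine for UK addresses in Elasticsearch and have found shingles very useful, but I am seeing some problems where punctuation is involved. A query for "4 Walmley Close" returns the following matches: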

1. Units 3 And 4, Walmley Chambers, 3 Walmley Close
2. Flat 4, Walmley Court, 10 Walmley Close
3. Co-operative Retail Services Ltd, 4 Walmley Close

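The true match is number 3, but 1 and 2 also match (incorrectly) because both produce the shingle '4 walmley'. I would like to tell the shingle analyzer not to generate shingles that straddle a comma. So for address 1), for example, I currently get: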

units 3
3 and
and 4
4 walmley
walmley chambers
chambers 3
3 walmley
walmley close

...when what I actually want is...

units 3
3 and
and 4
walmley chambers
3 walmley
walmley close
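
To make the difference between the two lists concrete, here is a minimal Python sketch that imitates both behaviours outside Elasticsearch. It is not how the shingle filter is implemented: the helper names are made up for illustration, the regex is only a rough stand-in for the standard tokenizer, and splitting on commas is an assumption about what "not straddling a comma" should mean.

```python
import re

def bigrams(tokens):
    """Two-word shingles, like a shingle filter with output_unigrams=false."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def standard_shingles(text):
    # Rough stand-in for standard tokenizer + lowercase: punctuation is dropped,
    # so words on either side of a comma end up adjacent.
    return bigrams(re.findall(r"\w+", text.lower()))

def bigrams_within_segments(text):
    # Hypothetical desired behaviour: shingle each comma-separated segment on its own.
    out = []
    for segment in text.split(","):
        out.extend(bigrams(re.findall(r"\w+", segment.lower())))
    return out

address = "Units 3 And 4, Walmley Chambers, 3 Walmley Close"
current = standard_shingles(address)
wanted = bigrams_within_segments(address)
print(current)
print(wanted)
```

The only shingles that differ between `current` and `wanted` are the ones that span a comma, '4 walmley' and 'chambers 3'.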

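My current settings are below. I have tried switching the tokenizer from standard to whitespace. That helps because it keeps the commas and so could avoid the situation above (i.e. I end up with '4, walmley' as the shingle in addresses 1 and 2), but it leaves a large number of unusable shingles in my index, and with 70 million documents I need to keep the index size down.

As you can see in the index settings, I also have a street_sym filter that I would like to use in my shingles. For this example, in addition to 'walmley close' I would like to produce 'walmley cl', but when I tried to include the filter I got shingles such as 'close cl', which are not much help!

Any advice from more experienced Elasticsearch users would be greatly appreciated. I have read Gormley and Tong's excellent book but cannot get my head around this problem.

Thanks in advance for any help you can offer.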

"analysis": { 
    "filter": { 
     "shingle": { 
      "type": "shingle", 
      "output_unigrams": false 
     }, 
     "street_sym": { 
      "type": "synonym", 
      "synonyms": [ 
       "st => street", 
       "rd => road", 
       "ave => avenue", 
       "ct => court", 
       "ln => lane", 
       "terr => terrace", 
       "cir => circle", 
       "hwy => highway", 
       "pkwy => parkway", 
       "cl => close", 
       "blvd => boulevard", 
       "dr => drive", 
       "ste => suite", 
       "wy => way", 
       "tr => trail" 
      ] 
     } 
    }, 
    "analyzer": { 
     "shingle": { 
      "type": "custom", 
      "tokenizer": "standard", 
      "filter": [ 
       "lowercase", 
       "shingle" 
      ] 
     } 
    } 
}


2015-06-15 Mark V

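Even if you get only the shingles you want, "4 Walmley Close" will still match all three addresses, because it is tokenized into "4 walmley" and "walmley close", and the latter still appears in all three.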
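
The point made in this comment can be checked with a small Python sketch. It assumes the comma-respecting shingles are already in place; the `shingles` helper is hypothetical and only imitates that tokenization, it is not Elasticsearch code. The sketch simply lists which query shingles each address shares.

```python
import re

def shingles(text):
    """Two-word shingles that never cross a comma (illustrative only)."""
    out = set()
    for segment in text.lower().split(","):
        words = re.findall(r"\w+", segment)
        out.update(f"{a} {b}" for a, b in zip(words, words[1:]))
    return out

addresses = [
    "Units 3 And 4, Walmley Chambers, 3 Walmley Close",
    "Flat 4, Walmley Court, 10 Walmley Close",
    "Co-operative Retail Services Ltd, 4 Walmley Close",
]
query = shingles("4 Walmley Close")  # {"4 walmley", "walmley close"}

# Shingles each address has in common with the query.
shared = [sorted(query & shingles(a)) for a in addresses]
for address, common in zip(addresses, shared):
    print(address, "->", common)
```

Every address shares 'walmley close' with the query, so all three still match; only the third also shares '4 walmley'.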

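See my comment on the question for why this solution still won't stop "4 Walmley Close" from matching all three of the addresses you listed. It does at least give you the tokens you want. I don't know whether this is the most elegant or performant solution, but applying Pattern Replace, Pattern Capture and Length filters to your shingles seems to do the trick: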

"analysis": { 
    "filter": { 
     "shingle": { 
      "type": "shingle", 
      "output_unigrams": false 
     }, 
     "street_sym": { 
      "type": "synonym", 
      "synonyms": [ 
       "st => street", 
       "rd => road", 
       "ave => avenue", 
       "ct => court", 
       "ln => lane", 
       "terr => terrace", 
       "cir => circle", 
       "hwy => highway", 
       "pkwy => parkway", 
       "cl => close", 
       "blvd => boulevard", 
       "dr => drive", 
       "ste => suite", 
       "wy => way", 
       "tr => trail" 
      ] 
     }, 
     "no_middle_comma": { 
      "type": "pattern_replace", 
      "pattern": ".+,.+", 
      "replacement": "" 
     }, 
     "no_trailing_comma": { 
      "type": "pattern_capture", 
      "preserve_original": false, 
      "patterns": [ 
       "(.*)," 
      ] 
     }, 
     "not_empty": { 
      "type": "length", 
      "min": 1 
     } 
    }, 
    "analyzer": { 
     "test": { 
      "type": "custom", 
      "tokenizer": "whitespace", 
      "filter": [ 
       "lowercase", 
       "street_sym", 
       "shingle", 
       "no_middle_comma", 
       "no_trailing_comma", 
       "not_empty" 
      ] 
     } 
    } 
}

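no_middle_comma: replaces any token with a comma in the middle with an empty token
no_trailing_comma: replaces any token that ends with a comma with the part before the comma
not_empty: removes any empty tokens produced by the filters above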
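
The effect of these three filters can be imitated in plain Python. This is only a sketch of the idea, not the Lucene implementation: it assumes whitespace tokenization and two-word shingles, and it assumes that a token which does not match the capture pattern passes through unchanged. The function names are invented for this example, and the synonym step is left out.

```python
import re

def whitespace_shingles(text):
    # whitespace tokenizer + lowercase + two-word shingles; commas stay attached
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def no_middle_comma(token):
    # pattern_replace with ".+,.+": a comma with text on both sides blanks the token
    return re.sub(r".+,.+", "", token)

def no_trailing_comma(token):
    # pattern_capture with "(.*),": keep the part before the comma;
    # tokens without a comma are assumed to pass through unchanged
    match = re.search(r"(.*),", token)
    return match.group(1) if match else token

def analyze(text):
    tokens = [no_trailing_comma(no_middle_comma(t)) for t in whitespace_shingles(text)]
    return [t for t in tokens if len(t) >= 1]  # length filter, min = 1

result = analyze("Units 3 And 4, Walmley Chambers, 3 Walmley Close")
print(result)
```

Shingles such as '4, walmley' are blanked and then dropped, while 'and 4,' and 'walmley chambers,' only lose their trailing comma.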

For example, "Units 3 And 4, Walmley Chambers, 3 Walmley Cl" becomes:

{ 
    "tokens": [ 
     { 
     "token": "units 3", 
     "start_offset": 0, 
     "end_offset": 7, 
     "type": "shingle", 
     "position": 0 
     }, 
     { 
     "token": "3 and", 
     "start_offset": 6, 
     "end_offset": 11, 
     "type": "shingle", 
     "position": 1 
     }, 
     { 
     "token": "and 4", 
     "start_offset": 8, 
     "end_offset": 14, 
     "type": "shingle", 
     "position": 2 
     }, 
     { 
     "token": "walmley chambers", 
     "start_offset": 15, 
     "end_offset": 32, 
     "type": "shingle", 
     "position": 4 
     }, 
     { 
     "token": "3 walmley", 
     "start_offset": 33, 
     "end_offset": 42, 
     "type": "shingle", 
     "position": 6 
     }, 
     { 
     "token": "walmley close", 
     "start_offset": 35, 
     "end_offset": 45, 
     "type": "shingle", 
     "position": 7 
     } 
    ] 
}

Note that your synonym filter works as well: "Walmley Cl" has become "Walmley Close".


2015-12-31 01:03:52
