All of these are possible by configuring or writing a custom analyzer in Elasticsearch. To answer each question in turn:
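Synonyms
Synonyms can be applied at index time, at search time, or both. Each approach has tradeoffs:
- Applying synonyms at index time gives faster searches than applying them at search time, at the cost of more disk space, lower indexing throughput, and less ease and flexibility in adding or removing synonyms (changing the list means reindexing existing documents).
- Applying synonyms at search time gives greater flexibility, at the expense of search speed.
Also consider the size of the synonym list and how often, if ever, it changes. I would try both approaches and decide which works best for your scenario and requirements.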
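As a rough sketch of the search-time option: the field is indexed with a plain analyzer and only the search analyzer expands synonyms, so the synonym list can change without reindexing documents. This assumes the same NEST 5.x/6.x API style used elsewhere in this answer and a node on localhost; the index name and the analyzer names `index_analyzer` and `search_analyzer` are made up for illustration.

```csharp
using System;
using Elasticsearch.Net;
using Nest;

// Same shape as the Message class used in the full example in this answer.
public class Message
{
    public string Content { get; set; }
}

public static class SearchTimeSynonyms
{
    public static void Main()
    {
        // Assumption: a single local node; adjust the URI for your cluster.
        var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
        var client = new ElasticClient(new ConnectionSettings(pool));

        // "search-time-synonyms" is a hypothetical index name.
        client.CreateIndex("search-time-synonyms", c => c
            .Settings(s => s
                .Analysis(a => a
                    .TokenFilters(t => t
                        .Synonym("my_synonym", sy => sy
                            .Synonyms("dap, sneaker, pump, trainer")
                        )
                    )
                    .Analyzers(an => an
                        // Used when documents are indexed: no synonym expansion.
                        .Custom("index_analyzer", ca => ca
                            .Tokenizer("standard")
                            .Filters("lowercase")
                        )
                        // Used when query text is analyzed: expands synonyms.
                        .Custom("search_analyzer", ca => ca
                            .Tokenizer("standard")
                            .Filters("lowercase", "my_synonym")
                        )
                    )
                )
            )
            .Mappings(m => m
                .Map<Message>(mm => mm
                    .Properties(p => p
                        .Text(t => t
                            .Name(n => n.Content)
                            .Analyzer("index_analyzer")
                            .SearchAnalyzer("search_analyzer")
                        )
                    )
                )
            )
        );
    }
}
```

For index-time synonyms you would instead put `my_synonym` in the analyzer used for indexing, as the full example in this answer does.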
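Singular words (shoes and shoe should be an identical match)
You may consider using stemming to reduce plural and singular words to their root form, with either an algorithmic or a dictionary-based stemmer. Perhaps start with the English Snowball stemmer and see how it works for you.
You should also consider whether you need to index the original word form as well, e.g. should an exact match on the original word rank higher than a match on its stemmed root form?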
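One way to keep the original word form alongside the stemmed one is a multi-field: the main field is stemmed and a sub-field keeps the unstemmed terms. The sketch below is only an illustration under assumptions: NEST 5.x/6.x API style, a node on localhost, and made-up names (`stemmed-and-exact`, `stemmed_analyzer`, the `exact` sub-field). The query requires a match on the stemmed field and lets a match on the unstemmed sub-field add to the score.

```csharp
using System;
using Elasticsearch.Net;
using Nest;

// Same shape as the Message class used in the full example in this answer.
public class Message
{
    public string Content { get; set; }
}

public static class StemmedAndExact
{
    private const string IndexName = "stemmed-and-exact"; // hypothetical index name

    public static void Main()
    {
        var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
        var client = new ElasticClient(new ConnectionSettings(pool).DefaultIndex(IndexName));

        client.CreateIndex(IndexName, c => c
            .Settings(s => s
                .Analysis(a => a
                    .TokenFilters(t => t
                        .Snowball("my_snowball", sb => sb.Language(SnowballLanguage.English))
                    )
                    .Analyzers(an => an
                        .Custom("stemmed_analyzer", ca => ca
                            .Tokenizer("standard")
                            .Filters("lowercase", "my_snowball")
                        )
                    )
                )
            )
            .Mappings(m => m
                .Map<Message>(mm => mm
                    .Properties(p => p
                        .Text(t => t
                            .Name(n => n.Content)
                            .Analyzer("stemmed_analyzer")   // stemmed terms
                            .Fields(f => f
                                // Sub-field "exact" keeps the unstemmed, lowercased terms.
                                .Text(tt => tt.Name("exact").Analyzer("standard"))
                            )
                        )
                    )
                )
            )
        );

        // Index some documents before searching; a new index is empty.
        var response = client.Search<Message>(s => s
            .Query(q => q
                .Bool(b => b
                    // Must match on the stemmed field ("shoes" and "shoe" both match).
                    .Must(mu => mu.Match(m => m.Field(f => f.Content).Query("shoes")))
                    // A match on the unstemmed sub-field only adds to the score.
                    .Should(sh => sh.Match(m => m.Field(f => f.Content.Suffix("exact")).Query("shoes")))
                )
            )
        );

        Console.WriteLine(response.Hits.Count);
    }
}
```

Whether the extra sub-field is worth the additional index size depends on how much you care about ranking exact forms first.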
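Small spelling mistakes, substitutions and omissions should be allowed
Consider using queries that can utilize fuzziness to handle typos and misspellings. If there are spelling errors in the indexed data itself, consider some form of sanitization before indexing. As with all data stores: Garbage In, Garbage Out :)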
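For example, a match query can be given a fuzziness setting so that terms within a small edit distance of the query terms also match. This is a minimal sketch, assuming NEST 5.x/6.x API style, a node on localhost, and an existing index named `default-index` with a `Message` type; the misspelled query text and the prefix length of 1 are arbitrary choices for illustration.

```csharp
using System;
using Elasticsearch.Net;
using Nest;

// Same shape as the Message class used in the full example in this answer.
public class Message
{
    public string Content { get; set; }
}

public static class FuzzySearch
{
    public static void Main()
    {
        var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
        var settings = new ConnectionSettings(pool).DefaultIndex("default-index");
        var client = new ElasticClient(settings);

        var response = client.Search<Message>(s => s
            .Query(q => q
                .Match(m => m
                    .Field(f => f.Content)
                    .Query("sneeker")            // deliberately misspelled
                    .Fuzziness(Fuzziness.Auto)   // allowed edits scale with term length
                    .PrefixLength(1)             // first character must match exactly
                )
            )
        );

        foreach (var hit in response.Hits)
        {
            Console.WriteLine(hit.Source.Content);
        }
    }
}
```

Bear in mind that the query text goes through the field's analyzer first, so fuzziness is applied to the stemmed terms, not the raw input.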
Ignore all stop words
Use an English Stop token filter to remove stop words.
Putting all of this together, an example analyzer might look like
    // Script-style example (e.g. LINQPad); needs the NEST package and the
    // System, Nest and Elasticsearch.Net namespaces.
    void Main()
    {
        var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
        var defaultIndex = "default-index";
        var connectionSettings = new ConnectionSettings(pool)
            .DefaultIndex(defaultIndex);

        var client = new ElasticClient(connectionSettings);

        // Start from a clean index so the example can be re-run
        if (client.IndexExists(defaultIndex).Exists)
            client.DeleteIndex(defaultIndex);

        client.CreateIndex(defaultIndex, c => c
            .Settings(s => s
                .Analysis(a => a
                    .TokenFilters(t => t
                        // English stop words plus one custom entry
                        .Stop("my_stop", st => st
                            .StopWords("_english_", "i've")
                            .RemoveTrailing()
                        )
                        .Synonym("my_synonym", st => st
                            .Synonyms(
                                "dap, sneaker, pump, trainer",
                                "soccer => football"
                            )
                        )
                        .Snowball("my_snowball", st => st
                            .Language(SnowballLanguage.English)
                        )
                    )
                    .Analyzers(an => an
                        .Custom("my_analyzer", ca => ca
                            .Tokenizer("standard")
                            // Filters run in this order; synonyms are applied
                            // to the already stemmed tokens
                            .Filters(
                                "lowercase",
                                "my_stop",
                                "my_snowball",
                                "my_synonym"
                            )
                        )
                    )
                )
            )
            .Mappings(m => m
                .Map<Message>(mm => mm
                    .Properties(p => p
                        .Text(t => t
                            .Name(n => n.Content)
                            .Analyzer("my_analyzer")
                        )
                    )
                )
            )
        );

        // Run the analyzer configured for the Content field against sample text
        client.Analyze(a => a
            .Index(defaultIndex)
            .Field<Message>(f => f.Content)
            .Text("Loving those Billy! Them is the maddest soccer trainers I've ever seen!")
        );
    }

    public class Message
    {
        public string Content { get; set; }
    }
my_analyzer produces the following tokens for the text above
{
"tokens" : [
{
"token" : "love",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "those",
"start_offset" : 7,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "billi",
"start_offset" : 13,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "them",
"start_offset" : 20,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "maddest",
"start_offset" : 32,
"end_offset" : 39,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "football",
"start_offset" : 40,
"end_offset" : 46,
"type" : "SYNONYM",
"position" : 7
},
{
"token" : "trainer",
"start_offset" : 47,
"end_offset" : 55,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "dap",
"start_offset" : 47,
"end_offset" : 55,
"type" : "SYNONYM",
"position" : 8
},
{
"token" : "sneaker",
"start_offset" : 47,
"end_offset" : 55,
"type" : "SYNONYM",
"position" : 8
},
{
"token" : "pump",
"start_offset" : 47,
"end_offset" : 55,
"type" : "SYNONYM",
"position" : 8
},
{
"token" : "ever",
"start_offset" : 61,
"end_offset" : 65,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "seen",
"start_offset" : 66,
"end_offset" : 70,
"type" : "<ALPHANUM>",
"position" : 11
}
]
}
Thanks a lot Russ, you have no idea how much this helps! (I might come back here with related questions :)) – Thomas