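Analyzers in ElasticSearch not working
I'm using ElasticSearch to store the Tweets I receive from the Twitter Streaming API. Before storing them, I'd like to apply English stemming to the Tweet content. To do that, I'm trying to use ElasticSearch analyzers, but with no luck.
This is the template I'm currently using: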
PUT _template/twitter
{
"template": "139*",
"settings" : {
"index":{
"analysis":{
"analyzer":{
"english":{
"type":"custom",
"tokenizer":"standard",
"filter":["lowercase", "en_stemmer", "stop_english", "asciifolding"]
}
},
"filter":{
"stop_english":{
"type":"stop",
"stopwords":["_english_"]
},
"en_stemmer" : {
"type" : "stemmer",
"name" : "english"
}
}
}
}
},
"mappings": {
"tweet": {
"_timestamp": {
"enabled": true,
"store": true,
"index": "analyzed"
},
"_index": {
"enabled": true,
"store": true,
"index": "analyzed"
},
"properties": {
"geo": {
"properties": {
"coordinates": {
"type": "geo_point"
}
}
},
"text": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
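When I start streaming and the index gets created, all the mappings I defined seem to be applied correctly, but the text is stored exactly as it comes from Twitter, completely raw. The index metadata shows: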
"settings" : {
"index" : {
"uuid" : "xIOkEcoySAeZORr7pJeTNg",
"analysis" : {
"filter" : {
"en_stemmer" : {
"type" : "stemmer",
"name" : "english"
},
"stop_english" : {
"type" : "stop",
"stopwords" : [
"_english_"
]
}
},
"analyzer" : {
"english" : {
"type" : "custom",
"filter" : [
"lowercase",
"en_stemmer",
"stop_english",
"asciifolding"
],
"tokenizer" : "standard"
}
}
},
"number_of_replicas" : "1",
"number_of_shards" : "5",
"version" : {
"created" : "1010099"
}
}
},
"mappings" : {
"tweet" : {
[...]
"text" : {
"analyzer" : "english",
"type" : "string"
},
[...]
}
}
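What am I doing wrong? The analyzer seems to be applied correctly, but nothing happens :/
Thanks!
PS: This is the search query I use to see that the analyzer is not being applied: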
curl -XGET 'http://localhost:9200/_all/_search?pretty' -d '{
"query": {
"filtered": {
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "_index:1397574496990"
}
}
]
}
},
"filter": {
"bool": {
"must": [
{
"match_all": {}
},
{
"exists": {
"field": "geo.coordinates"
}
}
]
}
}
}
},
"fields": [
"geo.coordinates",
"text"
],
"size": 50000
}'
This returns the text of the tweet as one of the fields, but the response is:
{
"took": 29,
"timed_out": false,
"_shards": {
"total": 47,
"successful": 47,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.97402453,
"hits": [
{
"_index": "1397574496990",
"_type": "tweet",
"_id": "456086643423068161",
"_score": 0.97402453,
"fields": {
"geo.coordinates": [
-118.21122533,
33.79349318
],
"text": [
"Happy turtle Tuesday ! The week is slowly crawling to Wednesday good morning everyone ☀️#turtles… http://t.co/wAVmcxnf76"
]
}
},
{
"_index": "1397574496990",
"_type": "tweet",
"_id": "456086701451259904",
"_score": 0.97333175,
"fields": {
"geo.coordinates": [
-81.017636,
33.998741
],
"text": [
"Tuesday is Twins Day over here, apparently (it's a far too often occurrence) #tuesdaytwinsday… http://t.co/Umhtp6SoX6"
]
}
}
]
}
}
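The text field is identical to what Twitter sends (I'm using the streaming API). What I expected was for the text field to come back stemmed once the analyzer had been applied.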
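What do you mean by "nothing happens"? Analyzers don't affect how data is stored. They only affect how data is indexed. Did you try searching on the analyzed field to see whether stemming worked? Did you try using the [analyze](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html) API to see whether your analyzer is being applied? – imotov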
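To make the stored-versus-indexed distinction concrete, here is a toy Python model. It is not Elasticsearch code: `toy_stem`, `STOPWORDS`, and `ToyIndex` are made-up stand-ins (a crude suffix stripper instead of the real English stemmer), and the only thing taken from the question is the filter order (lowercase, stem, stop). It sketches why a search for "crawl" can match a tweet containing "crawling" while the text that comes back is still the original.

```python
import re

# Toy stand-ins, NOT Elasticsearch internals: a tiny stopword list and a
# crude suffix stripper in place of the real English stemmer.
STOPWORDS = {"the", "is", "to", "a", "an", "of"}


def toy_stem(token):
    """Strip a few common English suffixes (illustrative only)."""
    for suffix in ("ing", "ly", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, lowercase, stem, then drop stopwords (order as in the template)."""
    tokens = re.findall(r"\w+", text.lower())
    stemmed = [toy_stem(t) for t in tokens]
    return [t for t in stemmed if t not in STOPWORDS]


class ToyIndex:
    def __init__(self):
        self.sources = {}   # doc id -> original text, returned verbatim
        self.postings = {}  # analyzed term -> set of doc ids, used for matching

    def index(self, doc_id, text):
        self.sources[doc_id] = text          # stored untouched
        for term in analyze(text):           # only the index sees analyzed terms
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, query):
        """Analyze the query the same way, then return the stored originals."""
        ids = set()
        for term in analyze(query):
            ids |= self.postings.get(term, set())
        return [self.sources[i] for i in sorted(ids)]


idx = ToyIndex()
idx.index("1", "The week is slowly crawling to Wednesday")
print(idx.search("crawl"))    # the hit is the raw, unstemmed text
print(sorted(idx.postings))   # the terms the index actually holds
```

The point of the sketch is only the split between `sources` and `postings`: reading a field back shows the former, while matching uses the latter.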
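Using the analyze API with my custom analyzer works, but when I try to retrieve the field "text" with a GET query the analyzer is not applied, so there must be something I'm doing wrong :/ – Cea33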
Could you add an example of the data you are trying to search and the search query that doesn't work? – imotov
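For reference, the two checks mentioned in the comments (the analyze API, and a search on the analyzed field) could be issued roughly as below. This is a sketch under assumptions: the host and index name are copied from the question, the analyze call uses the 1.x style where `analyzer` is a URL parameter and the text is the request body (the index metadata reports version 1010099; later releases expect a JSON body instead), and the helper names `analyze_request` and `match_request` are made up for illustration. The script only prints curl commands and does not contact a server.

```python
import json
from urllib.parse import quote

BASE = "http://localhost:9200"   # host taken from the question's curl command
INDEX = "1397574496990"          # index name taken from the question


def analyze_request(text, analyzer="english"):
    """URL and body for the analyze API, 1.x style (raw text sent as the body)."""
    url = f"{BASE}/{quote(INDEX)}/_analyze?analyzer={quote(analyzer)}&pretty"
    return url, text


def match_request(term, field="text"):
    """URL and JSON body for a match query against the analyzed field."""
    url = f"{BASE}/{quote(INDEX)}/_search?pretty"
    body = json.dumps({"query": {"match": {field: term}}})
    return url, body


# Print the equivalent curl commands instead of sending anything.
for url, body in (analyze_request("slowly crawling"), match_request("crawl")):
    print(f"curl -XGET '{url}' -d '{body}'")
```

If stemming is applied at index time, the match query for "crawl" should return the tweet containing "crawling", even though the `text` value in the hit is still the original string.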