Filter by, group by unique field value, sum aggregation, order by in an Elasticsearch query chain

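I'd like to issue a query in Elasticsearch that filters, groups by a field, aggregates by sum, and orders the result. I have two questions: what should the query look like, and what are the performance implications for Elasticsearch?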

Let me give a sample data set to support my question. Say I have a set of sales:

document type: 'sales' with the following fields and data: 
sale_datetime | sold_product | sold_at_price 
-----------------|---------------|-------------- 
2015-11-24 12:00 | some product | 100 
2015-11-24 12:30 | some product | 100 
2015-11-24 12:30 | other product | 100 
2015-11-24 13:00 | other product | 100 
2015-11-24 12:30 | some product | 200 
2015-11-24 13:00 | some product | 200

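I'd like to issue a query that:

takes into account only sales within the time interval from 2015-11-24 12:15 to 2015-11-24 12:45
groups the results by the sold_product field
calculates the sum of the sold_at_price values for each product
returns the rows ordered so that the product with the largest sum of sold_at_price comes first, then the second largest, and so on.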

Applied to the sample data set above, it would return the following result:

sold_product | sum of sold_at_price 
--------------|-------------- 
some product | 300  // takes into account rows 2 and 5 
other product | 100  // takes into account row 3
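
To make the intended semantics concrete, here is a minimal Python sketch that applies the same four steps (filter, group, sum, order) to the sample rows in memory. It has nothing to do with Elasticsearch itself; the function name top_products is made up for illustration, and the time bounds are treated as exclusive.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows from the table above: (sale_datetime, sold_product, sold_at_price)
SALES = [
    ("2015-11-24 12:00", "some product", 100),
    ("2015-11-24 12:30", "some product", 100),
    ("2015-11-24 12:30", "other product", 100),
    ("2015-11-24 13:00", "other product", 100),
    ("2015-11-24 12:30", "some product", 200),
    ("2015-11-24 13:00", "some product", 200),
]

FMT = "%Y-%m-%d %H:%M"


def top_products(rows, start, end):
    """Filter by time range, group by product, sum prices, sort by sum descending."""
    lo = datetime.strptime(start, FMT)
    hi = datetime.strptime(end, FMT)
    totals = defaultdict(int)
    for when, product, price in rows:
        if lo < datetime.strptime(when, FMT) < hi:  # exclusive bounds
            totals[product] += price
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


print(top_products(SALES, "2015-11-24 12:15", "2015-11-24 12:45"))
```

With the sample data this yields some product with 300 first and other product with 100 second, matching the table above.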

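If it is possible to issue such a query, what are the significant performance implications for Elasticsearch? In case it matters, consider that:

there are a lot of unique products (hundreds of thousands, potentially millions in the future)
product names can contain multiple (dozens of) words/terms (it is possible to generate a unique product name that consists of a single word, but that almost doubles the amount of data)
there are usually a lot of records (millions) that satisfy the time range filter (in some cases the filter can be narrowed down to tens of thousands of records within a given time range, but this is not guaranteed)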

Thanks in advance for your help!

2015-11-24 Andrew

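This is a typical use case for aggregations. We first create an index and model the mapping for the data. We have a normal date field for sold_datetime, a numeric field for sold_at_price and a multi-field of type string for sold_product. You'll notice that this multi-field has a sub-field called raw which is not_analyzed and will be used to create the aggregation on the product names: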

curl -XPUT localhost:9200/sales -d '{
  "mappings": {
    "sale": {
      "properties": {
        "sold_datetime": {
          "type": "date"
        },
        "sold_product": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "sold_at_price": {
          "type": "double"
        }
      }
    }
  }
}'

Now, let's index the sample data set into the new index using the _bulk endpoint:

curl -XPOST localhost:9200/sales/sale/_bulk -d ' 
{"index": {}} 
{"sold_datetime": "2015-11-24T12:00:00.000Z", "sold_product":"some product", "sold_at_price": 100} 
{"index": {}} 
{"sold_datetime": "2015-11-24T12:30:00.000Z", "sold_product":"some product", "sold_at_price": 100} 
{"index": {}} 
{"sold_datetime": "2015-11-24T12:30:00.000Z", "sold_product":"other product", "sold_at_price": 100} 
{"index": {}} 
{"sold_datetime": "2015-11-24T13:00:00.000Z", "sold_product":"other product", "sold_at_price": 100} 
{"index": {}} 
{"sold_datetime": "2015-11-24T12:30:00.000Z", "sold_product":"some product", "sold_at_price": 200} 
{"index": {}} 
{"sold_datetime": "2015-11-24T13:00:00.000Z", "sold_product":"some product", "sold_at_price": 200} 
'
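
If the documents come from a program rather than a hand-written file, the bulk body can be generated. The sketch below only builds the newline-delimited payload (one action line followed by one source line per document, with a trailing newline); it does not send anything. The helper name to_bulk_body is hypothetical.

```python
import json


def to_bulk_body(docs):
    """Build a _bulk payload: an {"index": {}} action line before each document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    # The bulk endpoint expects the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"sold_datetime": "2015-11-24T12:00:00.000Z", "sold_product": "some product", "sold_at_price": 100},
    {"sold_datetime": "2015-11-24T12:30:00.000Z", "sold_product": "other product", "sold_at_price": 100},
]
bulk_body = to_bulk_body(docs)
print(bulk_body, end="")
```

The resulting string could be saved to a file and posted with curl using --data-binary, which keeps the newlines intact.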

Finally, let's create the query and the aggregation you need:

curl -XPOST localhost:9200/sales/sale/_search -d '{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "sold_datetime": {
            "gt": "2015-11-24T12:15:00.000Z",
            "lt": "2015-11-24T12:45:00.000Z"
          }
        }
      }
    }
  },
  "aggs": {
    "sold_products": {
      "terms": {
        "field": "sold_product.raw",
        "order": {
          "total": "desc"
        }
      },
      "aggs": {
        "total": {
          "sum": {
            "field": "sold_at_price"
          }
        }
      }
    }
  }
}'

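As you can see, we're filtering on the sold_datetime field for a specific date interval (November 24th, 12:15-12:45). The aggregation part defines a terms aggregation on the sold_product.raw field, and for each bucket we sum the values of the sold_at_price field.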

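Note that if you have several million documents that may match, then in order to make this performant you need to apply the most aggressive filter first, maybe the identifier of the business for which you're running the query, or some other criterion that will exclude as many documents as possible before running the aggregation.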
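
As a rough sketch of what an additional filter could look like, the Python snippet below builds the request body with the filters combined in a bool query. The filtered query was deprecated in Elasticsearch 2.0 and removed in 5.0, so on newer versions the bool form is the one to use. The business_id field and the build_search_body helper are hypothetical, not part of the mapping above. The terms aggregation returns only the top 10 buckets by default, so the sketch also sets size explicitly.

```python
import json


def build_search_body(start, end, extra_filters=(), top_n=10):
    """Build the search body: bool filters, then terms buckets ordered by summed price."""
    filters = list(extra_filters)
    filters.append({"range": {"sold_datetime": {"gt": start, "lt": end}}})
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "query": {"bool": {"filter": filters}},
        "aggs": {
            "sold_products": {
                "terms": {
                    "field": "sold_product.raw",
                    "size": top_n,  # number of product buckets to return
                    "order": {"total": "desc"},
                },
                "aggs": {"total": {"sum": {"field": "sold_at_price"}}},
            }
        },
    }


body = build_search_body(
    "2015-11-24T12:15:00.000Z",
    "2015-11-24T12:45:00.000Z",
    extra_filters=[{"term": {"business_id": 42}}],  # hypothetical field and value
    top_n=20,
)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to the same _search endpoint in place of the body shown earlier.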

The results look like this:

{
  ...
  "aggregations" : {
    "sold_products" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "some product",
        "doc_count" : 2,
        "total" : {
          "value" : 300.0
        }
      }, {
        "key" : "other product",
        "doc_count" : 1,
        "total" : {
          "value" : 100.0
        }
      } ]
    }
  }
}
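
To turn that response into the two-column result from the question, something along these lines should work. It is a minimal sketch operating on the already-parsed JSON; the response dict below just copies the relevant part of the output above, and buckets_to_rows is a made-up helper name.

```python
# Relevant part of the search response shown above, as a parsed dict.
response = {
    "aggregations": {
        "sold_products": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "some product", "doc_count": 2, "total": {"value": 300.0}},
                {"key": "other product", "doc_count": 1, "total": {"value": 100.0}},
            ],
        }
    }
}


def buckets_to_rows(resp, agg_name="sold_products", metric="total"):
    """Return (product, summed price) pairs in the order Elasticsearch returned them."""
    buckets = resp["aggregations"][agg_name]["buckets"]
    return [(bucket["key"], bucket[metric]["value"]) for bucket in buckets]


rows = buckets_to_rows(response)
for product, total in rows:
    print(f"{product} | {total}")
```

A non-zero sum_other_doc_count would mean that some matching documents fell into buckets that were not returned, which is worth checking once there are more products than the terms size.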

2015-11-25 05:08:46 Val

Thanks! That's what I needed. I'll think about how to apply more filters to reduce the total number of records being processed. – Andrew

Glad to help! – Val
