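Segmentation fault while scrolling in Elasticsearch using the Python API

I'm using the Elasticsearch Python API to compute some things from data stored in an ES cluster. For my computation I need to fetch all documents that satisfy certain conditions and extract some information from them. So I'm scrolling with a size of 1000 and a scroll duration of 1 second. I wrote a Python script that uses ES-Python to do this for me.

However, after a little more than 1400 scrolls the script always exits with the error "Segmentation fault (core dumped)". I tried increasing the scroll size to 10000, but the same problem still occurs. Here is the part of the script where I do the scrolling: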

from elasticsearch import Elasticsearch

query = {
    "_source": ["_id", "@timestamp", my_field],
    "query": {"bool": {"must": [
        {"exists": {"field": my_field}},
        {"exists": {"field": "@timestamp"}},
    ]}},
}
page = Elasticsearch().search(index = my_index, scroll = "1s", size = 1000, body = query)
sid = page['_scroll_id']
scroll_size = page['hits']['total']
while scroll_size > 0:
    print("Scrolling...")
    # Get the number of results returned by the last scroll
    scroll_size = len(page['hits']['hits'])
    print("scroll size: " + str(scroll_size))
    page = Elasticsearch().scroll(scroll_id = sid, scroll = '1s')
    # Update the scroll ID
    sid = page['_scroll_id']

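I was able to work out that the line page = Elasticsearch().scroll(scroll_id = sid, scroll = '1s') is responsible for the error. I have checked the scroll ID, and it is always the same (at least until the error is thrown). Has anyone run into a similar problem, or does anyone know how to solve it?

I'm running both Python and Elasticsearch on the same server, on Ubuntu 14.04. The Python version is 2.7.6 and the ES version is 5.0.0.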


2017-01-23 mshabeeb

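Have you considered using the scan helper? (http://elasticsearch-py.readthedocs.io/en/master/helpers.html#elasticsearch.helpers.scan) – iCart

I didn't know about it before. Is there a working example of the scan helper? I've been trying, but I can't figure out how it works. – mshabeeb

(Posting this as an answer because code formatting doesn't work in comments)

Try something like this: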

import elasticsearch
import elasticsearch.helpers

scanner = elasticsearch.helpers.scan(client=elasticsearch.Elasticsearch(), index=my_index, query={...}, scroll='1s')
for doc in scanner:
    # Do something with each hit
    pass
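
For readers who, like the asker, are unsure what the scan helper does, here is a simplified sketch of the general idea. It is not the library's actual implementation: `iter_hits` and its parameters are hypothetical names, and it only assumes a client that exposes `search`, `scroll` and `clear_scroll` the way the elasticsearch-py client does.

```python
def iter_hits(client, index, query, page_size=1000, keep_alive="1m"):
    """Yield hits one at a time, paging through the results with the scroll API.

    Hypothetical helper for illustration only; elasticsearch.helpers.scan
    is the maintained equivalent.
    """
    page = client.search(index=index, body=query,
                         scroll=keep_alive, size=page_size)
    scroll_id = page.get("_scroll_id")
    try:
        # An empty page of hits means the scroll is exhausted.
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                yield hit
            page = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            # The scroll ID may change between pages, so always use the latest one.
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side scroll context once iteration stops.
        if scroll_id:
            client.clear_scroll(scroll_id=scroll_id)
```

Because it is a generator, only one page of hits is held at a time, provided the caller does not collect everything it yields into a list.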


2017-01-23 13:03:31 iCart

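Thanks for the tip! In the end it had nothing to do with the scan API but with what I was doing inside the loop: I was saving the data retrieved from ES into an array that grew on every iteration, so at some point memory ran out. – mshabeeb

In the end I found that it had nothing to do with scrolling in ES; it was a memory problem. Inside the loop, I was saving the output from ES into an array that grew with every iteration, so at some point the memory limit was reached.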
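
One way to avoid that kind of growth is to fold each hit into a running result instead of appending it to a list. The sketch below is only an illustration: the thread does not say what was actually being computed, so the count and mean here are stand-ins, and `summarize` and `field` are hypothetical names. It accepts any iterable of hits shaped like the dictionaries that `elasticsearch.helpers.scan` yields.

```python
def summarize(hits, field):
    """Return (count, mean) of a numeric field without storing the hits.

    Hypothetical example: the real computation is not described in the
    thread, so count/mean are placeholders for it.
    """
    count = 0
    total = 0.0
    for hit in hits:
        value = hit["_source"].get(field)
        if value is None:
            continue  # skip documents that lack the field
        count += 1
        total += value
    mean = total / count if count else None
    return count, mean
```

With a real cluster, the iterable could be the scanner from the answer above, for example `summarize(scanner, my_field)`, so memory use stays roughly constant no matter how many documents match.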


2017-01-23 15:42:27 mshabeeb
