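Elasticsearch Parse Exception error when attempting to index PDF
I'm just getting started with elasticsearch. Our requirements call for indexing thousands of PDF files, and I'm having a hard time getting even one of them indexed successfully.
I installed the Attachment Type plugin and got the response: Installed mapper-attachments.
I followed the Attachment Type in Action tutorial, but the process hangs and I don't know how to interpret the error message. I also tried the gist, which hangs in the same place.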
$ curl -X POST "localhost:9200/test/attachment/" -d json.file
{"error":"ElasticSearchParseException[Failed to derive xcontent from (offset=0, length=9): [106, 115, 111, 110, 46, 102, 105, 108, 101]]","status":400}
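More details:
The json.file contains an embedded Base64 PDF file (as per the instructions). The first line of the file appears correct (to me, anyway): {"file":"JVBERi0xLjQNJeLjz9MNCjE1OCAwIG9iaiA8
...
I'm not sure whether json.file is invalid, or whether elasticsearch simply isn't set up correctly to parse PDFs?!?
Encoding - here's how we encode the PDF into json.file (as per the tutorial):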
coded=`cat fn6742.pdf | perl -MMIME::Base64 -ne 'print encode_base64($_)'`
json="{\"file\":\"${coded}\"}"
echo "$json" > json.file
Also tried:
coded=`openssl base64 -in fn6742.pdf`
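One thing worth checking in the encoding step: `perl -ne` encodes the PDF one input line at a time, and both `encode_base64` and `openssl base64` wrap their output across multiple lines by default, so json.file may end up with line breaks inside the JSON string. A minimal sketch of a single-line encoding follows, assuming a `base64` command-line utility is available. The function name `encode_pdf_to_json` is hypothetical and not from the tutorial, and whether this has any bearing on the hang is not established here.

```shell
# Hypothetical helper: wrap a file's Base64 encoding in {"file":"..."} on one line.
encode_pdf_to_json() {
  in_file=$1
  out_file=$2
  # Encode the whole file as a single stream, then drop the line wrapping.
  coded=$(base64 < "$in_file" | tr -d '\n')
  # printf avoids echo's platform-dependent escape handling.
  printf '{"file":"%s"}\n' "$coded" > "$out_file"
}

# Example usage with the file name from the question:
# encode_pdf_to_json fn6742.pdf json.file
```

Encoding the file as one stream also avoids the padding characters that per-line encoding can leave in the middle of the output.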
Log:
[2012-06-07 12:32:16,742][DEBUG][action.index ] [Bailey, Paul] [test][0], node[AHLHFKBWSsuPnTIRVhNcuw], [P], s[STARTED]: Failed to execute [index {[test][attachment][DauMB-vtTIaYGyKD4P8Y_w], source[json.file]}]
org.elasticsearch.ElasticSearchParseException: Failed to derive xcontent from (offset=0, length=9): [106, 115, 111, 110, 46, 102, 105, 108, 101]
at org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:147)
at org.elasticsearch.common.xcontent.XContentHelper.createParser(XContentHelper.java:50)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:451)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:437)
at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:290)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:210)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
Hoping someone can help me see what I'm missing or doing wrong?
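The byte values in the exception can be decoded to see what Elasticsearch actually received as the request body. The sketch below does only that. It assumes the list holds ASCII codes, and the variable names are illustrative.

```shell
# Byte values copied from the "Failed to derive xcontent" message.
bytes="106 115 111 110 46 102 105 108 101"

decoded=""
for b in $bytes; do
  # Convert each decimal value to a three-digit octal escape and let printf emit the character.
  decoded="${decoded}$(printf "\\$(printf '%03o' "$b")")"
done

echo "$decoded"   # prints: json.file
```

The nine bytes are the literal text json.file, so curl sent the file name rather than the file's contents. curl reads the body from a file only when the name is prefixed with @, as in `curl -X POST "localhost:9200/test/attachment/" -d @json.file`. `--data-binary @json.file` does the same without stripping newlines.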
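Ah, you're right! Thanks for your help! However, now that I've tried adding '@' before the filename, it just hangs with no output to the log?!? I need to ctrl-C to get my shell back. Any ideas? Maybe a way to make the logs more helpful? – Meltemi
Can you run jstack and see where it hangs? – imotov
I have the same error. Thanks! – ssoto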