我有一个站点托管在我的本地机器上,我试图抓取Solr中的Nutch和索引(两者都在我的本地机器上)。我根据Nutch网站(http://wiki.apache.org/nutch/NutchTutorial)上的说明安装了Solr 4.6.1和Nutch 1.7,并且我的Solr在我的浏览器中正常运行。Nutch抓取失败后Solr索引失败,报告“作业失败”
我运行下面的命令:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 1 -topN 2
爬网工作正常,但是当它attemps把数据放入Solr的,它失败,出现以下的输出:
Indexer: starting at 2014-02-06 16:29:28
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
我去了Nutch的原木目录和尾的hadoop.log文件,它显示了这一点:
2014-02-06 16:29:28,920 INFO solr.SolrIndexWriter - Indexing 1 documents
2014-02-06 16:29:28,921 INFO httpclient.HttpMethodDirector - I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server localhost failed to respond
2014-02-06 16:29:28,921 INFO httpclient.HttpMethodDirector - Retrying request
2014-02-06 16:29:28,924 WARN mapred.LocalJobRunner - job_local331896790_0009
java.io.IOException
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.makeIOException(SolrIndexWriter.java:173)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:159)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:118)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:467)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:535)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:478)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:155)
... 6 more
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:168)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:422)
然而,我仍然能够在我的浏览器中访问Solr就好了。这是我在Solr/Nutch上的第一次尝试 - 任何有更多知识的人的帮助都会受到大大的赞赏。谢谢。
我尝试过,但它抛出同样的错误,你可以帮助周围对我的作品。我使用的是solr 4.8和nutch 1.12 – ammu