2016-11-01 45 views

I am using Apache Nutch 1.12 to crawl data on the Internet and Apache Solr 6.2.1 to index it, and the combination fails with the error: java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down

Following what I learned from the Nutch tutorial (https://wiki.apache.org/nutch/NutchTutorial), I did the following:

  • Copied Nutch's schema.xml and placed it in Solr's config folder
  • Placed a seed URL (a newspaper company's site) in Nutch's urls/seed.txt
  • Changed the http.content.limit value to "-1" in nutch-site.xml. Since the seed URL is a newspaper company's site, I had to eliminate the HTTP content download size limit
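For reference, the http.content.limit change in the last step corresponds to a property block along these lines in conf/nutch-site.xml (the property name comes from the step above; the description text is my own wording):

```xml
<property>
  <name>http.content.limit</name>
  <!-- -1 removes the limit on the size of downloaded content,
       needed here because newspaper pages can be large -->
  <value>-1</value>
</property>
```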

When I run the following command, I get an error:

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TSolr urls/ TestCrawl/ 2 

Above, TSolr is, as you may have guessed, just the name of the Solr core.

I have pasted the error portion of hadoop.log below:

2016-10-28 16:21:20,982 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: TestCrawl/crawldb 
2016-10-28 16:21:20,982 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: TestCrawl/linkdb 
2016-10-28 16:21:20,982 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: TestCrawl/segments/20161028161642 
2016-10-28 16:21:46,353 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/staging/btaek1281422650/.staging/job_local1281422650_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 
2016-10-28 16:21:46,355 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/staging/btaek1281422650/.staging/job_local1281422650_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 
2016-10-28 16:21:46,415 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/local/localRunner/btaek/job_local1281422650_0001/job_local1281422650_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 
2016-10-28 16:21:46,416 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/local/localRunner/btaek/job_local1281422650_0001/job_local1281422650_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 
2016-10-28 16:21:46,565 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off 
2016-10-28 16:21:52,308 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter 
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: content dest: content 
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: title dest: title 
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: host dest: host 
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: segment dest: segment 
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: boost dest: boost 
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: digest dest: digest 
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: tstamp dest: tstamp 
2016-10-28 16:21:52,424 INFO solr.SolrIndexWriter - Indexing 42/42 documents 
2016-10-28 16:21:52,424 INFO solr.SolrIndexWriter - Deleting 0 documents 
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: content dest: content 
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: title dest: title 
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: host dest: host 
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: segment dest: segment 
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: boost dest: boost 
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: digest dest: digest 
2016-10-28 16:21:53,469 INFO solr.SolrMappingReader - source: tstamp dest: tstamp 
2016-10-28 16:21:53,472 INFO indexer.IndexingJob - Indexer: number of documents indexed, deleted, or skipped: 
2016-10-28 16:21:53,476 INFO indexer.IndexingJob - Indexer:  42 indexed (add/update) 
2016-10-28 16:21:53,477 INFO indexer.IndexingJob - Indexer: finished at 2016-10-28 16:21:53, elapsed: 00:00:32 
2016-10-28 16:21:54,199 INFO indexer.CleaningJob - CleaningJob: starting at 2016-10-28 16:21:54 
2016-10-28 16:21:54,344 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
2016-10-28 16:22:19,739 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/staging/btaek1653313730/.staging/job_local1653313730_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 
2016-10-28 16:22:19,741 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/staging/btaek1653313730/.staging/job_local1653313730_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 
2016-10-28 16:22:19,797 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/local/localRunner/btaek/job_local1653313730_0001/job_local1653313730_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 
2016-10-28 16:22:19,799 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/local/localRunner/btaek/job_local1653313730_0001/job_local1653313730_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 
2016-10-28 16:22:19,807 WARN output.FileOutputCommitter - Output Path is null in setupJob() 
2016-10-28 16:22:25,113 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter 
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: content dest: content 
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: title dest: title 
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: host dest: host 
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: segment dest: segment 
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: boost dest: boost 
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: digest dest: digest 
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: tstamp dest: tstamp 
2016-10-28 16:22:25,191 INFO solr.SolrIndexWriter - SolrIndexer: deleting 6/6 documents 
2016-10-28 16:22:25,300 WARN output.FileOutputCommitter - Output Path is null in cleanupJob() 
2016-10-28 16:22:25,301 WARN mapred.LocalJobRunner - job_local1653313730_0001 
java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down 
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) 
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529) 
Caused by: java.lang.IllegalStateException: Connection pool shut down 
    at org.apache.http.util.Asserts.check(Asserts.java:34) 
    at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169) 
    at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202) 
    at org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184) 
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415) 
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) 
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) 
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106) 
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) 
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:480) 
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241) 
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230) 
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:150) 
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:483) 
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:464) 
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:190) 
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:178) 
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115) 
    at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:120) 
    at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237) 
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459) 
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392) 
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319) 
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 
2016-10-28 16:22:25,841 ERROR indexer.CleaningJob - CleaningJob: java.io.IOException: Job failed! 
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836) 
    at org.apache.nutch.indexer.CleaningJob.delete(CleaningJob.java:172) 
    at org.apache.nutch.indexer.CleaningJob.run(CleaningJob.java:195) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) 
    at org.apache.nutch.indexer.CleaningJob.main(CleaningJob.java:206) 

As you can see in the bin/crawl command above, I had Nutch run the crawl for 2 rounds. The thing is, the error above occurs only during the second round (one level deeper than the seed sites). So, indexing runs successfully in the first round, but after crawling and parsing in the second round, Nutch spits out the error and stops.

To try something a bit different from what I did above, I did the following for a second run:

  • Deleted the TestCrawl folder to start the crawl and indexing completely fresh
  • Ran: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TSolr urls/ TestCrawl/ 1 ==> note that I changed the number of Nutch rounds to "1". This crawl and indexing ran successfully.
  • Then ran the same command again for a second round, to crawl one level deeper: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TSolr urls/ TestCrawl/ 1 ==> this gave me the exact same error as in the hadoop.log I pasted above!

So, my Solr cannot successfully index whatever Nutch crawls in the second round or deeper from the seed sites.

Could the error be due to the size of the parsed content from the seed site? The seed site is a newspaper company's website, so I am sure the second round (one level deeper) contains a huge amount of data parsed for indexing. If the problem is the parsed content size, how can I configure my Solr to fix it?

If the error comes from something else, could someone please help me identify what it is and how to fix it?

Answer


For those who have run into what I went through, I thought I would post the solution to the problem I was having.

First of all, Apache Nutch 1.12 does not seem to support Apache Solr 6.X. If you check the Apache Nutch 1.12 release notes, they recently added the feature supporting Apache Solr 5.X to Nutch 1.12, and support for Solr 6.X is not included. So, I decided to use Solr 5.5.3 instead of Solr 6.2.1, and installed Apache Solr 5.5.3 to work with Apache Nutch 1.12.

As Jorge Luis pointed out, Apache Nutch 1.12 has a bug that produces this error when it works with Apache Solr. They will fix the bug and release Nutch 1.13 at some point, but I don't know when that will be, so I decided to fix the bug myself.

The reason I got the error is that in Nutch's CleaningJob.java, the close method is called first and then the commit method. The following exception is then thrown: java.lang.IllegalStateException: Connection pool shut down.

The fix is actually quite simple. To see the solution, go here: https://github.com/apache/nutch/pull/156/commits/327e256bb72f0385563021995a9d0e96bb83c4f8

As you can see at the link above, you just need to relocate the "writers.close();" call.
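To illustrate why the order matters, here is a minimal, self-contained sketch using hypothetical stand-in classes (ConnPool, IndexWriterSketch are my own names, not the real Nutch or SolrJ API): commit() needs a live connection pool, so close(), which shuts the pool down, must come after the final commit.

```java
/** Stand-in for an HTTP connection pool (hypothetical, not SolrJ). */
class ConnPool {
    private boolean shutDown = false;

    /** Hand out a connection; fails once the pool has been shut down. */
    void lease() {
        if (shutDown) {
            throw new IllegalStateException("Connection pool shut down");
        }
    }

    void shutdown() { shutDown = true; }
}

/** Stand-in for an index writer that batches up deletes. */
class IndexWriterSketch {
    private final ConnPool pool = new ConnPool();
    int pending = 0;

    void delete(String id) { pending++; }  // queue a delete

    void commit() {
        pool.lease();                      // requires a live pool
        pending = 0;                       // flush the batch to the server
    }

    void close() { pool.shutdown(); }      // shuts the pool down for good
}

public class PoolOrderingDemo {
    public static void main(String[] args) {
        IndexWriterSketch w = new IndexWriterSketch();
        w.delete("doc-1");
        // Buggy order: w.close(); w.commit(); -> IllegalStateException,
        // which is exactly the pattern of the error described above.
        w.commit();   // correct: flush while the pool is still alive
        w.close();    // only then shut the pool down
        System.out.println("pending=" + w.pending); // prints pending=0
    }
}
```

The same reordering is what the linked commit does: move writers.close() so that it runs after the final commit instead of before it.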

By the way, to fix this you will need the Nutch src package rather than the binary package, because you will not be able to edit the CleaningJob.java file in the Nutch binary package. After making the fix, run ant, and you are all set.

After the fix, I no longer get the error!

Hope this helps anyone who is facing the problem I faced.