2014-02-07 85 views

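I have a site hosted on my local machine that I am trying to crawl with Nutch and index in Solr (both also running on my local machine). I installed Solr 4.6.1 and Nutch 1.7 following the instructions on the Nutch website (http://wiki.apache.org/nutch/NutchTutorial), and Solr runs fine in my browser. The Solr indexing step that follows the Nutch crawl fails and reports "Job failed!".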

I run the following command:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 1 -topN 2 

The crawl works fine, but when it attempts to put the data into Solr, it fails with the following output:

Indexer: starting at 2014-02-06 16:29:28 
Indexer: deleting gone documents: false 
Indexer: URL filtering: false 
Indexer: URL normalizing: false 
Active IndexWriters : 
SOLRIndexWriter 
    solr.server.url : URL of the SOLR instance (mandatory) 
    solr.commit.size : buffer size when sending to SOLR (default 1000) 
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml) 
    solr.auth : use authentication (default false) 
    solr.auth.username : use authentication (default false) 
    solr.auth : username for authentication 
    solr.auth.password : password for authentication 


Exception in thread "main" java.io.IOException: Job failed! 
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357) 
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123) 
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81) 
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65) 
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:155) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) 
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55) 

I went to Nutch's logs directory and tailed the hadoop.log file, which shows this:

2014-02-06 16:29:28,920 INFO solr.SolrIndexWriter - Indexing 1 documents 
2014-02-06 16:29:28,921 INFO httpclient.HttpMethodDirector - I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server localhost failed to respond 
2014-02-06 16:29:28,921 INFO httpclient.HttpMethodDirector - Retrying request 
2014-02-06 16:29:28,924 WARN mapred.LocalJobRunner - job_local331896790_0009 
java.io.IOException 
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.makeIOException(SolrIndexWriter.java:173) 
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:159) 
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:118) 
    at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44) 
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:467) 
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:535) 
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421) 
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398) 
Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset 
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:478) 
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244) 
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) 
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:155) 
    ... 6 more 
Caused by: java.net.SocketException: Connection reset 
    at java.net.SocketInputStream.read(SocketInputStream.java:168) 
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) 
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237) 
    at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78) 
    at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106) 
    at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116) 
    at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973) 
    at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735) 
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098) 
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398) 
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) 
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) 
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) 
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:422) 
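
The console only reports the generic "Job failed!", while the innermost "Caused by:" entry in hadoop.log names the actual failure. A minimal sketch of pulling that entry out of a log excerpt, assuming the trace format shown above; the helper name `root_cause` and the shortened `SAMPLE` text are made up for illustration.

```python
import re

# Matches lines such as "Caused by: java.net.SocketException: Connection reset".
CAUSE = re.compile(r"^Caused by: (?P<exc>[\w.$]+)(?:: (?P<msg>.*))?$")


def root_cause(log_text):
    """Return (exception, message) from the last 'Caused by:' line, or None."""
    last = None
    for line in log_text.splitlines():
        match = CAUSE.match(line.strip())
        if match:
            # Later "Caused by:" lines are deeper in the chain, so keep the last one.
            last = (match.group("exc"), match.group("msg") or "")
    return last


# Shortened excerpt of the trace above (hypothetical sample input).
SAMPLE = """java.io.IOException
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.makeIOException(SolrIndexWriter.java:173)
Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:478)
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:168)
"""

print(root_cause(SAMPLE))
```

Here the innermost cause is a connection reset while Nutch was sending documents to Solr, which by itself does not say why Solr closed the connection.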

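However, I can still access Solr in my browser just fine. This is my first attempt at Solr/Nutch, so help from anyone with more knowledge would be greatly appreciated. Thanks.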

Answers


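This happens when not all of the required fields from Nutch are present in Solr's schema.xml. Did you add the fields from Nutch's schema.xml?

If you add the following inside the "fields" section, things should work: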

<field name="id" type="string" stored="true" indexed="true"/> 
<!-- core fields --> 
<field name="segment" type="string" stored="true" indexed="false"/> 
<field name="digest" type="string" stored="true" indexed="false"/> 
<field name="boost" type="float" stored="true" indexed="false"/> 

<!-- fields for index-basic plugin --> 
<field name="host" type="string" stored="false" indexed="true"/> 
<field name="url" type="url" stored="true" indexed="true" 
    required="true"/> 
<field name="content" type="text_general" stored="false" indexed="true"/> 
<field name="title" type="text_general" stored="true" indexed="true"/> 
<field name="cache" type="string" stored="true" indexed="false"/> 
<field name="tstamp" type="date" stored="true" indexed="false"/> 

<!-- fields for index-anchor plugin --> 
<field name="anchor" type="string" stored="true" indexed="true" 
    multiValued="true"/> 

<!-- fields for index-more plugin --> 
<field name="type" type="string" stored="true" indexed="true" 
    multiValued="true"/> 
<field name="contentLength" type="long" stored="true" 
    indexed="false"/> 
<field name="lastModified" type="date" stored="true" 
    indexed="false"/> 
<field name="date" type="date" stored="true" indexed="true"/> 

<!-- fields for languageidentifier plugin --> 
<field name="lang" type="string" stored="true" indexed="true"/> 

<!-- fields for subcollection plugin --> 
<field name="subcollection" type="string" stored="true" 
    indexed="true" multiValued="true"/> 

<!-- fields for feed plugin (tag is also used by microformats-reltag)--> 
<field name="author" type="string" stored="true" indexed="true"/> 
<field name="tag" type="string" stored="true" indexed="true" multiValued="true"/> 
<field name="feed" type="string" stored="true" indexed="true"/> 
<field name="publishedDate" type="date" stored="true" 
    indexed="true"/> 
<field name="updatedDate" type="date" stored="true" 
    indexed="true"/> 

<!-- fields for creativecommons plugin --> 
<field name="cc" type="string" stored="true" indexed="true" 
    multiValued="true"/> 

<!-- fields for tld plugin -->  
<field name="tld" type="string" stored="false" indexed="false"/> 
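
To see which of these fields a given Solr schema still lacks, the two schema files can be compared by field name. A minimal sketch, assuming both schemas are available as XML text; the function names `field_names` and `missing_fields` and the two tiny sample schemas are hypothetical, not part of Nutch or Solr.

```python
import xml.etree.ElementTree as ET


def field_names(schema_xml):
    """Collect the name attribute of every <field> element in a schema."""
    root = ET.fromstring(schema_xml)
    # iter() searches the whole tree, so it does not matter whether the
    # <field> elements are wrapped in a <fields> section or not.
    return {f.get("name") for f in root.iter("field") if f.get("name")}


def missing_fields(nutch_schema_xml, solr_schema_xml):
    """Return, sorted, the field names Nutch declares that Solr does not."""
    return sorted(field_names(nutch_schema_xml) - field_names(solr_schema_xml))


# Hypothetical, heavily shortened schemas used only for illustration.
NUTCH_SAMPLE = """<schema name="nutch"><fields>
  <field name="id" type="string" stored="true" indexed="true"/>
  <field name="segment" type="string" stored="true" indexed="false"/>
  <field name="digest" type="string" stored="true" indexed="false"/>
  <field name="url" type="url" stored="true" indexed="true" required="true"/>
</fields></schema>"""

SOLR_SAMPLE = """<schema name="example">
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="url" type="text_general" indexed="true" stored="true"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
</schema>"""

print(missing_fields(NUTCH_SAMPLE, SOLR_SAMPLE))
```

The comparison looks only at explicit `<field>` names. It ignores `<dynamicField>` patterns and does not check that each referenced `type` is defined in the target schema, so treat its result as a starting point rather than a full validation.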

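I had a similar problem with Nutch 1.8 and Solr 4.8.0. In fact, Diaa's answer helped me solve it. After removing the entries in schema.xml that overlapped with Diaa's field list, and after changing the two entries marked "added by wb" and "changed by wb", I ended up with the following field list, which works for me. In contrast to earlier versions of Nutch and Solr, there is no "fields" tag; the "field" entries sit directly inside "schema". Here is the complete field list: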

<field name="_root_" type="string" indexed="true" stored="false"/> 

    <!-- Only remove the "id" field if you have a very good reason to. While not strictly 
    required, it is highly recommended. A <uniqueKey> is present in almost all Solr 
    installations. See the <uniqueKey> declaration below where <uniqueKey> is set to "id". 
    --> 
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 

    <field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/> 
    <field name="name" type="text_general" indexed="true" stored="true"/> 
    <field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/> 
    <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/> 
    <field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/> 
    <field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" /> 

    <field name="weight" type="float" indexed="true" stored="true"/> 
    <field name="price" type="float" indexed="true" stored="true"/> 
    <field name="popularity" type="int" indexed="true" stored="true" /> 
    <field name="inStock" type="boolean" indexed="true" stored="true" /> 

    <field name="store" type="location" indexed="true" stored="true"/> 

    <!-- Common metadata fields, named specifically to match up with 
    SolrCell metadata when parsing rich documents such as Word, PDF. 
    Some fields are multiValued only because Tika currently may return 
    multiple values for them. Some metadata is parsed from the documents, 
    but there are some which come from the client context: 
     "content_type": From the HTTP headers of incoming stream 
     "resourcename": From SolrCell request param resource.name 
    --> 
    <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/> 
    <field name="subject" type="text_general" indexed="true" stored="true"/> 
    <field name="description" type="text_general" indexed="true" stored="true"/> 
    <field name="comments" type="text_general" indexed="true" stored="true"/> 
    <field name="author" type="text_general" indexed="true" stored="true"/> 
    <field name="keywords" type="text_general" indexed="true" stored="true"/> 
    <field name="category" type="text_general" indexed="true" stored="true"/> 
    <field name="resourcename" type="text_general" indexed="true" stored="true"/> 

    <!-- added by wb: required="true" --> 
    <field name="url" type="text_general" indexed="true" stored="true" required="true"/> 

    <field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/> 
    <field name="last_modified" type="date" indexed="true" stored="true"/> 
    <field name="links" type="string" indexed="true" stored="true" multiValued="true"/> 

    <!-- Main body of document extracted by SolrCell. 
     NOTE: This field is not indexed by default, since it is also copied to "text" 
     using copyField below. This is to save space. Use this field for returning and 
     highlighting document content. Use the "text" field to search the content. --> 

    <!-- changed by wb: indexed="true" --> 
    <field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/> 


    <!-- catchall field, containing all other searchable text fields (implemented 
     via copyField further on in this schema --> 
    <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/> 

    <!-- catchall text field that indexes tokens both normally and in reverse for efficient 
     leading wildcard queries. --> 
    <field name="text_rev" type="text_general_rev" indexed="true" stored="false" multiValued="true"/> 

    <!-- non-tokenized version of manufacturer to make it easier to sort or group 
     results by manufacturer. copied from "manu" via copyField --> 
    <field name="manu_exact" type="string" indexed="true" stored="false"/> 

    <field name="payloads" type="payloads" indexed="true" stored="true"/> 

    <!-- Fields needed for Nutch 1.8 integration: --> 

    <field name="segment" type="string" stored="true" indexed="false"/> 
    <field name="digest" type="string" stored="true" indexed="false"/> 
    <field name="boost" type="float" stored="true" indexed="false"/> 

    <!-- fields for index-basic plugin --> 
    <field name="host" type="string" stored="false" indexed="true"/> 
    <field name="cache" type="string" stored="true" indexed="false"/> 
    <field name="tstamp" type="date" stored="true" indexed="false"/> 

    <!-- fields for index-anchor plugin --> 
    <field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/> 

    <!-- fields for index-more plugin --> 
    <field name="type" type="string" stored="true" indexed="true" multiValued="true"/> 
    <field name="contentLength" type="long" stored="true" indexed="false"/> 
    <field name="lastModified" type="date" stored="true" indexed="false"/> 
    <field name="date" type="date" stored="true" indexed="true"/> 

    <!-- fields for languageidentifier plugin --> 
    <field name="lang" type="string" stored="true" indexed="true"/> 

    <!-- fields for subcollection plugin --> 
    <field name="subcollection" type="string" stored="true" indexed="true" multiValued="true"/> 

    <!-- fields for feed plugin (tag is also used by microformats-reltag)--> 
    <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/> 
    <field name="feed" type="string" stored="true" indexed="true"/> 
    <field name="publishedDate" type="date" stored="true" indexed="true"/> 
    <field name="updatedDate" type="date" stored="true" indexed="true"/> 

    <!-- fields for creativecommons plugin --> 
    <field name="cc" type="string" stored="true" indexed="true" multiValued="true"/> 

    <!-- fields for tld plugin -->  
    <field name="tld" type="string" stored="false" indexed="false"/> 

    <!-- End of fields needed for Nutch 1.8 integration: --> 

I tried this, but it throws the same error. Can you help me with a workaround? I am using Solr 4.8 and Nutch 1.12 – ammu


Hello, I know this question is old, but for people using the 2017 versions of Nutch and Solr (Nutch 1.13, Solr 5.5.0): I had the same problem and solved it as follows.

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/#/nutch urls/ TestCrawl2/ 1

The above is the command I was using for crawling, and it gave me the same error. The following works:

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch urls TestCrawl2 2

I just removed the "/" after urls/ and TestCrawl2/, and it worked. Thanks.
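
Besides the trailing slashes, the two commands also differ in the Solr URL: the first contains "#/", which is the form the Solr admin UI shows in the browser address bar, while the second points at the core path directly. Which of the two changes fixed the error is not established here. A minimal sketch of a check for that URL form before starting a crawl; the function name `check_solr_url` is hypothetical.

```python
from urllib.parse import urlsplit


def check_solr_url(url):
    """Return a list of likely problems with a value meant for solr.server.url."""
    parts = urlsplit(url)
    problems = []
    if parts.scheme not in ("http", "https"):
        problems.append("scheme is not http or https")
    if parts.fragment:
        # Everything after '#' is a browser-side fragment, not part of the path.
        problems.append("contains a '#' fragment: %r" % parts.fragment)
    return problems


print(check_solr_url("http://localhost:8983/solr/#/nutch"))
print(check_solr_url("http://localhost:8983/solr/nutch"))
```

The first call flags the fragment "/nutch"; the second returns an empty list.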