2013-06-21 57 views
1

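Nutch Crawler: reading segment results

I am using apache-nutch-crawler 1.6 for crawling. After the crawl, when I try to read the content of the crawled results with this command: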

bin/nutch readseg -dump crawl/segments/* segmentAllContent 

I get the following error:

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/crawl_generate 
    Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/crawl_fetch 
    Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/crawl_parse 
    Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/content 
    Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/parse_data 
    Input path does not exist: file:/home/ubuntu/nutch/framework/apache-nutch-1.6/blogs/segments/2013062110/parse_text 
      at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197) 
      at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40) 
      at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208) 
      at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989) 
      at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981) 
      at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174) 
      at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897) 
      at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850) 
      at java.security.AccessController.doPrivileged(Native Method) 
      at javax.security.auth.Subject.doAs(Subject.java:416) 
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) 
      at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850) 
      at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824) 
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261) 
      at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:224) 
      at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:572) 

How can I read the HTML content after crawling? Thanks in advance.

Answers

3

I generally try to merge all the segments first:

bin/nutch mergesegs crawl/merged crawl/segments/*

and then

bin/nutch readseg -dump crawl/merged/* segmentAllContent
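
Before merging, it may help to see which segments are actually incomplete. The stack trace in the question names six subdirectories that Hadoop could not find, so a rough sketch is to check each segment directory for them. The function names `missing_parts` and `check_segments` are made up for this example and are not part of Nutch; the list of subdirectories is taken only from the error message above.

```shell
#!/bin/sh
# Subdirectories that the stack trace in the question reports as missing.
SEGMENT_PARTS="crawl_generate crawl_fetch crawl_parse content parse_data parse_text"

# missing_parts SEGMENT_DIR
# Print, one per line, each expected subdirectory that is absent.
# (Hypothetical helper, not a Nutch command.)
missing_parts() {
    for part in $SEGMENT_PARTS; do
        [ -d "$1/$part" ] || printf '%s\n' "$part"
    done
}

# check_segments DIR...
# Report every incomplete segment; return 1 if at least one is incomplete.
# (Hypothetical helper, not a Nutch command.)
check_segments() {
    status=0
    for seg in "$@"; do
        missing=$(missing_parts "$seg")
        if [ -n "$missing" ]; then
            printf '%s is missing: %s\n' "$seg" "$(printf '%s' "$missing" | tr '\n' ' ')"
            status=1
        fi
    done
    return $status
}

# Example, using the path from the commands above:
#   check_segments crawl/segments/*
```

A segment reported here was probably left over from an interrupted or partial crawl cycle, which would explain why readseg fails on it when given the crawl/segments/* wildcard.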