2012-03-01 80 views
Nutch path error

http://wiki.apache.org/nutch/NutchTutorial
http://www.nutchinstall.blogspot.com/

When I run the command

bin/nutch crawl urls -dir crawl -depth 3 -topN 5 

I get the following output:

LinkDb: adding segment: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/201203
LinkDb: adding segment: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/201203
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221729/parse_data 
Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221754/parse_data 
Input path does not exist: file:/C:/cygwin/home/LeHung/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120301221804/parse_data 
     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190) 
     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44) 
     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201) 
     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810) 
     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781) 
     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) 
     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) 
     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175) 
     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149) 
     at org.apache.nutch.crawl.Crawl.run(Crawl.java:143) 
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) 
     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55) 

I am running Nutch under Cygwin on Windows when this error occurs.

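It looks like some of the segment directories are missing the required parse_data subfolder. Did you run a crawl earlier and then delete some directories? – javanna 2012-03-02 09:18:12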
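
One way to check the comment's hypothesis is to list which segment directories lack a parse_data subfolder. The following is a minimal sketch, not part of Nutch: the function name `segments_missing_parse_data` is hypothetical, and the `crawl/segments` location is assumed from the `-dir crawl` argument in the question.

```python
from pathlib import Path


def segments_missing_parse_data(segments_dir):
    """Return the names of segment directories that have no parse_data subfolder."""
    root = Path(segments_dir)
    if not root.is_dir():
        # Nothing to inspect if the segments directory itself is absent.
        return []
    missing = []
    for segment in sorted(root.iterdir()):
        # Each segment is a timestamp-named directory such as 20120301221729.
        if segment.is_dir() and not (segment / "parse_data").is_dir():
            missing.append(segment.name)
    return missing
```

Calling `segments_missing_parse_data("crawl/segments")` from `runtime/local` should return the segment names that appear in the "Input path does not exist" messages, if the diagnosis in the comment is right.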

I have the same problem. – Aftershock 2012-03-21 21:37:53

Answers

I had a similar problem. I deleted the db and the directories. After that, it ran fine.
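
The cleanup described in this answer could be sketched as below. The answer does not say exactly which directories were deleted, so this assumes it means the whole output directory passed via `-dir` (here `crawl`, which holds crawldb, linkdb and segments). The helper name `reset_crawl_dir` is hypothetical.

```python
import shutil
from pathlib import Path


def reset_crawl_dir(crawl_dir):
    """Delete the crawl output directory so the next crawl starts from scratch.

    Returns True if the directory was deleted, False if it did not exist.
    """
    path = Path(crawl_dir)
    if not path.is_dir():
        return False
    # Removes everything below crawl_dir (crawldb, linkdb, segments).
    # All previously crawled data is lost.
    shutil.rmtree(path)
    return True
```

After `reset_crawl_dir("crawl")`, rerunning the `bin/nutch crawl` command from the question would rebuild the directory from the seed URLs.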

That seems to work for me. But if I had been crawling for several days, that would not be so good. – MrROY 2012-03-23 07:36:00
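
For a crawl that took days, a less destructive option might be to move only the incomplete segments out of the way instead of deleting everything. This is a sketch under an assumption the thread does not confirm: that the link inversion step succeeds once every remaining segment has a parse_data subfolder. The name `quarantine_incomplete_segments` and the quarantine directory are hypothetical.

```python
import shutil
from pathlib import Path


def quarantine_incomplete_segments(segments_dir, quarantine_dir):
    """Move segments lacking parse_data into quarantine_dir and return their names.

    quarantine_dir should be located outside segments_dir.
    """
    root = Path(segments_dir)
    target = Path(quarantine_dir)
    moved = []
    if not root.is_dir():
        return moved
    # sorted() materialises the listing before any directory is moved.
    for segment in sorted(root.iterdir()):
        if segment.is_dir() and not (segment / "parse_data").is_dir():
            target.mkdir(parents=True, exist_ok=True)
            shutil.move(str(segment), str(target / segment.name))
            moved.append(segment.name)
    return moved
```

Moving rather than deleting keeps the fetched data available in case the segments can still be parsed later.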