Nutch crawler taking a very long time

I just want Nutch to give me a list of the URLs it crawled along with the status of each link. I don't need the entire page content or any other fluff. Is there any way to do this? Crawling a seed list of 991 URLs at depth 3 takes more than 3 hours to fetch and parse, and I'm hoping to speed that up.
Nutch's nutch-default.xml file has:
<property>
<name>file.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content using the file
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the http.content.limit setting.
</description>
</property>
<property>
<name>file.content.ignored</name>
<value>true</value>
<description>If true, no file content will be saved during fetch.
And it is probably what we want to set most of time, since file:// URLs
are meant to be local and we can always use them directly at parsing
and indexing stages. Otherwise file contents will be saved.
!! NO IMPLEMENTED YET !!
</description>
</property>
<property>
<name>http.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content using the http
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>
<property>
<name>ftp.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be truncated;
otherwise, no truncation at all.
Caution: classical ftp RFCs never defines partial transfer and, in fact,
some ftp servers out there do not handle client side forced close-down very
well. Our implementation tries its best to handle such situations smoothly.
</description>
</property>
These are the properties I think might be relevant, but I don't know for sure. Can someone give me some help and clarification? Also, I'm getting a lot of URLs with a status code of 38, and I can't find that status code anywhere in this file. Thanks for your help!
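For reference, a minimal nutch-site.xml sketch along the lines these properties suggest (the values are illustrative assumptions, not tested recommendations): lowering http.content.limit truncates what the fetcher downloads, and fetcher.threads.fetch (a standard Nutch property, default 10) controls fetch parallelism. On the status question: the CrawlDatum status constants in Nutch's source are written in hexadecimal, so a decimal 38 corresponds to 0x26, which in Nutch 1.x is STATUS_FETCH_NOTMODIFIED.

<property>
<name>http.content.limit</name>
<value>1024</value>
<description>Assumed small limit: only the first 1 KB of each page is
downloaded. Pick a value that matches how much content you actually need.
</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>50</value>
<description>Fetcher thread count; the Nutch default is 10, and 50 here
is only an illustrative guess.
</description>
</property>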
Wow, I don't know why I didn't think of converting the status ID from hexadecimal to decimal, thanks for that. I've also been able to speed things up significantly by increasing the number of threads I use; that alone has cut the same crawl down to 6 minutes. The extra "fluff" is still there, though, and I can't figure out the parsing thing you described. In my database I only want to see 2 fields: an id (the URL being tested itself) and a status (the status of that URL after the fetch). – itsNino91
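One hedged way to get exactly those two fields is to dump the crawldb with readdb and strip everything but the URL and its status. The -dump option is standard bin/nutch readdb usage; the awk pattern below assumes the default dump layout (each record starts with a line beginning with the URL, followed by a "Status:" line) and the crawl/crawldb path, so treat it as a sketch rather than a version-checked recipe. Newer Nutch releases also accept -format csv on the dump, which may get you there directly.

bin/nutch readdb crawl/crawldb -dump crawldb_dump

# Assumed dump layout: a URL line, then a "Status: <code> (<name>)" line.
# Reduce each record to "url<TAB>code (name)":
awk '/^http/ {url=$1} /^Status:/ {print url "\t" $2, $3}' crawldb_dump/part-*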
The 'bin/nutch readdb crawldb -stats' command only shows overall statistics, broken down by status ID with a count of URLs for each. That isn't my end goal, but it's still informative. – itsNino91
If you want the status broken down per URL, use bin/nutch readdb crawldb -stats -sort –
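For completeness, a sketch of both invocations (crawl/crawldb is an assumed path). Note that, as far as I know, -sort breaks the -stats output down by host rather than by individual URL; for a true per-URL status list, the -dump route above is the closer match to the "id + status" goal.

# aggregate status counts, with a per-host breakdown
bin/nutch readdb crawl/crawldb -stats -sort

# one full record per URL, including its status
bin/nutch readdb crawl/crawldb -dump dump_dir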