Nutch crawler taking a very long time

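I just want Nutch to give me a list of the URLs it crawled and the status of each link. I don't need the whole page content or any of the fluff. Is there a way to do this? Crawling a seed list of 991 URLs to a depth of 3 takes more than 3 hours to fetch and parse. I'm hoping this will speed things up.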

The nutch-default.xml file contains:

<property> 
    <name>file.content.limit</name> 
    <value>65536</value> 
    <description>The length limit for downloaded content using the file 
    protocol, in bytes. If this value is nonnegative (>=0), content longer 
    than it will be truncated; otherwise, no truncation at all. Do not 
    confuse this setting with the http.content.limit setting. 
    </description> 
</property> 

<property> 
    <name>file.content.ignored</name> 
    <value>true</value> 
    <description>If true, no file content will be saved during fetch. 
    And it is probably what we want to set most of time, since file:// URLs 
    are meant to be local and we can always use them directly at parsing 
    and indexing stages. Otherwise file contents will be saved. 
    !! NO IMPLEMENTED YET !! 
    </description> 
</property> 

<property> 
    <name>http.content.limit</name> 
    <value>65536</value> 
    <description>The length limit for downloaded content using the http 
    protocol, in bytes. If this value is nonnegative (>=0), content longer 
    than it will be truncated; otherwise, no truncation at all. Do not 
    confuse this setting with the file.content.limit setting. 
    </description> 
</property> 

<property> 
    <name>ftp.content.limit</name> 
    <value>65536</value> 
    <description>The length limit for downloaded content, in bytes. 
    If this value is nonnegative (>=0), content longer than it will be truncated; 
    otherwise, no truncation at all. 
    Caution: classical ftp RFCs never defines partial transfer and, in fact, 
    some ftp servers out there do not handle client side forced close-down very 
    well. Our implementation tries its best to handle such situations smoothly. 
    </description> 
</property>

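These are the properties I think might have something to do with it, but I'm not sure. Can someone give me some help and clarification? Also, I'm getting a lot of URLs with a status code of 38, and I can't find that status code in this document. Thanks for the help!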

2015-05-13 itsNino91

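Nutch parses each URL after fetching it in order to extract all of its outlinks. The outlinks from a URL are then used as the new fetch list for the next round.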

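If parsing is skipped, no new URLs are generated, so nothing more gets fetched. One way I can think of is to configure the parse plugins to handle only the type of content you need parsed (in your case, the outlinks). Here is an example: https://wiki.apache.org/nutch/IndexMetatags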

This link describes the parser features: https://wiki.apache.org/nutch/Features

Now, to get just the list of fetched URLs along with their status, you can use the command

$ bin/nutch readdb crawldb -stats

Regarding the status code 38: looking at the document you linked, it seems the status of those URLs is public static final byte STATUS_FETCH_NOTMODIFIED = 0x26

since hex(26) corresponds to dec(38).
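The conversion can be checked with a few lines of Java. This is only a sketch: the constant value is copied from the line quoted above rather than imported from Nutch, and the class name `StatusCodeCheck` and the helper `toHexLiteral` are made up for illustration.

```java
public class StatusCodeCheck {
    // Value copied from the constant quoted above; not imported from Nutch itself.
    static final byte STATUS_FETCH_NOTMODIFIED = 0x26;

    // Hypothetical helper: formats a decimal status code as a hex literal.
    static String toHexLiteral(int status) {
        return "0x" + Integer.toHexString(status);
    }

    public static void main(String[] args) {
        // Hex 26 = 2 * 16 + 6 = 38 in decimal.
        System.out.println(STATUS_FETCH_NOTMODIFIED);    // 38
        System.out.println(Integer.parseInt("26", 16));  // 38
        System.out.println(toHexLiteral(38));            // 0x26
    }
}
```

Going in the other direction, `toHexLiteral` turns a decimal code seen in the crawl output back into the form used in the source, which makes it easier to search for.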

Hope this answer gives you some direction :)

2015-05-15 12:35:42

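Wow, I don't know why I didn't think of converting the status ID from hex to decimal, thanks for that. I've been able to speed things up significantly by increasing the number of threads I use. That has cut the time for the same crawl down to 6 minutes. However, the extra "fluff" is still there. I can't figure out the parsing setup you describe here. In my database, I only want to see 2 fields: an id (the URL being tested) and a status (the status of the URL after fetching). – itsNino91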

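The 'bin/nutch readdb crawldb -stats' command only shows overall statistics, broken down by status ID with the number of URLs for each. That isn't my end goal, but it is still useful information. – itsNino91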

If you want the status broken down per URL, use bin/nutch readdb crawldb -stats -sort –
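For the two-field view asked for above (URL plus status), one possible approach in Nutch 1.x is to write a plain-text dump with `bin/nutch readdb crawldb -dump <out_dir>` and reduce it to one line per URL. The sketch below rests on an assumption about the dump layout: each record starts with a line of the form `<url><TAB>Version: <n>`, followed by a line `Status: <code> (<name>)`. The layout may differ between Nutch versions, so check a real dump first. The class name `CrawlDbDumpStatus` and the sample records are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlDbDumpStatus {
    // Assumed record header: "<url>\tVersion: <n>".
    private static final Pattern HEADER = Pattern.compile("^(\\S+)\\tVersion: \\d+$");
    // Assumed status line: "Status: <code> (<name>)".
    private static final Pattern STATUS = Pattern.compile("^Status: (\\d+) \\((\\w+)\\)$");

    // Returns URL -> "<code> <name>" in the order the records appear.
    static Map<String, String> parse(String dumpText) {
        Map<String, String> statusByUrl = new LinkedHashMap<>();
        String currentUrl = null;
        for (String rawLine : dumpText.split("\\R")) {
            String line = rawLine.trim();
            Matcher header = HEADER.matcher(line);
            if (header.matches()) {
                currentUrl = header.group(1);
                continue;
            }
            Matcher status = STATUS.matcher(line);
            if (currentUrl != null && status.matches()) {
                statusByUrl.put(currentUrl, status.group(1) + " " + status.group(2));
                currentUrl = null; // ignore the remaining fields of this record
            }
        }
        return statusByUrl;
    }

    public static void main(String[] args) {
        // Hypothetical sample in the assumed layout, not real crawl output.
        String sample = "http://example.com/\tVersion: 7\n"
                + "Status: 2 (db_fetched)\n"
                + "Fetch time: (omitted)\n"
                + "\n"
                + "http://example.org/a\tVersion: 7\n"
                + "Status: 6 (db_notmodified)\n";
        parse(sample).forEach((url, status) -> System.out.println(url + "\t" + status));
    }
}
```

Every field other than the URL and the status line is skipped, so the result contains just the two columns described in the comment above.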
