How to extend Nutch for article crawling

I am looking for a framework to crawl articles, and I found Nutch 2.1. Here is my plan, with a question at each step:

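Add the article list page URLs to seed.txt. Here is the first problem. What I actually want to index are the article pages, not the article list pages. But if I do not allow the list pages to be indexed, Nutch will do nothing, because the list pages are the entry points. So how can I index the article pages without indexing the list pages?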
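One way to think about this is to keep "which URLs get fetched" separate from "which URLs get indexed": list pages pass the first check only, article pages pass both. The sketch below shows only that decision logic in plain Java. The class name `ArticleUrlRules` and both URL patterns are hypothetical; wiring the checks into Nutch (for example a URL filter for fetching and an indexing filter that skips list pages) is an assumption about how it would be deployed and is not shown here.

```java
import java.util.regex.Pattern;

/** Hypothetical rules: list pages are fetched for their links, only article pages are indexed. */
final class ArticleUrlRules {
    // Assumed URL layouts for illustration; replace with the patterns of the real site.
    private static final Pattern LIST_PAGE =
            Pattern.compile("^https?://example\\.com/articles/list(\\?page=\\d+)?$");
    private static final Pattern ARTICLE_PAGE =
            Pattern.compile("^https?://example\\.com/articles/\\d+\\.html$");

    /** True for every page the crawler must visit, including entry (list) pages. */
    static boolean shouldFetch(String url) {
        return LIST_PAGE.matcher(url).matches() || ARTICLE_PAGE.matcher(url).matches();
    }

    /** True only for article pages, so list pages never reach the index. */
    static boolean shouldIndex(String url) {
        return ARTICLE_PAGE.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/articles/list?page=2",
            "http://example.com/articles/12345.html",
            "http://example.com/about.html"
        };
        for (String url : urls) {
            System.out.println(url + " fetch=" + shouldFetch(url) + " index=" + shouldIndex(url));
        }
    }
}
```

With rules like these, the list pages still act as entry points, but only the article pages would be sent to the index.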

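Write a plugin to parse out 'author', 'date', 'article body', 'title' and maybe other information from the HTML. In Nutch 2.1 the 'Parser' plugin interface is: Parse getParse(String url, WebPage page), and the 'WebPage' class has some predefined attributes: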

public class WebPage extends PersistentBase { 
    // ... 
    private Utf8 baseUrl; 
    // ... 
    private ByteBuffer content; // <== This becomes null in IndexFilter 
    // ... 
    private Utf8 title; 
    private Utf8 text; 
    // ... 
    private Map<Utf8,Utf8> headers; 
    private Map<Utf8,Utf8> outlinks; 
    private Map<Utf8,Utf8> inlinks; 
    private Map<Utf8,Utf8> markers; 
    private Map<Utf8,ByteBuffer> metadata; 
    // ... 
} 

So, as you can see, there are 5 maps I could put my own attributes in. But 'headers', 'outlinks' and 'inlinks' do not seem intended for this. Maybe I could put that information into 'markers' or 'metadata'. Are they designed for this purpose?
BTW, the Parser in trunk looks like 'public ParseResult getParse(Content content)', which seems more reasonable to me.
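If 'metadata' turns out to be the right place, each custom attribute has to be stored as bytes, since the map's values are ByteBuffer. Below is a minimal sketch of that round trip in plain Java. It uses Map<String, ByteBuffer> as a stand-in for Nutch's Map<Utf8, ByteBuffer>, and the helper class `ArticleMetadata` and the key names are hypothetical, so treat it as an illustration of the encoding step rather than Nutch's own API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: stores string attributes in a String -> ByteBuffer map as UTF-8. */
final class ArticleMetadata {
    /** Encodes the value as UTF-8 bytes and stores it under the given key. */
    static void put(Map<String, ByteBuffer> metadata, String key, String value) {
        metadata.put(key, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    /** Decodes the stored bytes back to a string, or returns null if the key is absent. */
    static String get(Map<String, ByteBuffer> metadata, String key) {
        ByteBuffer buf = metadata.get(key);
        if (buf == null) {
            return null;
        }
        // duplicate() leaves the stored buffer's position untouched, so it can be read again.
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    public static void main(String[] args) {
        Map<String, ByteBuffer> metadata = new HashMap<>(); // stand-in for WebPage.metadata
        put(metadata, "author", "user1633272");
        put(metadata, "date", "2012-12-15");
        System.out.println(get(metadata, "author") + " / " + get(metadata, "date"));
    }
}
```

Decoding from a duplicate matters because reading a ByteBuffer advances its position; without it, a second read of the same attribute would come back empty.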

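After the articles are indexed into Solr, another application can query them by 'date' and then store the article information in MySQL. My question here is: can Nutch store the articles directly into MySQL? Or can I write a plugin to specify the indexing behavior?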
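For the "query by 'date'" step, the other application would typically send Solr a range query. The sketch below only builds the query string, assuming the Solr field is named date and is a date-type field. The class `SolrDateQuery` is hypothetical, and sending the request and writing rows to MySQL are left out.

```java
import java.time.Instant;

/** Hypothetical helper that builds a Solr range query over a date field. */
final class SolrDateQuery {
    /** Returns e.g. date:[2012-12-01T00:00:00Z TO 2012-12-15T00:00:00Z] (both ends inclusive). */
    static String dateRange(String field, Instant from, Instant to) {
        if (from.isAfter(to)) {
            throw new IllegalArgumentException("from must not be after to");
        }
        // Instant.toString() yields ISO-8601 in UTC, the format Solr expects for dates.
        return field + ":[" + from + " TO " + to + "]";
    }

    public static void main(String[] args) {
        Instant from = Instant.parse("2012-12-01T00:00:00Z");
        Instant to = Instant.parse("2012-12-15T00:00:00Z");
        System.out.println(dateRange("date", from, to));
    }
}
```

The resulting string would go into the q (or fq) parameter of the Solr request, URL-encoded by whatever HTTP client is used.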

Is Nutch a good choice for my purpose? If not, would you suggest another good-quality framework/library? Thanks for your help.

Source

2012-12-15 user1633272

If extracting articles from a few websites is all you are looking for, then check out http://www.crawl-anywhere.com/

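It comes with an admin UI where you can specify that boilerpipe article extraction should be used (which is great). You can also specify, by URL pattern, which pages to match and which pages to crawl and index.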

Source

2012-12-27 18:27:25 Hari

In the Crawl Anywhere documentation I could not find a feature that lets you extract only the article body (rather than the whole HTML page body).
