
I was given a website that contains links to some pdf files. I want Nutch to crawl those links and dump them as .pdf files. I am using Apache Nutch 1.6, invoked from Java as follows:

// Run the crawl, then dump the fetched segment with SegmentReader.
// tokenize() is a helper that splits the argument string into a String[].
ToolRunner.run(NutchConfiguration.create(), new Crawl(), tokenize(crawlArg));
SegmentReader.main(tokenize(dumpArg));
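
For reference, the two argument strings might look like the sketch below; the seed directory, output paths, depth, and segment name are assumptions, and tokenize() is assumed to split the string on whitespace into a String[]:

    // Hypothetical arguments for the Crawl tool and SegmentReader (all paths are placeholders)
    String crawlArg = "urls -dir crawl -depth 3 -topN 50";
    String dumpArg  = "-dump crawl/segments/20130703000000 dumpedSegment";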

I am trying this approach; can someone help me with it?

Answers


You can write your own plugin for the pdf MIME type, or use the embedded Apache Tika parser, which can extract the text from a pdf.
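
As a minimal sketch of the second option, the standalone Tika facade can extract pdf text on its own (the class name and file name here are placeholders):

    import java.io.File;
    import org.apache.tika.Tika;

    public class PdfTextDump {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();                                   // auto-detects the MIME type
            String text = tika.parseToString(new File("sample.pdf")); // plain text extracted from the pdf
            System.out.println(text);
        }
    }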


If you want Nutch to crawl and index your pdf documents, you have to enable document crawling and the Tika plugin:

  1. Document crawling

    1.1 Edit regex-urlfilter.txt and remove any occurrence of "pdf", so that the suffix filter looks like this:

    # skip image and other suffixes we can't yet parse 
    # for a more extensive coverage use the urlfilter-suffix plugin 
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$ 
    

    1.2 Edit suffix-urlfilter.txt and remove any occurrence of "pdf", as in the excerpt below.
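
    The relevant portion of suffix-urlfilter.txt might look like the following (an excerpt given as an assumption; the exact defaults vary by version). Delete the .pdf line so the filter no longer discards pdf URLs:

    # document formats
    .doc
    .pdf
    .rtf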

    1.3 Edit nutch-site.xml and add "parse-tika" and "parse-html" to the plugin.includes section:

    <property> 
        <name>plugin.includes</name> 
        <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value> 
        <description>Regular expression naming plugin directory names to 
        include. Any plugin not matching this expression is excluded. 
        In any case you need at least include the nutch-extensionpoints plugin. By 
        default Nutch includes crawling just HTML and plain text via HTTP, 
        and basic indexing and search plugins. In order to use HTTPS please enable 
        protocol-httpclient, but be aware of possible intermittent problems with the 
        underlying commons-httpclient library. 
        </description> 
    </property> 
    
  2. If what you really want is to download all the pdf files linked from a page, you can use something like Teleport on Windows or wget on *nix, as in the example below.
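
For example, a single wget invocation along these lines fetches every pdf linked from one page (the URL is a placeholder):

    wget -r -l1 -nd -A.pdf http://example.com/page-with-pdfs/

Here -r recurses, -l1 limits recursion to the links on that page, -nd flattens the output into one directory, and -A.pdf keeps only files ending in .pdf.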