Problem parsing PDFs with Apache Nutch - extractor plugin

I am trying to index web pages and PDF documents from a website. I am using Nutch 1.9.

I downloaded the nutch-custom-search plugin from https://github.com/BayanGroup/nutch-custom-search. The plugin is great and does let me map selected divs to Solr fields.

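The problem I have is that my site also contains many PDF files. I can see that they are fetched, but they are never parsed. When I query Solr there are no PDFs, only web pages. I am trying to use Tika to parse the PDFs (I hope I have the right idea).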

On Cygwin, if I run parsechecker as shown below, the PDF seems to parse OK:

$ bin/nutch parsechecker -dumptext -forceAs application/pdf http://www.immunisationscotland.org.uk/uploads/documents/18304-Tuberculosis.pdf

I am not too sure what the next step is (see my configuration below).

extractor.xml

<config xmlns="http://bayan.ir" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://bayan.ir http://raw.github.com/BayanGroup/nutch-custom-search/master/zal.extractor/src/main/resources/extractors.xsd" omitNonMatching="true">
    <fields>
        <field name="pageTitleChris" />
        <field name="contentChris" />
    </fields>
    <documents>
        <!-- The url pattern excludes URLs ending in ".pdf" -->
        <document url="^.*\.(?!pdf$)[^.]+$" engine="css">
            <extract-to field="pageTitleChris">
                <text>
                    <expr value="head > title" />
                </text>
            </extract-to>
            <extract-to field="contentChris">
                <text>
                    <expr value="#primary-content" />
                </text>
            </extract-to>
        </document>
    </documents>
</config>

In my parse-plugins.xml I added

<mimeType name="application/pdf">
    <plugin id="parse-tika" />
</mimeType>

nutch-site.xml

<property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|extractor|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<property> 
    <name>http.content.limit</name> 
    <value>65536666</value> 
    <description></description> 
</property> 


<property> 
    <name>extractor.file</name> 
    <value>extractor.xml</value> 
</property>

Any help would be much appreciated,

Thanks

Chris

Source

2014-11-24 cgoasduff

I think the problem is related to omitNonMatching="true" in the extractor.xml file.

omitNonMatching="true" means "do not index any page that does not match the extraction rules in extractor.xml". The default value is false.
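To illustrate the point, here is a minimal sketch that checks the url pattern from the question's extractor.xml against the PDF URL used with parsechecker. It is written in Python for convenience; Nutch itself runs on Java, and how the plugin applies the pattern (full match versus search) is an assumption here, although this pattern is anchored with ^ and $ so it should behave the same either way. The HTML URL is a hypothetical example, not a page from the question.

```python
import re

# Pattern copied from the <document url="..."> rule in the question's extractor.xml.
RULE = re.compile(r"^.*\.(?!pdf$)[^.]+$")


def matches_rule(url: str) -> bool:
    """Return True if the URL matches the document rule's url pattern."""
    return RULE.search(url) is not None


# PDF URL taken from the parsechecker command in the question.
pdf_url = "http://www.immunisationscotland.org.uk/uploads/documents/18304-Tuberculosis.pdf"
# Hypothetical HTML page on the same site, for comparison only.
html_url = "http://www.immunisationscotland.org.uk/index.html"

for url in (pdf_url, html_url):
    print(url, "->", "matches" if matches_rule(url) else "no match")
```

If the PDF URL matches no document rule, then with omitNonMatching="true" it would be left out of the index, which is the behaviour described above.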

Source

2014-11-25 13:17:34 tahagh

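Hi Taha, thanks for your reply. I'm afraid that just setting omitNonMatching to false does not solve the problem. :( In fact, I added this option after reading about a similar issue on GitHub: https://github.com/BayanGroup/nutch-custom-search/issues/1 Any other ideas? Thanks, Chris – cgoasduff 2014-11-25 13:50:30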

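Have you configured the plugin in mode 1 (as an HTTP parser plugin) or mode 2 (as a standalone parser)? (I guess you are using mode 1, but I want to be sure.) – tahagh 2014-11-25 14:30:56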

Another question: if you disable the extractor plugin, are your PDF files indexed into Solr normally? – tahagh 2014-11-25 14:50:17
