How do I index an NFS mount with Nutch?

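I am trying to build a search tool, hosted on a CentOS 7 machine, that should index and search mounted NFS export directories. I found that Nutch + Solr is the best option. I am having a hard time configuring the URLs, since this will not be crawling any http locations. How do I index an NFS mount with Nutch?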

The export is mounted at /mnt.

So my seed.txt looks like this:

[[email protected] bin]# cat /root/Desktop/apache-nutch-1.13/urls/seed.txt 
file:///mnt

My regex-urlfilter.txt keeps the same default sections, changed to allow the file protocol:

# skip file: ftp: and mailto: urls 
-^(http|https|ftp|mailto): 

# skip image and other suffixes we can't yet parse 
# for a more extensive coverage use the urlfilter-suffix plugin 
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$ 

# skip URLs containing certain characters as probable queries, etc. 
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops 
-.*(/[^/]+)/[^/]+\1/[^/]+\1/ 

# accept anything else 
+^file:///mnt

However, when I try to bootstrap from the initial seed list, nothing is injected:

[[email protected] apache-nutch-1.13]# bin/nutch inject crawl/crawldb urls 
Injector: starting at 2017-06-12 00:07:49 
Injector: crawlDb: crawl/crawldb 
Injector: urlDir: urls 
Injector: Converting injected urls to crawl db entries. 
Injector: overwrite: false 
Injector: update: false 
Injector: Total urls rejected by filters: 1 
Injector: Total urls injected after normalization and filtering: 0 
Injector: Total urls injected but already in CrawlDb: 0 
Injector: Total new urls injected: 0 
Injector: finished at 2017-06-12 00:10:27, elapsed: 00:02:38

I also tried changing seed.txt to the following, with no luck:

file:/mnt 
file:////<IP>:<export_path>

Please let me know if I am doing something wrong here.

Source

2017-06-11 Sujay Raj

From a URI point of view, a file system is not really different for Nutch: you just need to enable the protocol-file plugin and configure regex-urlfilter.txt like this:

+^file:///mnt/directory/ 
-.

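This way, you prevent it from indexing the parent directories of the directory you specified.

Keep in mind that since you have already mounted the NFS share locally, it behaves just like a normal local file system. For more information see https://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F.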
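For illustration, enabling the protocol-file plugin might look roughly like the fragment below in conf/nutch-site.xml. This is a sketch under assumptions, not a tested configuration: only protocol-file comes from the answer above. The remaining plugins in the `plugin.includes` value are modelled on a typical Nutch 1.x default and should be compared against the nutch-default.xml shipped with your version. The `file.crawl.parent` property is included as one way to keep the crawler out of parent directories, alongside the URL filter rules.

```xml
<!-- conf/nutch-site.xml (sketch; plugin list is an assumption, check nutch-default.xml) -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- protocol-file replaces protocol-http so that file: URLs can be fetched -->
    <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>file.crawl.parent</name>
    <!-- false: do not follow links up into parent directories -->
    <value>false</value>
  </property>
</configuration>
```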

Source

2017-06-11 21:22:09

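If you check your log, specifically 'Injector: Total urls rejected by filters: 1', it means that some URL filter is blocking your URL. Could you remove or comment out the line '-.*(/[^/]+)/[^/]+\1/[^/]+\1/' and try again? Otherwise, move your rule to the top of the file so that it is evaluated before any blocking rule.

Please try changing the URL filter rule to '+^file:/mnt/directory/' (only one slash, 'file:/'); see [NUTCH-1483](https://issues.apache.org/jira/browse/NUTCH-1483?focusedCommentId=14176160&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14176160). I will update the tutorial to reflect this gory detail.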
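To see why the number of slashes matters, here is a small Python sketch of first-match-wins filtering in the style of regex-urlfilter.txt. It is not Nutch code. It assumes that the first rule whose pattern is found in the URL decides the outcome, that a URL matching no rule is rejected, and (following the comment above) that the seed reaches the filter in the single-slash form `file:/mnt/...`. The function names and the `TOLERANT_RULES` variant are hypothetical and do not come from the thread.

```python
import re

def parse_rules(text):
    """Parse regex-urlfilter-style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(url, rules):
    """The first rule whose pattern is found in the URL decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

# Rule set from the answer (three slashes).
ANSWER_RULES = """
+^file:///mnt/directory/
-.
"""

# Rule set from the follow-up comment (one slash).
SINGLE_SLASH_RULES = """
+^file:/mnt/directory/
-.
"""

# Hypothetical variant that tolerates either form (not from the thread).
TOLERANT_RULES = """
+^file:/+mnt/directory/
-.
"""

# Assumed form of the URL after normalization: a single slash after "file:".
normalized = "file:/mnt/directory/report.txt"

print(accepts(normalized, parse_rules(ANSWER_RULES)))        # False
print(accepts(normalized, parse_rules(SINGLE_SLASH_RULES)))  # True
print(accepts(normalized, parse_rules(TOLERANT_RULES)))      # True
```

Under these assumptions the three-slash rule never matches the normalized URL, so the catch-all `-.` rejects it, which would be consistent with the "rejected by filters: 1" line in the Injector output.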
