MapReduce with the filename as key and the contents as value, for many small files

I have read "FileInputFormat where filename is KEY and text contents are VALUE", "How to get Filename/File Contents as key/value input for MAP when running a Hadoop MapReduce Job?" and "Getting Filename/FileData as key/value input for Map when running a Hadoop MapReduce Job", but I am having trouble getting started. I have not done anything with Hadoop before, so I am wary of heading down the wrong path if others can already see that I am making a mistake.

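I have a directory containing roughly 100K small HTML files, and I want to build an inverted index with Amazon Elastic MapReduce, implemented in Java. Once I have the file contents, I know what I want my map and reduce functions to do.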
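For readers unfamiliar with the goal, here is a minimal in-memory sketch of what an inverted index over (filename, contents) pairs looks like. It uses only the Java standard library, not Hadoop, and everything in it (the class name `InvertedIndexSketch`, the regex-based tag stripping, the sample file names) is an illustrative assumption rather than the asker's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {

    // "Map" side: the distinct lower-cased terms found in one HTML document.
    static TreeSet<String> terms(String html) {
        // Crude tag stripping; a real job would more likely use an HTML parser.
        String text = html.replaceAll("<[^>]*>", " ").toLowerCase(Locale.ROOT);
        TreeSet<String> out = new TreeSet<>();
        for (String token : text.split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    // "Reduce" side: for every term, collect the names of the files containing it.
    static TreeMap<String, TreeSet<String>> build(Map<String, String> files) {
        TreeMap<String, TreeSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (String term : terms(file.getValue())) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(file.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical file names and contents, for illustration only.
        Map<String, String> files = new LinkedHashMap<>();
        files.put("example.com_a.html", "<html><body>Hadoop map</body></html>");
        files.put("example.com_b.html", "<p>Hadoop reduce</p>");
        System.out.println(build(files));
    }
}
```

In a real MapReduce job the `terms` step would run inside the mapper, which emits (term, filename) pairs, and the grouping in `build` would be done by the shuffle and the reducer. That is why the mapper needs the filename as its input key.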

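After looking here, my understanding is that I need to subclass FileInputFormat and override isSplitable. However, my filenames are related to the URLs the HTML came from, so I want to keep them. What do I need to do to replace NullWritable with Text? Any other advice?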

Source: 2015-12-07, kcmgrew

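You should use WholeFileInputFormat to pass each whole file to your mapper. Note that WholeFileInputFormat is not shipped with Hadoop itself. It is a custom input format (a well-known version appears as an example in "Hadoop: The Definitive Guide") that you add to your own project.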

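// Old org.apache.hadoop.mapred API: conf is a JobConf.
conf.setInputFormat(WholeFileInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
// "input" and "output" are placeholder paths.
FileInputFormat.setInputPaths(conf, new Path("input"));
FileOutputFormat.setOutputPath(conf, new Path("output"));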
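
The configuration above assumes a WholeFileInputFormat class is available on the job's classpath. The answer does not include one, so the following is only a sketch of how such a class might look for the old `org.apache.hadoop.mapred` API, with the key changed from NullWritable to a Text holding the file name, as the question asks. The nested `WholeFileRecordReader` class and the choice of BytesWritable for the value are assumptions, not something the answer specifies.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Sketch only: not part of Hadoop; it has to be packaged in the job jar.
public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false; // never split, so each file becomes exactly one record
    }

    @Override
    public RecordReader<Text, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }

    // Hypothetical helper: emits one (file name, file bytes) pair per split.
    static class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
        private final FileSplit split;
        private final JobConf job;
        private boolean processed = false;

        WholeFileRecordReader(FileSplit split, JobConf job) {
            this.split = split;
            this.job = job;
        }

        @Override
        public boolean next(Text key, BytesWritable value) throws IOException {
            if (processed) {
                return false; // the single record was already returned
            }
            Path file = split.getPath();
            byte[] contents = new byte[(int) split.getLength()];
            FileSystem fs = file.getFileSystem(job);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            key.set(file.getName());                 // key: the file name
            value.set(contents, 0, contents.length); // value: the raw file bytes
            processed = true;
            return true;
        }

        @Override public Text createKey() { return new Text(); }
        @Override public BytesWritable createValue() { return new BytesWritable(); }
        @Override public long getPos() { return processed ? split.getLength() : 0; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}
```

With this in place, the mapper would be declared with Text and BytesWritable as its input key and value types. When turning the value back into a string, read only the first `value.getLength()` bytes of `value.getBytes()`, because the backing array can be larger than the data. Use `file.toString()` instead of `file.getName()` if the full path is needed to recover the URL.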

Source: 2015-12-07 08:47:07
