Text reader class in Hadoop

I have a directory OUTPUT that contains the output files of a MapReduce job. The output files are text files written with TextOutputFormat.

Now I want to read the key-value pairs back from those output files. How can I do that with existing Hadoop classes? One way I can do it is as follows:

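// conf (a Configuration) and OUTPUT (the output directory) are defined elsewhere.
FileSystem fs = FileSystem.get(conf);
FileStatus[] files = fs.globStatus(new Path(OUTPUT + "/part-*"));
for (FileStatus file : files) {
    if (file.getLen() > 0) {
        // try-with-resources closes the reader and the underlying HDFS stream.
        try (BufferedReader bin = new BufferedReader(
                new InputStreamReader(fs.open(file.getPath()), StandardCharsets.UTF_8))) {
            String s = bin.readLine();
            while (s != null) {
                System.out.println(s); // each line still holds key and value together
                s = bin.readLine();
            }
        }
    }
}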

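This approach works, but it adds a lot of work for me, because I now have to parse the key-value pair out of every line by hand. I am looking for something more convenient that lets me read the key and the value directly into variables.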
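For reference, the manual parsing described above could look roughly like the sketch below. It assumes the job kept TextOutputFormat's default tab separator and that keys contain no tabs; the class name TextOutputParser and its methods are made up for illustration and are not part of Hadoop. Lines without a separator are treated as a key with an empty value, which matches how KeyValueLineRecordReader handles them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Hadoop class.
class TextOutputParser {
    // Assumption: the job used TextOutputFormat's default separator (a tab).
    static final char SEPARATOR = '\t';

    // Splits one output line at the FIRST separator only, so the value may itself contain tabs.
    static Map.Entry<String, String> parseLine(String line) {
        int pos = line.indexOf(SEPARATOR);
        if (pos < 0) {
            // No separator: the whole line is the key and the value is empty.
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    // Reads every line from the reader and returns the parsed pairs in order.
    static List<Map.Entry<String, String>> parseAll(BufferedReader reader) throws IOException {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            pairs.add(parseLine(line));
        }
        return pairs;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the contents of a part-* file; in the loop above, pass `bin` instead.
        BufferedReader sample = new BufferedReader(new StringReader("apple\t3\nbanana\t7\n"));
        for (Map.Entry<String, String> pair : parseAll(sample)) {
            System.out.println(pair.getKey() + " -> " + pair.getValue());
        }
    }
}
```

In the loop from the question, calling parseAll(bin) in place of the readLine loop would yield the pairs for each part file. Keys and values come back as strings, so any numeric types still have to be converted by the caller.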

Source

2012-06-12 Apurv

Here is a list of reader classes in Hadoop: http://www.buggybread.com/2015/09/apache-hadoop-list-of-reader-classes.html. It may help.

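Are you forced to use TextOutputFormat as the output format of the previous job?

If not, consider using SequenceFileOutputFormat; you can then use SequenceFile.Reader to read the files back as key/value pairs. You can still "view" such a file with hadoop fs -text path/to/output/part-r-00000

Edit: You can also use the KeyValueLineRecordReader class; you just need to pass a FileSplit to the constructor. (In the old org.apache.hadoop.mapred API, the constructor takes a Configuration and a FileSplit, and next(key, value) fills two Text objects for each line.)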

Source

2012-06-12 15:39:10

I am using TextOutputFormat because I need the output files to be human-readable. I have considered your suggestion, thanks; it will be my last resort. – Apurv
