How do I generate multiple file names at runtime in Hadoop?

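For example: K1, K2, data1, data2, data3

Here the key my mapper passes to the reducer is K1K2, and the values are data1, data2, data3.

I want to save this data in multiple files, each named K1K2 (or whatever key the reducer receives). If I use the MultipleOutputs class, I have to specify the file names before the mapper starts. But here I only know the key after reading the data that comes from the mapper. How should I proceed?

PS: I am new to this.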

2014-02-11 Sanchit

You can generate the file name in the reducer and pass it to MultipleOutputs like this:

private MultipleOutputs<Text, Text> out;

@Override
public void setup(Context context) {
    out = new MultipleOutputs<Text, Text>(context);
    ...
}

@Override
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    for (Text t : values) {
        // generateFileName is your own function; it returns the base output path
        out.write(key, t, generateFileName(<parameter list...>));
    }
}

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    out.close(); // closes every output file opened through MultipleOutputs
}

For details, read the MultipleOutputs class reference: https://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
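The answer above leaves generateFileName up to you. As a rough sketch of what such a function might look like, the plain-Java helper below turns a key into a string that is safe to use as a base output path. The class name FileNames, the String parameter, the replacement rule and the "unknown" fallback are all assumptions made for illustration; none of them come from Hadoop or from the answer.

```java
class FileNames {
    // Hypothetical helper: maps a reducer key (already converted with key.toString())
    // to a base output path. Any character outside letters, digits, '.', '_' and '-'
    // is replaced with '_', so a key cannot accidentally create subdirectories.
    static String generateFileName(String key) {
        String safe = key.trim().replaceAll("[^A-Za-z0-9._-]", "_");
        // Fall back to a fixed name if the key was empty or only whitespace.
        return safe.isEmpty() ? "unknown" : safe;
    }

    public static void main(String[] args) {
        System.out.println(generateFileName("K1K2"));    // prints K1K2
        System.out.println(generateFileName("K1 K2/x")); // prints K1_K2_x
    }
}
```

In the reducer this would be called as generateFileName(key.toString()). If you do want one directory per key, keep the "/" in the returned string instead of replacing it.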

2014-02-11 13:43:50

No, it gives an error: java.lang.IllegalArgumentException: Named output 'K1K2' not defined at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.checkNamedOutputName(MultipleOutputs.java:193) – Sanchit

If I add MultipleOutputs.addNamedOutput(job, FileName1.toString(), TextOutputFormat.class, NullWritable.class, Text.class); in the generateOutput() method, how do I get hold of the job in the reducer? I have just started, so this may be a very basic question. – Sanchit

There is no need for named outputs. Have a look at my post. –

-1

There is no need to predefine the output file names. You can use MultipleOutputs like this:

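import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Value is a placeholder for your own Writable value type
public class YourReducer extends Reducer<Text, Value, Text, Value> {
    private Value result = null;
    private MultipleOutputs<Text, Value> out;

    @Override
    public void setup(Context context) {
        out = new MultipleOutputs<Text, Value>(context);
    }

    @Override
    public void reduce(Text key, Iterable<Value> values, Context context)
            throws IOException, InterruptedException {
        // do your code (compute result from values)
        // Text has no getText() method; toString() returns the key as a String
        out.write(key, result, "outputpath/" + key.toString());
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }
}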

This produces output under the following paths:

outputpath/K1 
      /K2 
      /K3 
.......

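For this you should use LazyOutputFormat.setOutputFormatClass() rather than FileOutputFormat. You also need to add job.setOutputFormatClass(NullOutputFormat.class) to the job configuration. But don't forget to set the input and output paths as usual with FileInputFormat.setInputPaths() and FileOutputFormat.setOutputPath(). The generated files will then be placed relative to the specified output path.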
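To make "relative to the specified output path" a bit more concrete: Hadoop names reducer output files as the base path followed by -r- and a five-digit partition number (the same pattern as the default part-r-00000). The sketch below only mimics that pattern in plain Java; it is not Hadoop code, and the class name OutputNames, the method reducerFileName and the /user/out directory are made up for illustration.

```java
import java.util.Locale;

class OutputNames {
    // Mimics the naming pattern of reducer output files:
    // <job output dir>/<base output path>-r-<five-digit partition number>.
    // Illustration only; the real name is produced inside Hadoop.
    static String reducerFileName(String outputDir, String baseOutputPath, int partition) {
        return String.format(Locale.ROOT, "%s/%s-r-%05d", outputDir, baseOutputPath, partition);
    }

    public static void main(String[] args) {
        // Base path "outputpath/K1" as written by the reducer, partition 0.
        System.out.println(reducerFileName("/user/out", "outputpath/K1", 0));
        // prints /user/out/outputpath/K1-r-00000
    }
}
```

So with the reducer shown earlier, the entry listed as outputpath/K1 would be expected to appear as a file such as outputpath/K1-r-00000 under the job's output directory, one per reducer that saw that key.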

2014-02-12 09:02:22

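...and you must define the MultipleOutputs in the 'driver' that runs the job. Correct? – OhadR

Do you mean defining the multiple outputs in the driver? –

The file that runs the job must call MultipleOutputs.addNamedOutput(job, ..., TextOutputFormat.class, ...) – OhadR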
