
I developed a MapReduce application, based on the min/max/count example in Donald Miner's book (MapReduce Design Patterns), that determines the dates of the first and last comments posted by a user along with the total number of comments that user has posted.

The problem is in my reducer. I group the comments by user ID. My test data contains two user IDs, each of which posted three comments on different dates, so there are six rows in total.

So my reducer output should print two records, each showing the user's first and last comment dates along with the total comment count for that user ID.

However, my reducer is printing six records. Can someone point out what is wrong with the code below?

import java.io.IOException; 
import java.text.SimpleDateFormat; 
import java.util.Date; 
import java.util.Map; 

import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.mapreduce.Job; 
import org.apache.hadoop.mapreduce.Mapper; 
import org.apache.hadoop.mapreduce.Reducer; 
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 
import org.apache.hadoop.util.GenericOptionsParser; 
import org.arjun.mapreduce.patterns.mapreducepatterns.MRDPUtils; 

public class MinMaxCount { 

    public static class MinMaxCountMapper extends 
      Mapper<Object, Text, Text, MinMaxCountTuple> { 

     private Text outuserId = new Text(); 
     private MinMaxCountTuple outTuple = new MinMaxCountTuple(); 

     private final static SimpleDateFormat sdf = 
        new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS"); 

     @Override 
     protected void map(Object key, Text value, 
       org.apache.hadoop.mapreduce.Mapper.Context context) 
       throws IOException, InterruptedException { 

      Map<String, String> parsed = 
        MRDPUtils.transformXMLtoMap(value.toString()); 

      String date = parsed.get("CreationDate"); 
      String userId = parsed.get("UserId"); 

      try { 
       Date creationDate = sdf.parse(date); 
       outTuple.setMin(creationDate); 
       outTuple.setMax(creationDate); 
      } catch (java.text.ParseException e) { 
       System.err.println("Unable to parse Date in XML"); 
       System.exit(3); 
      } 

      outTuple.setCount(1); 
      outuserId.set(userId); 

      context.write(outuserId, outTuple); 

     } 

    } 

    public static class MinMaxCountReducer extends 
      Reducer<Text, MinMaxCountTuple, Text, MinMaxCountTuple> { 

     private MinMaxCountTuple result = new MinMaxCountTuple(); 


     protected void reduce(Text userId, Iterable<MinMaxCountTuple> values, 
       org.apache.hadoop.mapreduce.Reducer.Context context) 
       throws IOException, InterruptedException { 

      result.setMin(null); 
      result.setMax(null); 
      result.setCount(0); 
      int sum = 0; 
      int count = 0; 
      for(MinMaxCountTuple tuple: values) 
      { 
       if(result.getMin() == null || 
         tuple.getMin().compareTo(result.getMin()) < 0) 
       { 
        result.setMin(tuple.getMin()); 
       } 

       if(result.getMax() == null || 
         tuple.getMax().compareTo(result.getMax()) > 0) { 
        result.setMax(tuple.getMax()); 
       } 

       System.err.println(count++); 

       sum += tuple.getCount(); 
      } 

      result.setCount(sum); 
      context.write(userId, result); 
     } 

    } 

    /** 
    * @param args 
    */ 
    public static void main(String[] args) throws Exception { 
     Configuration conf = new Configuration(); 
     String [] otherArgs = new GenericOptionsParser(conf, args) 
          .getRemainingArgs(); 
     if(otherArgs.length < 2) 
     { 
      System.err.println("Usage: MinMaxCount <input> <output>"); 
      System.exit(2); 
     } 


     Job job = new Job(conf, "Summarization min max count"); 
     job.setJarByClass(MinMaxCount.class); 
     job.setMapperClass(MinMaxCountMapper.class); 
     //job.setCombinerClass(MinMaxCountReducer.class); 
     job.setReducerClass(MinMaxCountReducer.class); 
     job.setOutputKeyClass(Text.class); 
     job.setOutputValueClass(MinMaxCountTuple.class); 

     FileInputFormat.setInputPaths(job, new Path(otherArgs[0])); 
     FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); 

     boolean result = job.waitForCompletion(true); 
     if(result) 
     { 
      System.exit(0); 
     }else { 
      System.exit(1); 
     } 

    } 

} 

Input: 
<row Id="8189677" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-07-30T07:29:33.343" UserId="831878" /> 
<row Id="8189677" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-08-01T07:29:33.343" UserId="831878" /> 
<row Id="8189677" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-08-02T07:29:33.343" UserId="831878" /> 
<row Id="8189678" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-06-30T07:29:33.343" UserId="931878" /> 
<row Id="8189678" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-07-01T07:29:33.343" UserId="931878" /> 
<row Id="8189678" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-08-02T07:29:33.343" UserId="931878" /> 

Output file contents (part-r-00000):

831878 2011-07-30T07:29:33.343 2011-07-30T07:29:33.343 1 
831878 2011-08-01T07:29:33.343 2011-08-01T07:29:33.343 1 
831878 2011-08-02T07:29:33.343 2011-08-02T07:29:33.343 1 
931878 2011-06-30T07:29:33.343 2011-06-30T07:29:33.343 1 
931878 2011-07-01T07:29:33.343 2011-07-01T07:29:33.343 1 
931878 2011-08-02T07:29:33.343 2011-08-02T07:29:33.343 1 

Job submission output:


12/12/16 11:13:52 INFO input.FileInputFormat: Total input paths to process : 1 
12/12/16 11:13:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
12/12/16 11:13:52 WARN snappy.LoadSnappy: Snappy native library not loaded 
12/12/16 11:13:52 INFO mapred.JobClient: Running job: job_201212161107_0001 
12/12/16 11:13:53 INFO mapred.JobClient: map 0% reduce 0% 
12/12/16 11:14:06 INFO mapred.JobClient: map 100% reduce 0% 
12/12/16 11:14:18 INFO mapred.JobClient: map 100% reduce 100% 
12/12/16 11:14:23 INFO mapred.JobClient: Job complete: job_201212161107_0001 
12/12/16 11:14:23 INFO mapred.JobClient: Counters: 26 
12/12/16 11:14:23 INFO mapred.JobClient: Job Counters 
12/12/16 11:14:23 INFO mapred.JobClient:  Launched reduce tasks=1 
12/12/16 11:14:23 INFO mapred.JobClient:  SLOTS_MILLIS_MAPS=12264 
12/12/16 11:14:23 INFO mapred.JobClient:  Total time spent by all reduces waiting after reserving slots (ms)=0 
12/12/16 11:14:23 INFO mapred.JobClient:  Total time spent by all maps waiting after reserving slots (ms)=0 
12/12/16 11:14:23 INFO mapred.JobClient:  Launched map tasks=1 
12/12/16 11:14:23 INFO mapred.JobClient:  Data-local map tasks=1 
12/12/16 11:14:23 INFO mapred.JobClient:  SLOTS_MILLIS_REDUCES=10124 
12/12/16 11:14:23 INFO mapred.JobClient: File Output Format Counters 
12/12/16 11:14:23 INFO mapred.JobClient:  Bytes Written=342 
12/12/16 11:14:23 INFO mapred.JobClient: FileSystemCounters 
12/12/16 11:14:23 INFO mapred.JobClient:  FILE_BYTES_READ=204 
12/12/16 11:14:23 INFO mapred.JobClient:  HDFS_BYTES_READ=888 
12/12/16 11:14:23 INFO mapred.JobClient:  FILE_BYTES_WRITTEN=43479 
12/12/16 11:14:23 INFO mapred.JobClient:  HDFS_BYTES_WRITTEN=342 
12/12/16 11:14:23 INFO mapred.JobClient: File Input Format Counters 
12/12/16 11:14:23 INFO mapred.JobClient:  Bytes Read=761 
12/12/16 11:14:23 INFO mapred.JobClient: Map-Reduce Framework 
12/12/16 11:14:23 INFO mapred.JobClient:  Map output materialized bytes=204 
12/12/16 11:14:23 INFO mapred.JobClient:  Map input records=6 
12/12/16 11:14:23 INFO mapred.JobClient:  Reduce shuffle bytes=0 
12/12/16 11:14:23 INFO mapred.JobClient:  Spilled Records=12 
12/12/16 11:14:23 INFO mapred.JobClient:  Map output bytes=186 
12/12/16 11:14:23 INFO mapred.JobClient:  Total committed heap usage (bytes)=269619200 
12/12/16 11:14:23 INFO mapred.JobClient:  Combine input records=0 
12/12/16 11:14:23 INFO mapred.JobClient:  SPLIT_RAW_BYTES=127 
12/12/16 11:14:23 INFO mapred.JobClient:  Reduce input records=6 
12/12/16 11:14:23 INFO mapred.JobClient:  Reduce input groups=2 
12/12/16 11:14:23 INFO mapred.JobClient:  Combine output records=0 
12/12/16 11:14:23 INFO mapred.JobClient:  Reduce output records=6 
12/12/16 11:14:23 INFO mapred.JobClient:  Map output records=6 
+1

Could you post the input data you are using (in the original question, not as a comment)? –

+0

This code sample is from the "Numerical Summarization" chapter of my book, MapReduce Design Patterns. I would love to figure out the problem, but I can't see it from the information provided. I'll start looking over our code and run it through some sample data. If you could post the sample input/output you are seeing, that would be very helpful. –

+0

https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch2/MinMaxCountDriver.java <--- the original code, for anyone interested –

Answer

4

Ah, caught the culprit. Just change the signature of your reduce method as follows:

protected void reduce(Text userId, Iterable<MinMaxCountTuple> values, Context context) throws IOException, InterruptedException {

Basically, you only need Context there, not org.apache.hadoop.mapreduce.Reducer.Context.
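For reference, here is the question's reducer with just that one change applied. The @Override annotation is my addition rather than part of the original answer; with it in place, the raw-Context version would not even have compiled, which is exactly the safeguard you want here:

    public static class MinMaxCountReducer extends 
      Reducer<Text, MinMaxCountTuple, Text, MinMaxCountTuple> { 

     private MinMaxCountTuple result = new MinMaxCountTuple(); 

     @Override // compile-time proof that this really overrides Reducer.reduce 
     protected void reduce(Text userId, Iterable<MinMaxCountTuple> values, 
       Context context) throws IOException, InterruptedException { 

      result.setMin(null); 
      result.setMax(null); 
      int sum = 0; 
      for (MinMaxCountTuple tuple : values) { 
       // keep the earliest creation date seen for this user 
       if (result.getMin() == null || 
         tuple.getMin().compareTo(result.getMin()) < 0) { 
        result.setMin(tuple.getMin()); 
       } 
       // keep the latest creation date seen for this user 
       if (result.getMax() == null || 
         tuple.getMax().compareTo(result.getMax()) > 0) { 
        result.setMax(tuple.getMax()); 
       } 
       sum += tuple.getCount(); 
      } 
      result.setCount(sum); 
      // exactly one output record per user ID 
      context.write(userId, result); 
     } 
    } 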

Now the output looks like this:

831878 2011-07-30T07:29:33.343 2011-08-02T07:29:33.343 3 
931878 2011-06-30T07:29:33.343 2011-08-02T07:29:33.343 3 

I tested it locally for you, and this change does work. It is strange behavior, though, and it would be great if anyone could shed more light on it; it has to do with generics. When org.apache.hadoop.mapreduce.Reducer.Context is used, the compiler warns:

"Reducer.Context is a raw type. References to generic type Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>.Context should be parameterized" 

whereas using plain Context compiles without complaint.
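As for why the raw type changes the behavior (my reading of it, not part of the original answer): with the raw Reducer.Context parameter, the method's signature matches neither the parameterized signature of Reducer.reduce nor its erasure (the Iterable<MinMaxCountTuple> parameter is still parameterized), so javac treats it as an overload rather than an override. The framework therefore invokes the inherited default reduce, which in Hadoop's Reducer is an identity pass-through, roughly:

    // paraphrased from the Hadoop 1.x Reducer source: the default reduce 
    // writes every input value straight through, one output record per 
    // input record -- exactly the six-record output seen in the question 
    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context) 
      throws IOException, InterruptedException { 
     for (VALUEIN value : values) { 
      context.write((KEYOUT) key, (VALUEOUT) value); 
     } 
    } 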

+0

Thanks Amar, that solved the problem. Thank you for your time. – user1207659

+0

You're welcome :) Don't forget to upvote the answer :) Cheers. – Amar

+0

Why is that? What is the difference between using 'org.apache.hadoop.mapreduce.Reducer.Context' and plain 'Context'? I'm curious too.... – LazerSharks