Why are "Reduce input records" different from "Reduce output records"?

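I am trying out MapReduce in Python with the dumbo library. Below is my experimental test code. I expected every record passed from the mapper to the reducer to show up in the reducer's output. Why are "Reduce input records" and "Reduce output records" different?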

import dumbo

# Running counter used to number the reducer's output records.
count = 0

def mapper(key, value):
    # Key each record on its first two tab-separated fields.
    fields = value.split("\t")
    myword = fields[0] + "\t" + fields[1]
    yield myword, value

def reducer(key, values):
    global count
    for value in values:
        mypid = value
        words = value.split("\t")
    count = count + 1
    myword = str(count) + "--" + words[1]  # to count total lines in the reducer's output records
    yield myword, 1

if __name__ == "__main__":
    dumbo.run(mapper, reducer)

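Below is the Map-Reduce Framework log. I expected "Reduce input records" to equal "Reduce output records", but it does not. Is something wrong with my test code, or have I misunderstood something about MapReduce? Thanks.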

Map-Reduce Framework 
      Map input records=405057 
      Map output records=405057 
      Map output bytes=107178919 
      Map output materialized bytes=108467155 
      Input split bytes=2496 
      Combine input records=0 
      Combine output records=0 
      Reduce input groups=63096 
      Reduce shuffle bytes=108467155 
      Reduce input records=405057 
      Reduce output records=63096 
      Spilled Records=810114

It works as expected when the reducer is modified as follows:

def reducer(key, values):
    global count
    for value in values:
        mypid = value
        words = value.split("\t")

        # The yield is now inside the loop: one output record per input value.
        count = count + 1
        myword = str(count) + "--" + words[1]  # to count total lines in the reducer's output records
        yield myword, 1

Source

2015-11-13 Naturehigh

"I expected 'Reduce input records' to equal 'Reduce output records', but it does not."

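I don't know why you would expect that. The whole point of a reducer is that it receives a group of values at a time (grouped by the key the mapper emitted), and your reducer emits only one record per group (yield myword, 1). So the only way your "Reduce input records" could equal your "Reduce output records" is if every group contained exactly one record, that is, if the first two fields of each value were unique across your record set. Since that is evidently not the case, your reducer emits fewer records than it receives. Your log shows exactly this: "Reduce output records" (63096) matches "Reduce input groups" (63096), not "Reduce input records" (405057).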
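The effect can be reproduced locally without Hadoop. The following is only a rough sketch: run_local is a hypothetical stand-in for the shuffle phase (it sorts and groups mapper output with itertools.groupby and is not how Dumbo does it internally), the two reducers are simplified versions of the ones in the question, and the sample lines are made up.

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    # Same keying as in the question: first two tab-separated fields.
    fields = value.split("\t")
    yield fields[0] + "\t" + fields[1], value

def reducer_per_group(key, values):
    # yield sits after the loop: one output record per key group.
    words = None
    for value in values:
        words = value.split("\t")
    yield key, words[1]

def reducer_per_value(key, values):
    # yield sits inside the loop: one output record per input value.
    for value in values:
        words = value.split("\t")
        yield key, words[1]

def run_local(lines, reducer):
    """Hypothetical local shuffle: sort by key, group, then reduce.

    Returns (reduce input records, reduce input groups, reduce output records).
    """
    mapped = [pair for i, line in enumerate(lines) for pair in mapper(i, line)]
    mapped.sort(key=itemgetter(0))
    groups = 0
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        groups += 1
        output.extend(reducer(key, (value for _, value in group)))
    return len(mapped), groups, len(output)

# Made-up sample: 5 records, but only 3 distinct (field0, field1) keys.
lines = ["a\tx\t1", "a\tx\t2", "a\ty\t3", "b\tx\t4", "b\tx\t5"]

print(run_local(lines, reducer_per_group))  # (5, 3, 3)
print(run_local(lines, reducer_per_value))  # (5, 3, 5)
```

With the yield after the loop, the output count follows the number of groups; with the yield inside the loop, it follows the number of input records.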

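(In fact, this is the usual pattern; it is why a "reducer" is called that. The name comes from 'reduce' in functional languages, which reduces a collection of values to a single value.)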
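For illustration, here is the functional 'reduce' in Python's standard library (functools.reduce); the list of numbers is just a made-up example.

```python
from functools import reduce
from operator import add

# Five values go in, a single value comes out.
values = [3, 1, 4, 1, 5]
total = reduce(add, values, 0)

print(len(values), "values in ->", total)  # 5 values in -> 14
```

A MapReduce reducer does not have to collapse each group to one record, but doing so is the common case the name refers to.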

Source

2015-11-13 08:10:06 ruakh

Thanks ruakh, I've solved it. – Naturehigh
