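I am trying to use MapReduce from Python with the dumbo library. Below is my experimental test code. I expected the reducer to output every record it receives from the mapper. Why do the reduce input records differ from the reduce output records?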
import dumbo

count = 0  # module-level counter shared across reducer calls

def mapper(key, value):
    fields = value.split("\t")
    myword = fields[0] + "\t" + fields[1]
    yield myword, value

def reducer(key, values):
    for value in values:
        mypid = value
        words = value.split("\t")
    global count
    count = count + 1
    myword = str(count) + "--" + words[1]  # to count total lines in reducer's output records
    yield myword, 1

if __name__ == "__main__":
    dumbo.run(mapper, reducer)
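Below is the Map-Reduce Framework log. I expected "Reduce input records" to equal "Reduce output records", but they do not. Is something wrong with my test code, or have I misunderstood something about MapReduce? Thanks.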
Map-Reduce Framework
Map input records=405057
Map output records=405057
Map output bytes=107178919
Map output materialized bytes=108467155
Input split bytes=2496
Combine input records=0
Combine output records=0
Reduce input groups=63096
Reduce shuffle bytes=108467155
Reduce input records=405057
Reduce output records=63096
Spilled Records=810114
It works when the reducer is modified as follows:
def reducer(key, values):
    global count
    for value in values:
        mypid = value
        words = value.split("\t")
        count = count + 1
        myword = str(count) + "--" + words[1]  # to count total lines in reducer's output records
        yield myword, 1
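The effect of where the yield sits can be reproduced without Hadoop or dumbo. The sketch below is only an illustration: `run_reduce` is a hypothetical stand-in for the shuffle and reduce phase (sort by key, group, call the reducer once per distinct key), and the two reducers are simplified versions of the ones in this thread. With the yield after the loop, the reducer emits one record per key, which matches the log above, where Reduce output records equals Reduce input groups (63096). With the yield inside the loop, it emits one record per input value.

```python
from itertools import groupby

def reducer_yield_outside(key, values):
    # The loop consumes every value, but the single yield
    # runs only once, after the loop has finished.
    for value in values:
        last = value
    yield key, last

def reducer_yield_inside(key, values):
    # One yield per input value.
    for value in values:
        yield key, value

def run_reduce(reducer, pairs):
    # Hypothetical stand-in for shuffle + reduce: sort by key, group,
    # and call the reducer once per distinct key.
    output = []
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for key, group in groupby(ordered, key=lambda kv: kv[0]):
        output.extend(reducer(key, (v for _, v in group)))
    return output

pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]
outside = run_reduce(reducer_yield_outside, pairs)
inside = run_reduce(reducer_yield_inside, pairs)
print(len(pairs), len(outside), len(inside))  # prints: 5 2 5
```

Here 5 input records fall into 2 groups, so the first reducer produces 2 output records and the second produces 5.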
Thanks ruakh, I solved it. – Naturehigh