2017-03-27

Getting the maximum word count in Hadoop MapReduce word count

So, I've been following the MapReduce Python code on this site (http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/), which returns a word count from a text file (i.e. each word and the number of times it occurs in the text). However, I'd like to know how to return the word with the maximum number of occurrences. The mapper and reducer are as follows:

#Mapper 

import sys 

# input comes from STDIN (standard input) 
for line in sys.stdin: 
    # remove leading and trailing whitespace 
    line = line.strip() 
    # split the line into words 
    words = line.split() 
    # increase counters 
    for word in words: 
        # write the results to STDOUT (standard output); 
        # what we output here will be the input for the 
        # Reduce step, i.e. the input for reducer.py 
        # 
        # tab-delimited; the trivial word count is 1 
        print '%s\t%s' % (word, 1) 

#Reducer 

from operator import itemgetter 
import sys 

current_word = None 
current_count = 0 
word = None 

# input comes from STDIN 
for line in sys.stdin: 
    # remove leading and trailing whitespace 
    line = line.strip() 

    # parse the input we got from mapper.py 
    word, count = line.split('\t', 1) 

    # convert count (currently a string) to int 
    try: 
        count = int(count) 
    except ValueError: 
        # count was not a number, so silently 
        # ignore/discard this line 
        continue 

    # this IF-switch only works because Hadoop sorts map output 
    # by key (here: word) before it is passed to the reducer 
    if current_word == word: 
        current_count += count 
    else: 
        if current_word: 
            # write result to STDOUT 
            print '%s\t%s' % (current_word, current_count) 
        current_count = count 
        current_word = word 

# do not forget to output the last word if needed! 
if current_word == word: 
    print '%s\t%s' % (current_word, current_count) 
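For reference, the mapper/reducer pair can be exercised without a Hadoop cluster by simulating the shuffle/sort step with a plain sort. A minimal Python sketch of that simulation (the sample text here is made up for illustration):

```python
# Local simulation of the streaming pipeline:
# mapper -> sort by key -> reducer, without Hadoop itself.

def map_words(lines):
    # emit (word, 1) for every word, like the mapper above
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_counts(pairs):
    # pairs must arrive sorted by word, mirroring Hadoop's shuffle/sort
    current_word, current_count = None, 0
    for word, count in pairs:
        if current_word == word:
            current_count += count
        else:
            if current_word is not None:
                yield (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        yield (current_word, current_count)

text = ["the quick brown fox", "the lazy dog the end"]
counts = dict(reduce_counts(sorted(map_words(text))))
print(counts['the'])  # 3
```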

So, I know I need to add something to the end of the reducer, but I'm just not entirely sure what.


So you just want to find the word with the maximum count and output it? –


Exactly. The word with the highest count, and the count itself. – tattybojangler


I'm guessing some code needs to be added at the end of the reducer, but I've tried to no avail. – tattybojangler

Answer


You need to set just one reducer so that it aggregates all the values (-numReduceTasks 1).
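As a hedged sketch of how that flag could be passed (the streaming jar path and the HDFS input/output paths below are placeholders, not from the original post):

```shell
# -numReduceTasks 1 forces a single reducer, so it sees
# every (word, count) pair and can pick the global maximum.
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
    -numReduceTasks 1 \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/hduser/input \
    -output /user/hduser/output
```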

Here is how your reducer should look:

import sys 

current_word = None 
current_count = 0 
max_count = 0 
max_word = None 

for line in sys.stdin: 
    # remove leading and trailing whitespace 
    line = line.strip() 

    # parse the input we got from mapper.py 
    word, count = line.split('\t', 1) 

    # convert count (currently a string) to int 
    try: 
        count = int(count) 
    except ValueError: 
        # count was not a number, so silently 
        # ignore/discard this line 
        continue 

    # this IF-switch only works because Hadoop sorts map output 
    # by key (here: word) before it is passed to the reducer 
    if current_word == word: 
        current_count += count 
    else: 
        # check if the word just finished beats the current maximum 
        if current_count > max_count: 
            max_count = current_count 
            max_word = current_word 
        current_count = count 
        current_word = word 

# do not forget to check the last word if needed! 
if current_count > max_count: 
    max_count = current_count 
    max_word = current_word 

print '%s\t%s' % (max_word, max_count) 

But with only one reducer you lose parallelism, so it might be faster to run this as a second job after the first one, rather than instead of it. That way, the second job's mapper would be the same as the reducer above.
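The chained-job idea above can be sketched as a small second-stage reducer. This is illustrative only (the function name and sample input are mine, not from the answer): it reads the `word<TAB>count` lines produced by the first word-count job and keeps the single maximum.

```python
# Second-job reducer sketch: reads "word<TAB>count" lines (the output
# of the word-count job) from stdin and prints the most frequent word.
# Assumes a single reducer (-numReduceTasks 1) so it sees all pairs.
import sys

def max_word(lines):
    best_word, best_count = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, count = line.split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            continue  # silently skip malformed lines
        if count > best_count:
            best_word, best_count = word, count
    return best_word, best_count

if __name__ == '__main__':
    w, c = max_word(sys.stdin)
    print('%s\t%s' % (w, c))
```

Because each input line already carries a final count, this stage needs no sorting at all, only a single pass.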