2017-03-27

Getting the maximum word count in Hadoop MapReduce word count

So, I've been following the MapReduce Python code on this site (http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/), which returns a word count from a text file (i.e. each word and the number of times it occurs in the text). However, I'd like to know how to return the word with the maximum number of occurrences. The mapper and reducer are as follows:

#Mapper 

import sys 

# input comes from STDIN (standard input) 
for line in sys.stdin: 
    # remove leading and trailing whitespace 
    line = line.strip() 
    # split the line into words 
    words = line.split() 
    # increase counters 
    for word in words: 
        # write the results to STDOUT (standard output); 
        # what we output here will be the input for the 
        # Reduce step, i.e. the input for reducer.py 
        # 
        # tab-delimited; the trivial word count is 1 
        print '%s\t%s' % (word, 1) 

#Reducer 

from operator import itemgetter 
import sys 

current_word = None 
current_count = 0 
word = None 

# input comes from STDIN 
for line in sys.stdin: 
    # remove leading and trailing whitespace 
    line = line.strip() 

    # parse the input we got from mapper.py 
    word, count = line.split('\t', 1) 

    # convert count (currently a string) to int 
    try: 
        count = int(count) 
    except ValueError: 
        # count was not a number, so silently 
        # ignore/discard this line 
        continue 

    # this IF-switch only works because Hadoop sorts map output 
    # by key (here: word) before it is passed to the reducer 
    if current_word == word: 
        current_count += count 
    else: 
        if current_word: 
            # write result to STDOUT 
            print '%s\t%s' % (current_word, current_count) 
        current_count = count 
        current_word = word 

# do not forget to output the last word if needed! 
if current_word == word: 
    print '%s\t%s' % (current_word, current_count) 
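For reference, the mapper/reducer pair can be exercised without a Hadoop cluster by simulating the shuffle/sort step with a plain sort. A minimal Python sketch of that simulation (the sample text here is made up for illustration):

```python
# Local simulation of the streaming pipeline:
# mapper -> sort by key -> reducer, without Hadoop itself.

def map_words(lines):
    # emit (word, 1) for every word, like the mapper above
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_counts(pairs):
    # pairs must arrive sorted by word, mirroring Hadoop's shuffle/sort
    current_word, current_count = None, 0
    for word, count in pairs:
        if current_word == word:
            current_count += count
        else:
            if current_word is not None:
                yield (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        yield (current_word, current_count)

text = ["the quick brown fox", "the lazy dog the end"]
counts = dict(reduce_counts(sorted(map_words(text))))
print(counts['the'])  # 3
```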

So, I know I need to add something to the end of the reducer, but I'm just not entirely sure what.


So you just want to find the word with the maximum count and output it? –


Exactly. The word with the highest count, and the count itself. – tattybojangler


I'm guessing some code needs to be added at the end of the reducer, but I've tried to no avail. – tattybojangler

Answer


You need to set just one reducer so that it aggregates all the values (-numReduceTasks 1).
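As a hedged sketch of how that flag could be passed (the streaming jar path and the HDFS input/output paths below are placeholders, not from the original post):

```shell
# -numReduceTasks 1 forces a single reducer, so it sees
# every (word, count) pair and can pick the global maximum.
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
    -numReduceTasks 1 \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/hduser/input \
    -output /user/hduser/output
```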

Here is how your reducer should look:

import sys 

current_word = None 
current_count = 0 
max_count = 0 
max_word = None 

for line in sys.stdin: 
    # remove leading and trailing whitespace 
    line = line.strip() 

    # parse the input we got from mapper.py 
    word, count = line.split('\t', 1) 

    # convert count (currently a string) to int 
    try: 
        count = int(count) 
    except ValueError: 
        # count was not a number, so silently 
        # ignore/discard this line 
        continue 

    # this IF-switch only works because Hadoop sorts map output 
    # by key (here: word) before it is passed to the reducer 
    if current_word == word: 
        current_count += count 
    else: 
        # check if the word just finished beats the current maximum 
        if current_count > max_count: 
            max_count = current_count 
            max_word = current_word 
        current_count = count 
        current_word = word 

# do not forget to check the last word if needed! 
if current_count > max_count: 
    max_count = current_count 
    max_word = current_word 

print '%s\t%s' % (max_word, max_count) 

But with only one reducer you lose parallelism, so it might be faster to run this as a second job after the first one, rather than instead of it. That way, the second job's mapper would be the same as the reducer above.
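The chained-job idea above can be sketched as a small second-stage reducer. This is illustrative only (the function name and sample input are mine, not from the answer): it reads the `word<TAB>count` lines produced by the first word-count job and keeps the single maximum.

```python
# Second-job reducer sketch: reads "word<TAB>count" lines (the output
# of the word-count job) from stdin and prints the most frequent word.
# Assumes a single reducer (-numReduceTasks 1) so it sees all pairs.
import sys

def max_word(lines):
    best_word, best_count = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, count = line.split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            continue  # silently skip malformed lines
        if count > best_count:
            best_word, best_count = word, count
    return best_word, best_count

if __name__ == '__main__':
    w, c = max_word(sys.stdin)
    print('%s\t%s' % (w, c))
```

Because each input line already carries a final count, this stage needs no sorting at all, only a single pass.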