Hadoop MapReduce job on files containing HTML tags

I have a bunch of large HTML files, and I want to run a Hadoop MapReduce job on them to find the most frequently used words. I wrote my mapper and reducer in Python and run them with Hadoop streaming.

Here is my mapper:

#!/usr/bin/env python 

import sys 
import re 
import string 

def remove_html_tags(in_text):
    '''
    Remove any HTML tags that are found.

    '''
    global flag
    in_text = in_text.lstrip()
    in_text = in_text.rstrip()
    in_text = in_text + "\n"

    if flag == True:
        in_text = "<" + in_text
        flag = False
    if re.search('^<', in_text) != None and re.search('(>\n+)$', in_text) == None:
        in_text = in_text + ">"
        flag = True
    p = re.compile(r'<[^<]*?>')
    in_text = p.sub('', in_text)
    return in_text

# input comes from STDIN (standard input) 
global flag 
flag=False 
for line in sys.stdin:
    # remove leading and trailing whitespace, set to lowercase and remove HTML tags
    line = line.strip().lower()
    line = remove_html_tags(line)
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        if word == '': continue
        for c in string.punctuation:
            word = word.replace(c, '')

        print '%s\t%s' % (word, 1)

Here is my reducer:

#!/usr/bin/env python 

from operator import itemgetter 
import sys 

# maps words to their counts 
word2count = {} 

# input comes from STDIN 
for line in sys.stdin: 
    # remove leading and trailing whitespace 
    line = line.strip() 

    # parse the input we got from mapper.py 
    word, count = line.split('\t', 1) 
    # convert count (currently a string) to int 
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        pass

sorted_word2count = sorted(word2count.iteritems(),
                           key=lambda (k, v): (v, k), reverse=True)

# write the results to STDOUT (standard output) 
for word, count in sorted_word2count: 
    print '%s\t%s'% (word, count)

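Whenever I pipe in a small sample string such as "hello world hello hello world ...", I get the correct ranked list as output. However, when I take a small HTML file and use cat to pipe it into my mapper, I get the following error (input2 contains some HTML code):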

$ cat input2 | /home/rohanbk/mapper.py | sort | /home/rohanbk/reducer.py
Traceback (most recent call last): 
    File "/home/rohanbk/reducer.py", line 15, in <module> 
    word, count = line.split('\t', 1) 
ValueError: need more than 1 value to unpack

Can anyone explain why I am getting this? Also, what is a good way to debug MapReduce jobs?

Source

2009-12-03 GobiasKoffi

You can reproduce the bug with just:

echo "hello - world" | ./mapper.py | sort | ./reducer.py

The problem is here:

if word == '': continue
for c in string.punctuation:
    word = word.replace(c, '')

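If word consists only of punctuation, as "-" does in the input above once the line has been split, the replacement turns it into an empty string after the emptiness check has already passed. The mapper then emits a line containing only a tab and a 1; the reducer's strip() removes the leading tab, so line.split('\t', 1) returns a single value and the unpacking fails. So just move the empty-string check to after the replacement.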
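As a minimal sketch of that reordering (the helper name clean_words is hypothetical and not part of the original mapper, and the HTML-tag removal step is left out for brevity), the word loop could look like this:

```python
import string


def clean_words(line):
    """Yield the non-empty, punctuation-free words of one input line.

    clean_words is a hypothetical helper introduced for illustration only.
    """
    for word in line.split():
        for c in string.punctuation:
            word = word.replace(c, '')
        # Test for emptiness only after stripping punctuation, so a token
        # such as "-" is dropped instead of being emitted as an empty key.
        if word == '':
            continue
        yield word


# Emit tab-delimited (word, 1) pairs, as the original mapper does.
for word in clean_words("hello - world"):
    print('%s\t%s' % (word, 1))
```

With this ordering, the "hello - world" input only produces the keys hello and world, so every line reaching the reducer contains both a word and a count.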

Source

2009-12-03 21:53:22 codelogic

Is it safe to assume that if you use cat and get the expected output, the MapReduce step will work? – GobiasKoffi 2009-12-04 02:44:31

For a more pleasant Python/Hadoop integration experience, you might consider using Dumbo. – drxzcl 2009-12-22 15:50:27
