1
我有一堆大型的HTML文件,我想在它们上运行Hadoop MapReduce作业来查找最常用的单词。我用Python编写了我的mapper和reducer,并使用Hadoop streaming来运行它们。包含HTML标记的Hadoop MapReduce作业
这里是我的映射:
#!/usr/bin/env python
import sys
import re
import string
def remove_html_tags(in_text):
'''
Remove any HTML tags that are found.
'''
global flag
in_text=in_text.lstrip()
in_text=in_text.rstrip()
in_text=in_text+"\n"
if flag==True:
in_text="<"+in_text
flag=False
if re.search('^<',in_text)!=None and re.search('(>\n+)$', in_text)==None:
in_text=in_text+">"
flag=True
p = re.compile(r'<[^<]*?>')
in_text=p.sub('', in_text)
return in_text
# input comes from STDIN (standard input)
global flag
flag=False
for line in sys.stdin:
# remove leading and trailing whitespace, set to lowercase and remove HTMl tags
line = line.strip().lower()
line = remove_html_tags(line)
# split the line into words
words = line.split()
# increase counters
for word in words:
# write the results to STDOUT (standard output);
# what we output here will be the input for the
# Reduce step, i.e. the input for reducer.py
#
# tab-delimited; the trivial word count is 1
if word =='': continue
for c in string.punctuation:
word= word.replace(c,'')
print '%s\t%s' % (word, 1)
这里是我的减速器:
#!/usr/bin/env python
from operator import itemgetter
import sys
# maps words to their counts
word2count = {}
# input comes from STDIN
for line in sys.stdin:
# remove leading and trailing whitespace
line = line.strip()
# parse the input we got from mapper.py
word, count = line.split('\t', 1)
# convert count (currently a string) to int
try:
count = int(count)
word2count[word] = word2count.get(word, 0) + count
except ValueError:
pass
sorted_word2count = sorted(word2count.iteritems(),
key=lambda(k,v):(v,k),reverse=True)
# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
print '%s\t%s'% (word, count)
每当我管一个小样本的小串像“世界你好你好你好世界......”我得到排名列表的正确输出。然而,当我尝试使用一个小的HTML文件,并尝试使用猫管HTML到我的映射器,我得到以下错误(输入2包含了一些HTML代码):
[email protected]:~$ cat input2 | /home/rohanbk/mapper.py | sort | /home/rohanbk/reducer.py
Traceback (most recent call last):
File "/home/rohanbk/reducer.py", line 15, in <module>
word, count = line.split('\t', 1)
ValueError: need more than 1 value to unpack
任何人都可以解释为什么我得到这个?另外,调试MapReduce作业程序的好方法是什么?
假设如果您使用cat并获得了期望的输出,那么MapReduce步骤将起作用是否安全? – GobiasKoffi 2009-12-04 02:44:31
为了更愉快的Python/Hadoop集成体验,您可以考虑使用Dumbo。 – drxzcl 2009-12-22 15:50:27