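Hadoop Streaming Python trivial example not working

2013-10-07

I have an input file that looks like this, already uploaded to HDFS at /tmp/input (fields are delimited by ^A, a non-printing character; this is how it looks in vi):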

A^A10 
A^A7 
A^A10 
A^A5 
A^A10 
A^A8 
B^A1 
A^A9 
B^A1 
A^A9 
B^A1 
A^A9 
B^A1  
A^A9 
B^A1 
A^A9 
B^A1 
A^A9 
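
For local testing, a file in the same format can be generated with a few lines of Python. This is only a sketch: the helper names format_record and parse_record are made up for illustration, and the sample values are a small subset of the rows above. The ^A shown by vi is the byte 0x01, which is what chr(1) produces.

```python
# Sketch: build and parse ^A-delimited records like the ones shown above.
# format_record / parse_record are hypothetical helper names.

SEP = chr(1)  # the ^A control character, i.e. '\x01'


def format_record(name, score):
    """Return one input line: name, ^A, score."""
    return "%s%s%d" % (name, SEP, score)


def parse_record(line):
    """Split one input line back into (name, int score)."""
    name, score = line.strip().split(SEP)
    return name, int(score)


sample = [("A", 10), ("A", 7), ("B", 1)]
lines = [format_record(name, score) for name, score in sample]
parsed = [parse_record(line) for line in lines]
```

Writing the lines to a file and uploading it to HDFS would then give input in the same shape as the data above.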

The mapper I wrote looks like this:

import sys 
for line in sys.stdin: 
    name, score = line.strip().split(chr(1)) 
    print '\t'.join([name, str(int(score)+1)]) 

The reducer looks roughly like this:

import sys
from datetime import datetime

def calc(inputList):
    return min(inputList)

def main():
    current_key = None
    value_list = []
    key = None
    value = None
    result = None
    for line in sys.stdin:
        try:
            line = line.strip()
            key, value = line.split('\t', 1)

            try:
                value = eval(value)
            except:
                continue
            if current_key == key:
                value_list.append(value)
            else:
                if current_key:
                    try:
                        result = str(calc(value_list))
                    except:
                        pass
                    print '%s\t%s' % (current_key, result)
                value_list = [value]
                current_key = key
        except:
            pass
    print '%s\t%s' % (current_key, str(calc(value_list)))

if __name__ == '__main__':
    main()
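
The same per-key minimum can be written more compactly with itertools.groupby, which groups consecutive lines that share a key. The following is a sketch of an alternative, not the reducer used in the question: the names parse and reduce_min are made up, and values are parsed with int() instead of eval(), which assumes every value is an integer.

```python
# Sketch of an alternative reducer; parse and reduce_min are hypothetical names.
# Assumes input lines are "key<TAB>integer" and already sorted by key.
from itertools import groupby


def parse(lines):
    """Yield (key, int value) pairs, skipping malformed lines."""
    for line in lines:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) != 2:
            continue
        try:
            yield parts[0], int(parts[1])
        except ValueError:
            continue


def reduce_min(lines):
    """Yield (key, minimum value) for each run of equal keys."""
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield key, min(value for _, value in group)


mapped = ["A\t11\n", "A\t8\n", "A\t6\n", "B\t2\n", "B\t2\n"]
result = list(reduce_min(mapped))
```

In an actual reducer script the lines would come from sys.stdin, and each pair would be printed as key, tab, value.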

I tested the mapper and reducer in the shell, and they work for me:

$ cat input | python mapper.py | sort -t$'\t' -k1 | python reducer.py 
A 6 
B 2 
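
The sort step in this pipeline stands in for Hadoop's shuffle, which hands each reducer its input sorted by key. The reducer above only compares each key with the previous one, so it depends on that ordering. A small sketch (the helper name count_key_runs is made up) counts how many separate groups such a reducer would see with and without sorting, using the keys of a few consecutive rows from the sample input.

```python
# Sketch: count runs of consecutive equal keys, which is how many groups
# a "compare with the previous key" reducer would emit.
from itertools import groupby


def count_key_runs(keys):
    """Return the number of runs of consecutive identical keys."""
    return sum(1 for _ in groupby(keys))


# Keys in the order they appear in part of the sample input.
keys = ["A", "A", "B", "A", "B", "A"]

unsorted_runs = count_key_runs(keys)        # one group per run
sorted_runs = count_key_runs(sorted(keys))  # one group per distinct key
```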

But it fails when I run it with Hadoop Streaming:

/usr/bin/hadoop jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar \
-file mapper.py \
-mapper mapper.py \
-file reducer.py \
-reducer reducer.py \
-input /tmp/input \
-output /tmp/output

The error output looks like this:

13/10/07 15:59:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 
13/10/07 15:59:02 INFO mapred.FileInputFormat: Total input paths to process : 1 
13/10/07 15:59:02 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-a59347/mapred/local] 
13/10/07 15:59:02 INFO streaming.StreamJob: Running job: job_201309301959_0089 
13/10/07 15:59:02 INFO streaming.StreamJob: To kill this job, run: 
13/10/07 15:59:02 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=url1:8021 -kill job_201309301959_0089 
13/10/07 15:59:02 INFO streaming.StreamJob: Tracking URL: http://url1:50030/jobdetails.jsp?jobid=job_201309301959_0089 
13/10/07 15:59:03 INFO streaming.StreamJob: map 0% reduce 0% 
13/10/07 15:59:10 INFO streaming.StreamJob: map 50% reduce 0% 
13/10/07 16:00:10 INFO streaming.StreamJob: map 100% reduce 0% 
13/10/07 16:00:26 INFO streaming.StreamJob: map 100% reduce 1% 
13/10/07 16:00:32 INFO streaming.StreamJob: map 100% reduce 2% 
13/10/07 16:00:37 INFO streaming.StreamJob: map 100% reduce 100% 
13/10/07 16:00:37 INFO streaming.StreamJob: To kill this job, run: 
13/10/07 16:00:37 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=url1:8021 -kill job_201309301959_0089 
13/10/07 16:00:37 INFO streaming.StreamJob: Tracking URL: http://url1:50030/jobdetails.jsp?jobid=job_201309301959_0089 
13/10/07 16:00:37 ERROR streaming.StreamJob: Job not successful. Error: NA 
13/10/07 16:00:37 INFO streaming.StreamJob: killJob... 
Streaming Command Failed! 

Any idea what I am doing wrong?

How does it fail? Can you post the output printed to the screen when you issue the '/usr/bin/hadoop jar ...' command? – cabad

@cabad Thanks for the reminder. Is that what you need? –

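Answer

Hadoop does not know how to run your mapper and reducer: the scripts are passed as bare file names, with no interpreter specified. There are two possible fixes: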

Fix 1: Invoke python explicitly.

-mapper "python mapper.py" -reducer "python reducer.py" 

Fix 2: Tell Hadoop where to find the python interpreter. To do this, state the interpreter's location explicitly on the first line of each *.py file. For example:

#!/usr/bin/env python 
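
Fix 2 depends on the shebang being the very first line of each script. When a script is launched directly rather than through python, it generally also has to be executable (for example via chmod +x). As a rough sketch, a check like the following could be run before submitting the job. The function name check_script is made up for illustration and is not part of Hadoop.

```python
# Sketch of a pre-submission check; check_script is a hypothetical helper.
import os


def check_script(path):
    """Return a list of problems that would stop `path` running directly."""
    problems = []
    with open(path, "rb") as handle:
        first_line = handle.readline()
    if not first_line.startswith(b"#!"):
        problems.append("missing shebang on first line")
    if not os.access(path, os.X_OK):
        problems.append("file is not executable")
    return problems
```

Calling check_script("mapper.py") and check_script("reducer.py") returns an empty list for a script that passes both checks.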

Still the same result; it does not work. I did not see this on the Hadoop Streaming wiki page. –

I added more details. Can you check whether Fix 2 works for you? – cabad

Holy cow! It works! –