
Amazon EMR job fails: Shut down as step failed

I am writing a simple streaming map-reduce job in Python to run on Amazon EMR. It is basically an aggregator of user records: it groups together the entries for each user ID.

Mapper:

#!/usr/bin/env python
import sys

def main(argv):
    # Emit "user_id<TAB>(value1, value2)" for each whitespace-separated input line.
    # (The original readline loop with except "end of file" is not valid Python;
    # iterating sys.stdin ends naturally at EOF.)
    for line in sys.stdin:
        elements = line.rstrip().split()
        if len(elements) < 3:
            continue  # skip blank or malformed lines instead of raising IndexError
        print '%s\t%s' % (elements[0], (elements[1], elements[2]))

if __name__ == '__main__':
    main(sys.argv)
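For illustration, given a hypothetical input record u42 click 1356998400 (the actual field layout is not shown in the question), the mapper emits the tab-separated pair:

u42	('click', '1356998400')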

Reducer:

#!/usr/bin/env python
import sys

def main(argv):
    # Collect every value seen for each user id.
    users = dict()
    for line in sys.stdin:
        elements = line.rstrip('\n').split('\t', 1)
        if len(elements) < 2:
            continue  # skip malformed lines
        if elements[0] in users:
            users[elements[0]].append(elements[1])
        else:
            # Store a one-element list so append() works on later records;
            # storing the bare string (as the original did) makes the next
            # append() raise AttributeError and kills the task.
            users[elements[0]] = [elements[1]]

    for user in users:
        print '%s\t%s' % (user, users[user])

if __name__ == '__main__':
    main(sys.argv)
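Before uploading, the pair can be smoke-tested by reproducing the Hadoop streaming contract locally: mapper, then a sort on the keys, then the reducer. A minimal sketch, assuming both scripts are executable and a local input.txt exists (both names are placeholders for illustration):

import subprocess

# mapper -> sort -> reducer, mirroring what Hadoop streaming does between tasks
with open('input.txt') as src, open('a.out', 'w') as dst:
    mapper = subprocess.Popen(['./mapper.py'], stdin=src, stdout=subprocess.PIPE)
    sorter = subprocess.Popen(['sort'], stdin=mapper.stdout, stdout=subprocess.PIPE)
    reducer = subprocess.Popen(['./reducer.py'], stdin=sorter.stdout, stdout=dst)
    mapper.stdout.close()  # drop the parent's copy so sort sees EOF when the mapper exits
    sorter.stdout.close()
    reducer.wait()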

The job should run on a directory of five text files. The EMR job's parameters are:

Input: [bucket name]/[input folder name]

Output: [bucket name]/output

Mapper: [bucket name]/mapper.py

Reducer: [bucket name]/reducer.py
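For reference, a minimal sketch of what the same streaming step looks like when defined programmatically with boto 2, the Python AWS library current at the time. This is an assumption for illustration only, not how the question's job was actually submitted; the bucket and folder names are the same placeholders as above:

from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

conn = EmrConnection()  # reads AWS credentials from the environment

step = StreamingStep(
    name='User record aggregator',
    mapper='s3n://[bucket name]/mapper.py',
    reducer='s3n://[bucket name]/reducer.py',
    input='s3n://[bucket name]/[input folder name]',
    output='s3n://[bucket name]/output',
)

jobid = conn.run_jobflow(name='Aggregator job', steps=[step])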

The job keeps failing with the reason: Shut down as step failed. Here is the log:

2013-01-01 12:06:16,270 INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: null
2013-01-01 12:06:16,271 INFO org.apache.hadoop.mapred.JobClient (main): Setting default number of map tasks based on cluster size to : 8
2013-01-01 12:06:16,271 INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 3
2013-01-01 12:06:18,392 INFO org.apache.hadoop.security.ShellBasedUnixGroupsMapping (main): add hadoop to shell userGroupsCache
2013-01-01 12:06:18,393 INFO org.apache.hadoop.mapred.JobClient (main): Setting group to hadoop
2013-01-01 12:06:18,647 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2013-01-01 12:06:18,670 WARN com.hadoop.compression.lzo.LzoCodec (main): Could not find build properties file with revision hash
2013-01-01 12:06:18,670 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
2013-01-01 12:06:18,695 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library is available
2013-01-01 12:06:18,695 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library loaded
2013-01-01 12:06:19,050 INFO org.apache.hadoop.mapred.FileInputFormat (main): Total input paths to process : 5
2013-01-01 12:06:20,688 INFO org.apache.hadoop.streaming.StreamJob (main): getLocalDirs(): [/mnt/var/lib/hadoop/mapred]
2013-01-01 12:06:20,688 INFO org.apache.hadoop.streaming.StreamJob (main): Running job: job_201301011204_0001
2013-01-01 12:06:20,688 INFO org.apache.hadoop.streaming.StreamJob (main): To kill this job, run:
2013-01-01 12:06:20,688 INFO org.apache.hadoop.streaming.StreamJob (main): /home/hadoop/bin/hadoop job -Dmapred.job.tracker=10.255.131.225:9001 -kill job_201301011204_0001
2013-01-01 12:06:20,689 INFO org.apache.hadoop.streaming.StreamJob (main): Tracking URL: http://domU-12-31-39-01-7C-13.compute-1.internal:9100/jobdetails.jsp?jobid=job_201301011204_0001
2013-01-01 12:06:21,696 INFO org.apache.hadoop.streaming.StreamJob (main): map 0% reduce 0%
2013-01-01 12:08:02,238 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 100%
2013-01-01 12:08:02,239 INFO org.apache.hadoop.streaming.StreamJob (main): To kill this job, run:
2013-01-01 12:08:02,240 INFO org.apache.hadoop.streaming.StreamJob (main): /home/hadoop/bin/hadoop job -Dmapred.job.tracker=10.255.131.225:9001 -kill job_201301011204_0001
2013-01-01 12:08:02,240 INFO org.apache.hadoop.streaming.StreamJob (main): Tracking URL: http://domU-12-31-39-01-7C-13.compute-1.internal:9100/jobdetails.jsp?jobid=job_201301011204_0001
2013-01-01 12:08:02,240 ERROR org.apache.hadoop.streaming.StreamJob (main): Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201301011204_0001_m_000002
2013-01-01 12:08:02,240 INFO org.apache.hadoop.streaming.StreamJob (main): killJob...

What am I doing wrong?


Did you try the code on your input: cat input.txt | ./mapper.py | sort | ./reducer.py > a.out – Guy


Yes, I did. It works fine locally. – user1261046

Answer


Solved. Make sure there are no comments at the beginning of the Python script, and that the first line of the script is #!/usr/bin/env python
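For illustration, a valid script start looks like this; only the placement of the interpreter line matters here, and the body is a trivial stand-in:

#!/usr/bin/env python
# Comments are fine from the second line onward; the shebang must be the
# very first bytes of the file so Hadoop streaming can execute the script.
import sys

for line in sys.stdin:
    sys.stdout.write(line)  # identity mapper, for illustration only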