2013-03-22

I'm trying to process XML with Hadoop Streaming. I run:

bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -inputreader "StreamXmlRecordReader, begin=<metaData>,end=</metaData>" -input /user/root/xmlpytext/metaData.xml -mapper /Users/amrita/desktop/hadoop/pythonpractise/mapperxml.py -file /Users/amrita/desktop/hadoop/pythonpractise/mapperxml.py -reducer /Users/amrita/desktop/hadoop/pythonpractise/reducerxml.py -file /Users/amrita/desktop/hadoop/pythonpractise/mapperxml.py -output /user/root/xmlpytext-output1 -numReduceTasks 1 

but it fails with:

13/03/22 09:38:48 INFO mapred.FileInputFormat: Total input paths to process : 1 
13/03/22 09:38:49 INFO streaming.StreamJob: getLocalDirs(): [/Users/amrita/desktop/hadoop/temp/mapred/local] 
13/03/22 09:38:49 INFO streaming.StreamJob: Running job: job_201303220919_0001 
13/03/22 09:38:49 INFO streaming.StreamJob: To kill this job, run: 
13/03/22 09:38:49 INFO streaming.StreamJob: /private/var/root/hadoop-1.0.4/libexec/../bin/hadoop job -Dmapred.job.tracker=-kill job_201303220919_0001 
13/03/22 09:38:49 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201303220919_0001 
13/03/22 09:38:50 INFO streaming.StreamJob: map 0% reduce 0% 
13/03/22 09:39:26 INFO streaming.StreamJob: map 100% reduce 100% 
13/03/22 09:39:26 INFO streaming.StreamJob: To kill this job, run: 
13/03/22 09:39:26 INFO streaming.StreamJob: /private/var/root/hadoop-1.0.4/libexec/../bin/hadoop job -Dmapred.job.tracker=-kill job_201303220919_0001 
13/03/22 09:39:26 INFO streaming.StreamJob: Tracking URL: http:///jobdetails.jsp?jobid=job_201303220919_0001 
13/03/22 09:39:26 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201303220919_0001_m_000000 
13/03/22 09:39:26 INFO streaming.StreamJob: killJob... 
Streaming Command Failed! 

When I follow the link to jobdetails.jsp, it shows:

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException 
    at org.apache.hadoop.streaming.StreamInputFormat.getRecordReader(StreamInputFormat.java:77) 
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:197) 
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418) 
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) 
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:396) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) 
    at org.apache.hadoop.mapred.Child.main(Child.java:249) 
Caused by: java.lang.reflect.InvocationTargetException 
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) 
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) 
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) 
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513) 
    at org.apache.hadoop.streaming.StreamInputFormat.getRecordReader(StreamInputFormat.java:74) 
    ... 8 more 
Caused by: java.io.IOException: JobConf: missing required property: stream.recordreader.begin 
    at org.apache.hadoop.streaming.StreamXmlRecordReader.checkJobGet(StreamXmlRecordReader.java:278) 
    at org.apache.hadoop.streaming.StreamXmlRecordReader.<init>(StreamXmlRecordReader.java:52) 
    ... 13 more 

My mapper:

#!/usr/bin/env python
import sys
import cStringIO
import xml.etree.ElementTree as xml

def cleanResult(element):
    result = None
    if element is not None:
        result = element.text
        result = result.strip()
    else:
        result = ""
    return result

def process(val):
    root = xml.fromstring(val)
    sceneID = cleanResult(root.find('sceneID'))
    cc = cleanResult(root.find('cloudCover'))
    returnval = ("%s,%s") % (sceneID, cc)
    return returnval.strip()

if __name__ == '__main__':
    buff = None
    intext = False
    for line in sys.stdin:
        line = line.strip()
        if line.find("<metaData>") != -1:
            intext = True
            buff = cStringIO.StringIO()
            buff.write(line)
        elif line.find("</metaData>") != -1:
            intext = False
            buff.write(line)
            val = buff.getvalue()
            buff.close()
            buff = None
            print process(val)
        else:
            if intext:
                buff.write(line)
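Since the stack trace shows the job dying inside the record reader before the mapper ever runs, it helps to verify the mapper's parsing logic on its own first. Below is a minimal sketch of that check; the record and its sceneID/cloudCover values are made up for illustration:

```python
# Run one hypothetical <metaData> record through the mapper's parsing logic.
import xml.etree.ElementTree as xml

def cleanResult(element):
    # Return the element's stripped text, or "" if the element is missing.
    if element is not None and element.text is not None:
        return element.text.strip()
    return ""

def process(val):
    # Extract sceneID and cloudCover from one XML record, as the mapper does.
    root = xml.fromstring(val)
    sceneID = cleanResult(root.find('sceneID'))
    cc = cleanResult(root.find('cloudCover'))
    return ("%s,%s" % (sceneID, cc)).strip()

# Hypothetical record for illustration only.
sample = "<metaData><sceneID>LE70010002013001</sceneID><cloudCover>23</cloudCover></metaData>"
print(process(sample))  # -> LE70010002013001,23
```

If this prints the expected comma-separated pair, the mapper logic itself is sound and the problem lies in the job configuration.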

and my reducer:

#!/usr/bin/env python
import sys

if __name__ == '__main__':
    for line in sys.stdin:
        print line.strip()

Can anyone tell me why this happens? I'm using hadoop-1.0.4 on a Mac. Is something wrong? Should I change anything? Please help.

Answers


Try setting the missing configuration variables as follows (add the stream.recordreader. prefix, make sure they are the first arguments after the jar, and wrap them in double quotes):

bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \ 
    "-Dstream.recordreader.begin=<metaData>" \ 
    "-Dstream.recordreader.end=</metaData>" \ 
    -inputreader "StreamXmlRecordReader" \ 
    -input /user/root/xmlpytext/metaData.xml \ 
    -mapper /Users/amrita/desktop/hadoop/pythonpractise/mapperxml.py \ 
    -file /Users/amrita/desktop/hadoop/pythonpractise/mapperxml.py \ 
    -reducer /Users/amrita/desktop/hadoop/pythonpractise/reducerxml.py \ 
    -file /Users/amrita/desktop/hadoop/pythonpractise/mapperxml.py \ 
    -output /user/root/xmlpytext-output1 \ 
    -numReduceTasks 1 
Yes, the previous error is gone, but now it shows "java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1" – 2013-03-23 03:47:11

Then I'd say there is a problem with your mapper or reducer Python - try piping a line of input through them on the command line – 2013-03-23 12:07:58

The mapper and reducer work on the local file system – 2013-03-23 15:56:51


Remove the space between the comma and begin, i.e. change ", begin=<" to ",begin=<".

The correct format is:

hadoop jar hadoop-streaming.jar -inputreader 
"StreamXmlRecordReader,begin=BEGIN_STRING,end=END_STRING" ..... (rest of the command) 

This is due to the following lines of code in org.apache.hadoop.streaming.StreamJob:

for (int i = 1; i < args.length; i++) { 
    String[] nv = args[i].split("=", 2); 
    String k = "stream.recordreader." + nv[0]; 
    String v = (nv.length > 1) ? nv[1] : ""; 
    jobConf_.set(k, v); 
} 
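The effect of that loop can be mimicked in a few lines of Python. This is only a sketch of the parsing, not Hadoop's actual code: every name=value token after the reader class name gets the stream.recordreader. prefix, so a stray space after the comma produces the key "stream.recordreader. begin", which never matches the required stream.recordreader.begin:

```python
# Sketch of how the -inputreader spec becomes JobConf properties (not Hadoop's real code).
def record_reader_props(spec):
    parts = spec.split(",")
    conf = {}
    for part in parts[1:]:  # parts[0] is the reader class name
        nv = part.split("=", 1)
        k = "stream.recordreader." + nv[0]  # note: nv[0] keeps any leading space
        v = nv[1] if len(nv) > 1 else ""
        conf[k] = v
    return conf

# With the stray space, the "begin" property name comes out wrong:
print(record_reader_props("StreamXmlRecordReader, begin=<metaData>,end=</metaData>"))

# Without it, the required property names appear:
print(record_reader_props("StreamXmlRecordReader,begin=<metaData>,end=</metaData>"))
```

In the first call the dictionary contains "stream.recordreader. begin" (with the embedded space), which is exactly why the job fails with "missing required property: stream.recordreader.begin".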
Hiya, welcome to Stack Overflow! It's great that you've solved it. You could also add a (very brief) explanation of why the solution works. I know it was just a typo, but it would be useful for any newcomers who stumble across this question and answer and want to know the cause of the error. ;) – 2014-03-27 22:45:40