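I am using Hadoop to process an XML file, so I wrote the mapper and reducer files in Python. How can I save data from Hadoop to a database using Python?
Suppose the input to be processed is test.xml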
<report>
<report-name name="ALL_TIME_KEYWORDS_PERFORMANCE_REPORT"/>
<date-range date="All Time"/>
<table>
<columns>
<column name="campaignID" display="Campaign ID"/>
<column name="adGroupID" display="Ad group ID"/>
</columns>
<row campaignID="79057390" adGroupID="3451305670"/>
<row campaignID="79057390" adGroupID="3451305670"/>
</table>
</report>
The mapper.py file
import sys
import cStringIO
import xml.etree.ElementTree as xml

if __name__ == '__main__':
    buff = None
    intext = False
    for line in sys.stdin:
        line = line.strip()
        if line.find("<row") != -1:
            .............
            .............
            .............
            print '%s\t%s'%(campaignID,adGroupID)
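The elided lines in mapper.py are not shown, so the following is only a sketch of one way such a mapper could pull out the two attributes. It assumes each `<row .../>` element sits on a single line, as in test.xml. It is written for Python 3 (unlike the Python 2 mapper.py above), and `parse_row` and `main` are hypothetical names, not part of the original script.

```python
import sys
import xml.etree.ElementTree as ET


def parse_row(line):
    """Return (campaignID, adGroupID) for a single-line <row .../> element, else None."""
    line = line.strip()
    if not line.startswith("<row"):
        return None
    try:
        element = ET.fromstring(line)
    except ET.ParseError:
        # A row that spans several lines cannot be parsed on its own.
        return None
    return element.get("campaignID"), element.get("adGroupID")


def main(stream=sys.stdin, out=sys.stdout):
    """Emit one tab-separated line per row that carries both attributes."""
    for line in stream:
        fields = parse_row(line)
        if fields is not None and None not in fields:
            out.write("%s\t%s\n" % fields)
```

In a streaming job, the script would call `main()` under the usual `if __name__ == '__main__':` guard so that it reads from standard input.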
The reducer.py file
import sys

if __name__ == '__main__':
    for line in sys.stdin:
        print line.strip()
I ran Hadoop with the following command
bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar
-file /path/to/mapper.py file -mapper /path/to/mapper.py file
-file /path/to/reducer.py file -reducer /path/to/reducer.py file
-input /path/to/input_file/test.xml
-output /path/to/output_folder/to/store/file
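When I run the above command, Hadoop correctly creates an output file at the output path, in the format specified in the reducer.py file, with the required data.
What I want now is different: I don't want to store the output data in the text file that Hadoop creates by default when I run the above command. Instead, I want to save the data to a MySQL database.
So I wrote some Python code in the reducer.py file that writes the data directly to the MySQL database, and tried to run the command without the output path, as follows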
bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar
-file /path/to/mapper.py file -mapper /path/to/mapper.py file
-file /path/to/reducer.py file -reducer /path/to/reducer.py file
-input /path/to/input_file/test.xml
After running the above command, I get an error like the one below
12/11/08 15:20:49 ERROR streaming.StreamJob: Missing required option: output
Usage: $HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-combiner <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
-inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
-outputformat TextOutputFormat(default)|JavaClassName Optional.
.........................
.........................
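My questions are:
- How do I save the data in the database after processing the files?
- In which file (mapper.py or reducer.py) should the code that writes the data to the database go?
- Which command should be used to run Hadoop so that the data is saved to the database? When I remove the output folder path from the hadoop command, it shows an error.
Can anyone please help me solve the above problem?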
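The usage text in the error lists `-output` as a required option, so it cannot simply be dropped. One possible approach, offered as an assumption rather than a confirmed solution, is to keep passing an output directory (which then only holds whatever the reducer prints) while reducer.py performs the database write itself. The sketch below just assembles such a command in Python 3; `build_streaming_command` is a hypothetical helper, and the jar path is the one quoted in the question.

```python
def build_streaming_command(mapper, reducer, input_path, output_path,
                            streaming_jar="contrib/streaming/hadoop-streaming-1.0.4.jar"):
    """Return the argument list for a Hadoop streaming job.

    output_path is always included because the streaming tool rejects
    a job that has no -output option.
    """
    return [
        "bin/hadoop", "jar", streaming_jar,
        "-file", mapper, "-mapper", mapper,
        "-file", reducer, "-reducer", reducer,
        "-input", input_path,
        "-output", output_path,
    ]
```

The resulting list could be passed to `subprocess.call` from the Hadoop installation directory. The output directory must not already exist when the job starts, so a fresh path is needed for each run.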
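Edit
Processing steps followed:
I created the mapper and reducer files above, which read the XML file and, via the hadoop command, create a text file in a folder.
Example: the text file (the result of processing the XML file with the hadoop command) is located at
/home/local/user/Hadoop/xml_processing/xml_output/part-00000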
Here the size of the XML file is 1.3 GB, and the text file is the one produced after processing it with Hadoop.
What I want to do now is read the text file at the above path and save the data to the MySQL database as fast as possible.
I have tried this with basic Python, but it takes about 350 sec to process the text file and save it to the MySQL database.
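The Python code that took 350 sec is not shown, so the cause of the slowness is unknown. A frequent cause with row-by-row loaders is issuing and committing one INSERT per line. The sketch below (Python 3) batches rows with the DB-API `executemany` call and commits once at the end; whether it would help here is untested. It uses `sqlite3` purely as a stand-in so it runs without a server. With a MySQL driver such as MySQLdb or PyMySQL the placeholders would be `%s` instead of `?`. The two-column layout of `PerformaceReport` and the names `parse_lines` and `load_rows` are assumptions.

```python
import sqlite3  # stand-in for a MySQL driver so the sketch runs anywhere


def parse_lines(lines):
    """Yield (campaignID, adGroupID) tuples from tab-separated lines."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            yield tuple(parts)


def load_rows(conn, lines, insert_sql, batch_size=1000):
    """Insert rows in batches with executemany and commit once at the end."""
    cursor = conn.cursor()
    batch = []
    total = 0
    for row in parse_lines(lines):
        batch.append(row)
        if len(batch) >= batch_size:
            cursor.executemany(insert_sql, batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        cursor.executemany(insert_sql, batch)
        total += len(batch)
    conn.commit()
    return total


# Small in-memory demonstration with an assumed two-column table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PerformaceReport (campaignID TEXT, adGroupID TEXT)")
sample = ["79057390\t3451305670\n"] * 3
inserted = load_rows(conn, sample,
                     "INSERT INTO PerformaceReport VALUES (?, ?)",
                     batch_size=2)
```

For the real file, `lines` could be the open file object for part-00000, which is read lazily, so the whole file never has to fit in memory.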
Now, as suggested, I downloaded sqoop and extracted it to a path like the one below
/home/local/user/sqoop-1.4.2.bin__hadoop-0.20
Then I went into the bin folder, typed ./sqoop, and received the error below
sh-4.2$ ./sqoop
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
Try 'sqoop help' for usage.
I have also tried the following
./sqoop export --connect jdbc:mysql://localhost/Xml_Data --username root --table PerformaceReport --export-dir /home/local/user/Hadoop/xml_processing/xml_output/part-00000 --input-fields-terminated-by '\t'
Result
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
12/11/27 11:54:57 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
12/11/27 11:54:57 INFO tool.CodeGenTool: Beginning code generation
12/11/27 11:54:57 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:636)
at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:525)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:548)
at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:191)
at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:175)
at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:262)
at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1235)
at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1060)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:82)
at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:64)
at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:97)
at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)
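Is the above sqoop command suitable for reading the text file and saving it to the database? We need to process the text file and insert its contents into the database.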
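The `Could not load db driver class: com.mysql.jdbc.Driver` error in the trace generally means that the MySQL JDBC driver (Connector/J) jar is not on Sqoop's classpath. The usual remedy is to place that jar in Sqoop's `lib` directory. The Python 3 sketch below only checks whether such a jar is present. The `mysql-connector-*.jar` file name pattern and the helper name `find_mysql_connector` are assumptions.

```python
import glob
import os


def find_mysql_connector(sqoop_home):
    """Return sorted paths of MySQL Connector/J jars found in sqoop_home/lib."""
    pattern = os.path.join(sqoop_home, "lib", "mysql-connector-*.jar")
    return sorted(glob.glob(pattern))
```

For the installation in the question, the call would be `find_mysql_connector("/home/local/user/sqoop-1.4.2.bin__hadoop-0.20")`. An empty list would be consistent with the error above. Separately, `--export-dir` in `sqoop export` is read as an HDFS path, so the part-00000 file would need to be in HDFS rather than only on the local disk.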
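OK, I have started using sqoop and tried to install it. Please take a look at http://stackoverflow.com/questions/13411525/sqoop-installation-error-on-fedora-15
Actually, what I want here is to process the XML file and store its data in the database (the reverse process, in fact). Can you give me a basic example of processing an XML file with sqoop and saving the data to the database?
I assume you were able to set up sqoop; if not, I will help you. You should use sqoop to send the HDFS output file directly to MySQL. Please reply to say whether this helps.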