2015-09-29 32 views

XML parsing with Hive

I am using the Hive XML SerDe (https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources) to parse XML and load it into Hive. Sample XML content:

<records> 
<record customer_id="0000-JTALA"> 
<income>200000</income> 
<address type="M"> 
<Flatno>345</Flatno>
<Street>ABS</Street>
<city>QWW</city> 
<country>US</country> 
<pin>3235</pin> 
</address> 
<address type="B"> 
<Street>ABS</Street>
<city>QWW</city> 
<country>US</country> 
<pin>3235</pin> 
</address>  
</record> 

<record customer_id="0001-JTALA"> 
<income>200000</income> 
<address type="M"> 
<Flatno>45</Flatno>
<Street>fgBS</Street>
<city>QWW</city> 
<country>US</country> 
<pin>3235</pin> 
</address> 
<address type="B"> 
<Street>ABS</Street>
<city>QWW</city> 
<country>US</country> 
<pin>325</pin> 
</address> 
<address type="P"> 
<Street>ABS</Street>
<city>QWW</city> 
<country>UK</country> 
<pin>325</pin> 
</address> 
</record> 
</records> 

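One row should be created for each address. For the sample above, the first customer should produce 2 records and the second customer 3 records, for 5 records in total. With my current code, only two records are created, one per customer, and in the address columns all addresses are concatenated together: for the first customer, the Street column holds the first address's street plus the second address's street. Sample query: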

CREATE external TABLE msg_details(customer_id STRING, income BIGINT, address_type STRING, Flatno STRING, Street STRING, city STRING, country STRING, pin STRING)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe' 
WITH SERDEPROPERTIES (
"column.xpath.customer_id"="/record/@customer_id", 
"column.xpath.income"="/record/income/text()", 
"column.xpath.address_type"="/record/address/@type", 
"column.xpath.Flatno"="/record/address/Flatno/text()", 
"column.xpath.Street"="/record/address/Street/text()", 
"column.xpath.city"="/record/address/city/text()", 
"column.xpath.country"="/record/address/country/text()",
"column.xpath.pin"="/record/address/pin/text()" 
) 
STORED AS 
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' 
location '/user/root/serdeinput' 
TBLPROPERTIES (
"xmlinput.start"="<record customer", 
"xmlinput.end"="</record>" 
); 

Can anyone help me! – bhargavi

Answers


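One approach is to write a user-defined (custom) SerDe for parsing the XML.
[or] Write a UDF that splits the Array values belonging to the same column into rows.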
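To make the second option more concrete, here is a minimal Python sketch of the row-splitting logic such a UDF would need. It is not Hive code: the function name `split_arrays_to_rows` and the dict-based row layout are hypothetical, chosen only for illustration. It assumes every array column carries exactly one entry per address; when that does not hold (for example, if an element such as Flatno is missing from some addresses), positions can no longer be matched up, so the sketch raises an error instead of guessing.

```python
def split_arrays_to_rows(row, array_cols):
    """Turn one row holding parallel arrays into one row per array position.

    `row` is a dict of column name -> value; `array_cols` names the columns
    whose values are lists that should be split position by position.
    """
    lengths = {col: len(row[col]) for col in array_cols}
    if len(set(lengths.values())) > 1:
        # Arrays of different lengths cannot be aligned safely.
        raise ValueError(f"array columns are not aligned: {lengths}")

    n = next(iter(lengths.values()), 0)
    # Scalar columns (customer_id, income, ...) are repeated on every output row.
    scalars = {k: v for k, v in row.items() if k not in array_cols}
    return [
        {**scalars, **{col: row[col][i] for col in array_cols}}
        for i in range(n)
    ]


if __name__ == "__main__":
    customer = {
        "customer_id": "0000-JTALA",
        "income": 200000,
        "address_type": ["M", "B"],
        "Street": ["ABS", "ABS"],
    }
    for out_row in split_arrays_to_rows(customer, ["address_type", "Street"]):
        print(out_row)
```

A real UDF would do the same thing inside Hive; the length check is the important design choice, because silently pairing values from arrays of different lengths would attach them to the wrong address.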

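The SerDe you are using is generic. It is almost equivalent to the xpath functions that Hive itself provides, and both have limited ability to extract records.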

I tried 3 other approaches using lateral view and other methods, but they did not work for all of the columns under the address element.

The only solution is to go with a custom SerDe that parses the XML according to your requirements.
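As a rough illustration of the parsing a custom SerDe would have to perform, the following Python sketch walks each record and emits one row per address. It is only a stand-in for the logic, not a Hive SerDe: the function name `rows_per_address` and the dict row layout are hypothetical, and it assumes the input is well-formed XML whose opening and closing tags match in case.

```python
import xml.etree.ElementTree as ET

# Child elements of <address> that become columns; a missing one yields None.
ADDRESS_FIELDS = ("Flatno", "Street", "city", "country", "pin")


def rows_per_address(xml_text):
    """Return one dict per <address>, repeating the customer-level columns."""
    root = ET.fromstring(xml_text)
    rows = []
    # iter() also matches the root itself, so a lone <record> works too.
    for record in root.iter("record"):
        income_text = record.findtext("income")
        income = int(income_text) if income_text else None
        for address in record.findall("address"):
            row = {
                "customer_id": record.get("customer_id"),
                "income": income,
                "address_type": address.get("type"),
            }
            for field in ADDRESS_FIELDS:
                row[field] = address.findtext(field)
            rows.append(row)
    return rows


if __name__ == "__main__":
    sample = """
    <records>
      <record customer_id="0000-JTALA">
        <income>200000</income>
        <address type="M">
          <Flatno>345</Flatno><Street>ABS</Street><city>QWW</city>
          <country>US</country><pin>3235</pin>
        </address>
        <address type="B">
          <Street>ABS</Street><city>QWW</city>
          <country>US</country><pin>3235</pin>
        </address>
      </record>
    </records>
    """
    for row in rows_per_address(sample):
        print(row)
```

Because each address is read as a unit, a missing element such as Flatno simply becomes an empty value on that address's row instead of shifting the values of the other addresses.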

create external table msg_details3(customer_id string, income bigint, address_type Array<string>,Flatno Array<string>, Street ARRAY<string>,city ARRAY<string>,country ARRAY<string>,pin ARRAY<string>) 
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe' 
WITH SERDEPROPERTIES (
"column.xpath.customer_id"="/record/@customer_id", 
"column.xpath.income"="/record/income/text()", 
"column.xpath.address_type"="/record/address/@type", 
"column.xpath.Flatno"="/record/address/Flatno/text()", 
"column.xpath.Street"="/record/address/Street/text()", 
"column.xpath.city"="/record/address/city/text()", 
"column.xpath.country"="/record/address/country/text()", 
"column.xpath.pin"="/record/address/pin/text()" 
) 
STORED AS 
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' 
location '/user/cloudera/data' 
TBLPROPERTIES (
"xmlinput.start"="<record ", 
"xmlinput.end"="</record>" 
);