2017-08-02 23 views
1

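XML parsing in Python with pandas: getting a complete tag block in one row

Hi, I am able to convert my XML file to a pandas DataFrame. The challenge is that the records do not end up in the right rows. Say the XML contains a set of tags that repeats, for example 4 times, and that set has multiple child nodes that should become the columns of my DataFrame. When I read the XML I want exactly 4 rows in the DataFrame, but I get many extra rows filled with NaN because the other tags sit at different levels.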

Edit: I just found some discrepancies in my description of the XML data. The XML shown here is the final, corrected content.

The same <ns1:parenttag> block is repeated multiple times in the XML file:

    <?xml version="1.0" encoding="UTF-8"?> 
    <row:user-agents xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xmlns:row="http://www.row.com" 
    xmlns:ns1="http://www.ns1.com" 
    xmlns:ns2="http://www.ns2.com" 
    xmlns:ns3="http://www.ns3.com" 
    xmlns:row1="http://www.row1.com" 
    xmlns:row3="http://www.row3.com" 
    xmlns:row2="http://www.row2.com" 
    xsi:schemaLocation="http://www.schemaLocation-1.4.xsd"> 

<row:agent1> 
<row:test> 
    <row2:test1> 
    <row2:test2> 
     <row2:test3>9999</row2:test3> 
     <row2:test4>aa</row2:test4> 
     <row2:test5>1</row2:test5> 
    </row2:test2> 
    </row2:test1> 
    <row2:test6>2017</row2:test6> 
</row:test> 
<row:agent2> 
<row3:agent3> 

     <ns1:parenttag> 
      <ns1:childtag1> 
       <ns1:subchildtag1> 
        <ns1:indenticaltag>123</ns1:indenticaltag> 
       </ns1:subchildtag1> 
      </ns1:childtag1> 
      <ns1:indenticaltag>456</ns1:indenticaltag> 
      <ns1:childtag2>N</ns1:childtag2> 
      <ns1:childtag3>0</ns1:childtag3> 
      <ns1:childtag4>N</ns1:childtag4> 
      <ns1:childtag5> 
       <ns2:subchildtag2 attributname="abc"> 
        <ns2:sub_subchildtag1>12 45</ns2:sub_subchildtag1> 
       </ns2:subchildtag2> 
      </ns1:childtag5> 
      <ns1:childtag6>tyu</ns1:childtag6> 
      <ns1:childtag7>2</ns1:childtag7> 
      <ns1:childtag8> poiu</ns1:childtag8> 
      <ns1:childtag9> 
       <ns3:subchildtag3>345</ns3:subchildtag3> 
       <ns3:subchildtag6>567</ns3:subchildtag6> 

      </ns1:childtag9> 
      <ns1:childtag10>N</ns1:childtag10> 
      <ns1:childtag11> 
       <ns3:subchildtag4>34</ns3:subchildtag4> 
       <ns3:subchildtag5>abc/123</ns3:subchildtag5> 
      </ns1:childtag11> 
      <ns1:childtag12> 
       <ns1:indenticaltag>234</ns1:indenticaltag> 
      </ns1:childtag12> 
     </ns1:parenttag> 

</row3:agent3> 
</row:agent2> 
</row:agent1> 
</row:user-agents> 

Another XML file, in which the contents of the parent tag differ:

 <ns1:parenttag> 
      <ns1:indenticaltag>123</ns1:indenticaltag> 
      <ns1:childtag2>N</ns1:childtag2> 
      <ns1:childtag3>0</ns1:childtag3> 
      <ns1:childtag4>N</ns1:childtag4> 
      <ns1:childtag5> 
       <ns2:subchildtag1 attributename0="poi"> 
        <ns2:sub_subchildtag1> 
         <ns2:sub_sub_subchildtag1> 
          <ns2:sub_sub_sub_subchildtag1 attributename1="3" attributename2="17">1234</ns2:sub_sub_sub_subchildtag1> 
         </ns2:sub_sub_subchildtag1> 
        </ns2:sub_subchildtag1> 
       </ns2:subchildtag1> 
      </ns1:childtag5> 
      <ns1:childtag6>12</ns1:childtag6> 
      <ns1:childtag7> qwer</ns1:childtag7> 
      <ns1:childtag8> 
       <ns3:subchildtag2>456</ns3:subchildtag2> 
      </ns1:childtag8> 
      <ns1:childtag9>N</ns1:childtag9> 
      <ns1:childtag10> 
       <ns3:subchildtag3>908</ns3:subchildtag3> 
       <ns3:subchildtag4>abc/123</ns3:subchildtag4> 
      </ns1:childtag10> 
     </ns1:parenttag>   

I am currently using the function suggested by Parfait in the answer below, but I get this error:

ValueError: Length mismatch: Expected axis has 21 elements, new values have 22 elements

    There is also a problem with the indenticaltag column: the tag appears three times under the same name but at different levels of the hierarchy,
    yet the DataFrame contains only one indenticaltag column instead of three, for example
    parent.child.indenticaltag, parent.child.subchild.indenticaltag and parent.child.subchild.sub_subchild.indenticaltag.

The desired output DataFrame is:

I want to parse both XML variants, which differ, with a single function.
    All tags and their attributes should become column names in pandas.
    A column name should look like
    parent.child.subchild.sub_sub_subchildtag, and for attributes it should
    be parent.child.subchild.sub_sub_childtag.attribute

Is there a better way to parse the XML and get the records in the proper format? Or am I missing something?

Edit: the solution works, but some further complications have come up.

I need your help on three points, if you can suggest some pointers:

    1) I need the pandas DataFrame column names in the form root.child.subchild.grandchild. I am not sure how to get that here, although I was able to in my own solution.
    2) The descendant approach is very slow. Is there any way to speed it up?
    3) I have multiple XML files of the same type in one directory and would like to build a single DataFrame by appending the results from all of them. What is the best way to do that?

Answers

1

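Consider using lxml's xpath() on the <xs:top_col> nodes and reading directly from the file with lxml's parse(). The XPath loop iteratively appends to list and dictionary containers, which are then cast to a DataFrame. Also note that your desired output actually misaligns node values: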

import pandas as pd 
from lxml import etree 
import re 

pd.set_option('display.width', 1000) 

NSMAP = {'row': 'http://www.row.com', 
     'row3': 'http://www.row3.com', 
     'row1': 'http://www.row1.com', 
     'xs': 'http://www.xs.com', 
     'row2': 'http://www.row2.com'} 

xmldata = etree.parse('RowAgent.xml')  

data = [] 
inner = {} 
for el in xmldata.xpath('//xs:top_col', namespaces=NSMAP):
    for i in el:                      # PARSE CHILDREN
        inner[i.tag] = i.text
        if len(i) > 0:                # PARSE GRANDCHILDREN (len() COUNTS CHILD ELEMENTS)
            for subi in i:
                inner[subi.tag] = subi.text

    data.append(inner)
    inner = {}

df = pd.DataFrame(data) 

# REGEX TO REMOVE NAMESPACE URIs IN COL NAMES 
df.columns = [re.sub(r'{.*}', '', col) for col in df.columns] 

To parse child elements to an unlimited depth, use XPath's descendant::*

num_top_cols = len(xmldata.xpath('//xs:top_col', namespaces=NSMAP))

data = []                             # START FROM EMPTY CONTAINERS
inner = {}
for i in range(1, num_top_cols+1):
    for el in xmldata.xpath('//xs:top_col[{}]/descendant::*'.format(i), namespaces=NSMAP):
        if el.text and el.text.strip() != '':     # REMOVE EMPTY TEXT TAGS (el.text MAY BE None)
            inner[el.tag] = el.text.strip()

    data.append(inner)
    inner = {}

df = pd.DataFrame(data)

Output

print(df) 
# col11_1  col11_2 col8_1 col8_2  col1  col10 col12 col13_1 col2 col3 col4 col5 col6 col7 col9 
# 0  2010 AB 20/SEC001  2010 2016 00032000 test_name pqr 000330 N 0 3 N I AA N 
# 1 2016026 rty-qwe-01  2000 26000  03985  temp2 perrl 0117203 N 0 3 N a 9AA N 
# 2  8965 147A-254-044  7896 NaN  00985  mjkl rtyyu 45612 N 0 3 N NaN yuio N 
# 3 52369 ui 247/mh45 145ghg7 NaN  78965  ghyuio trwer  9874 N 0 5 N NaN 23rt N 
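
The indexed XPath above evaluates `//xs:top_col[i]` from the document root once per row, which may account for part of the slowness reported in the question. A possible alternative is to locate the repeating nodes once and then walk each node's own subtree. The following is a minimal sketch under stated assumptions, not a measured fix: it uses the standard library's `xml.etree.ElementTree` rather than lxml, and the `SAMPLE` document with its `col…` element names is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {'xs': 'http://www.xs.com'}   # same URI as the 'xs' entry in NSMAP

# Hypothetical sample document; the element names are illustrative only.
SAMPLE = """<root xmlns:xs="http://www.xs.com">
  <xs:top_col>
    <xs:col1>00032000</xs:col1>
    <xs:col8><xs:col8_1>2010</xs:col8_1><xs:col8_2>2016</xs:col8_2></xs:col8>
  </xs:top_col>
  <xs:top_col>
    <xs:col1>03985</xs:col1>
    <xs:col8><xs:col8_1>2000</xs:col8_1></xs:col8>
  </xs:top_col>
</root>"""

root = ET.fromstring(SAMPLE)
rows = []
for top in root.findall('.//xs:top_col', NS):   # repeating nodes are located once
    row = {}
    for el in top.iter():                       # top itself, then all its descendants
        if el is top:
            continue
        text = (el.text or '').strip()          # el.text is None for empty elements
        if text:
            row[el.tag.split('}', 1)[-1]] = text   # drop the '{uri}' prefix
    rows.append(row)

print(rows)
```

Each dictionary in `rows` corresponds to one `<xs:top_col>` block, so passing the list to `pd.DataFrame` should give one row per block, with NaN where a block lacks a tag.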

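Because of the performance problems with descendant::*, consider one recursive call that first walks all descendants, and then a second recursive call that captures the parent/child/grandchild names for the DataFrame columns. Be sure to use an OrderedDict now, so that values stay in document order: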

from collections import OrderedDict 

#... same as above XML setup ... # 

def recursiveParse(curr_elem, curr_inner):
    # RECORD TEXT AND ATTRIBUTES OF EVERY CHILD, THEN DESCEND INTO IT
    for child_elem in curr_elem:              # NO-OP FOR LEAF ELEMENTS
        curr_inner[child_elem.tag] = child_elem.text
        for attrib in child_elem.attrib:
            curr_inner[attrib] = child_elem.attrib[attrib]
        recursiveParse(child_elem, curr_inner)

    return curr_inner

data = []
inner = OrderedDict()
for el in xmldata.xpath('//xs:top_col', namespaces=NSMAP):
    for i in el:
        inner[i.tag] = i.text
        for attrib in i.attrib:
            inner[attrib] = i.attrib[attrib]
        recursiveParse(i, inner)

    data.append(inner)
    inner = OrderedDict()

df = pd.DataFrame(data)

colnames = []
def recursiveNames(curr_elem, curr_inner, prefix):
    # BUILD DOTTED NAMES IN THE SAME ORDER THAT recursiveParse STORES VALUES
    for child_elem in curr_elem:
        name = prefix + '.' + re.sub(r'{.*}', '', child_elem.tag)
        curr_inner.append(name)
        for attrib in child_elem.attrib:
            curr_inner.append(name + '.' + attrib)
        recursiveNames(child_elem, curr_inner, name)

    return curr_inner

# NAMES COME FROM THE FIRST <xs:top_col> ONLY, SO EVERY BLOCK MUST SHARE ITS
# STRUCTURE AND USE UNIQUE TAG NAMES, OR THE LENGTHS WILL NOT MATCH
for el in xmldata.xpath('//xs:top_col[1]', namespaces=NSMAP):
    for i in el:
        tmp = re.sub(r'{.*}', '', i.tag)
        colnames.append(tmp)
        for attrib in i.attrib:
            colnames.append(tmp + '.' + attrib)
        recursiveNames(i, colnames, tmp)

df.columns = colnames

Output

print(df) 
#  col1 col2 col3 col4 col5 col6 col7     col8 col8.col8_1 col8.col8_1.sName col8.col8_2 col9  col10     col11 col11.col11_1 col11.col11_2 col12     col13 col13.col13_1 
# 0 00032000 N 0 3 N I AA \n       2010    pqrst  2016 N test_name \n       2010 AB 20/SEC001 pqr \n       000330 
# 1  03985 N 0 3 N a 9AA \n       2000    NaN  26000 N  temp2 \n       2016026 rty-qwe-01 perrl \n       0117203 
# 2  00985 N 0 3 N NaN yuio \n       7896    NaN   NaN N  mjkl \n       8965 147A-254-044 rtyyu \n       45612 
# 3  78965 N 0 5 N NaN 23rt \n      145ghg7    NaN   NaN N  ghyuio \n       52369 ui 247/mh45 trwer \n       9874 
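
One way to keep identically named tags such as `indenticaltag` apart, and to avoid building values and column names in two separate passes, is to key each value by its full dotted path while parsing. The following is a minimal sketch under stated assumptions, not the method used above: it relies on the standard library's `xml.etree.ElementTree` instead of lxml, the helper names `local_name` and `flatten` are made up for illustration, and `SAMPLE` is a trimmed copy of the question's XML.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the question's <ns1:parenttag> block (wrapper element is hypothetical).
SAMPLE = """<root xmlns:ns1="http://www.ns1.com" xmlns:ns2="http://www.ns2.com">
  <ns1:parenttag>
    <ns1:childtag1>
      <ns1:subchildtag1>
        <ns1:indenticaltag>123</ns1:indenticaltag>
      </ns1:subchildtag1>
    </ns1:childtag1>
    <ns1:indenticaltag>456</ns1:indenticaltag>
    <ns1:childtag5>
      <ns2:subchildtag2 attributname="abc">
        <ns2:sub_subchildtag1>12 45</ns2:sub_subchildtag1>
      </ns2:subchildtag2>
    </ns1:childtag5>
  </ns1:parenttag>
</root>"""

def local_name(tag):
    """Drop the '{namespace-uri}' prefix that ElementTree puts on tags."""
    return re.sub(r'^\{.*?\}', '', tag)

def flatten(elem, prefix, row):
    """Store text and attributes of elem's descendants under dotted paths."""
    for child in elem:
        path = prefix + '.' + local_name(child.tag)
        text = (child.text or '').strip()
        if text:                                  # skip whitespace-only containers
            row[path] = text
        for attr, value in child.attrib.items():
            row[path + '.' + attr] = value
        flatten(child, path, row)
    return row

root = ET.fromstring(SAMPLE)
rows = [flatten(parent, 'parenttag', {})
        for parent in root.iter('{http://www.ns1.com}parenttag')]

for key, value in rows[0].items():
    print(key, '=', value)
```

Because the names and values come from the same dictionary, passing `rows` to `pd.DataFrame` cannot raise a length mismatch; blocks that lack a path simply get NaN in that column. One limitation of this sketch: siblings that share both tag name and parent would still overwrite each other.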

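Finally, integrate this processing and the original XML parsing into one loop that iterates over all XML files in the directory. Be sure to save every DataFrame in a list and then append/stack them with pd.concat().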

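import os
# ... plus the modules imported above ...

dfList = []
for f in os.listdir('/path/to/XML/files'):
    # os.listdir() returns bare file names; join with the directory before parsing
    #...xml parse... (passing in f for file name in parse())
    #...dataframe build with recursive calls...

    dfList.append(df)

finaldf = pd.concat(dfList)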
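
The directory loop can also be sketched end to end. The version below is an illustration under stated assumptions rather than the approach above: it collects plain row dictionaries from every file instead of building one DataFrame per file, it uses the standard library's `xml.etree.ElementTree`, and the `<record>` layout, the `parse_file` helper and the `source_file` column are all hypothetical. It writes two tiny files to a temporary directory so that it runs on its own.

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET

def parse_file(path):
    """Return one dict per <record> element in the file (hypothetical layout)."""
    root = ET.parse(path).getroot()
    return [{child.tag: child.text for child in rec} for rec in root.iter('record')]

# Write two tiny files so the sketch is self-contained.
tmpdir = tempfile.mkdtemp()
for name, value in [('a.xml', '1'), ('b.xml', '2')]:
    with open(os.path.join(tmpdir, name), 'w', encoding='utf-8') as fh:
        fh.write('<root><record><col1>%s</col1></record></root>' % value)

all_rows = []
# glob returns full paths and filters on the extension; sorted() fixes the order
for path in sorted(glob.glob(os.path.join(tmpdir, '*.xml'))):
    for row in parse_file(path):
        row['source_file'] = os.path.basename(path)   # remember each row's origin
        all_rows.append(row)

print(all_rows)
```

Passing `all_rows` to `pd.DataFrame` once should give the same rows as concatenating per-file frames. If per-file frames are kept instead, `pd.concat(dfList, ignore_index=True)` avoids repeated index values in the result.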
+0

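Far better than mine, thanks a lot! One question: what if we have deeper levels of children in the hierarchy? Is there a standard way to traverse all the sub-children? – user07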

+0

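Good question. See the updated extension using XPath's `descendant::*`, which iterates over each `<xs:top_col>` by its node index and parses all of its descendants. – Parfait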

+0

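How big is your XML? More than 1 GB? And how slow is it for you? Also, attributes and text are very different things. Your sample XML did not contain attributes, nor did you attempt to parse them. Always post a realistic example of your **actual** data. – Parfait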

0

Hi, I have found an answer to the question above. I am posting it here so that it can help others:

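    import pandas as pd
    import xml.etree.ElementTree as et   # assumed import; the post does not show it

    # Assumed prefix-to-URI map (URIs as in the answer above); the post does not define it
    namespaces = {'row': 'http://www.row.com',
                  'row3': 'http://www.row3.com',
                  'xs': 'http://www.xs.com'}

    def xml2df(xml_file):
        tree = et.parse(xml_file)        # parse() takes a file name, not the file's bytes
        all_records = []
        # one result row per <xs:top_col>
        for el in tree.iterfind("./row:agent1/row:agent2/row3:agent3/xs:top_col", namespaces):
            result = {}
            for i in el:
                # strip the '{uri}' namespace prefix from the child tag
                i.tag = i.tag.split('}', 1)[-1]
                result[i.tag] = i.text
                for s in i:
                    # name grandchildren as child.grandchild
                    s.tag = i.tag + "." + s.tag.split('}', 1)[-1]
                    result[s.tag] = s.text
            all_records.append(result)

        df = pd.DataFrame(all_records)   # the post passed an undefined name here
        return df

    df1 = xml2df('test.xml')
    df1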