2016-08-24 61 views
0

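Parsing hierarchical XML tags

I need to parse hierarchical tags from an XML file and print the tag values in the output format shown below.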

Input

<doc> 
<pid id="231"> 
    <label key="">Electronics</label> 
     <desc/> 
     <cid id="122"> 
     <label key="">TV</label> 
     </cid> 
     <desc/> 
     <cid id="123"> 
     <label key="">Computers</label> 
     <cid id="12433"> 
      <label key="">Lenovo</label> 
      </cid> 
      <desc/> 
      <cid id="12434"> 
      <label key="">IBM</label> 
      <desc/> 
      </cid> 
      <cid id="12435"> 
      <label key="">Mac</label> 
      </cid> 
      <desc/> 
    </cid> 
</pid> 
<pid id="7764"> 
    <label key="">Music</label> 
    <desc/> 
     <cid id="1224"> 
     <label key="">Play</label> 
     <desc/> 
      <cid id="341"> 
      <label key="">PQR</label> 
      </cid> 
      <desc/> 
     </cid> 
     <cid id="221"> 
     <label key="">iTunes</label> 
      <cid id="341"> 
      <label key="">XYZ</label> 
      </cid> 
      <desc/> 
      <cid id="515"> 
      <label key="">ABC</label> 
      </cid> 
      <desc/> 
     </cid> 
</pid> 
</doc> 

Output

Electronics/ 
Electronics/TV 
Electronics/Computers/Lenovo 
Electronics/Computers/IBM 
Electronics/Computers/Mac 
Music/ 
Music/Play/PQR 
Music/iTunes/XYZ 
Music/iTunes/ABC 

What I have tried (in Python):

import xml.etree.ElementTree as ET 

def perf_func(elem, func, level=0): 
    # Apply func to elem, then recurse into each child one level deeper. 
    func(elem, level) 
    for child in elem:  # iterating an element yields its direct children 
        perf_func(child, func, level + 1) 

def print_level(elem, level): 
    print('-' * level + elem.tag) 

tree = ET.parse('Products.xml') 
root = tree.getroot() 
perf_func(root, print_level) 

# Added find logic 
# root is the <doc> element, so the first-level nodes are its <pid> children 
for n in root.findall('pid'): 
    l = n.find('label').text 
    print(l) 

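With the code above I can get the nodes and their levels (that is, the tag names but not their values), as well as all the first-level labels. I need some suggestions (Perl/Python) on how to proceed to get the hierarchical structure in the format shown in the output.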


Take a look at etree's `find` and `findall` functions, which take an XPath expression – FujiApple


Added the find logic (edited the question to show what I tried)... I need some suggestions on how to get the output – Debaditya
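To illustrate the `find`/`findall` suggestion from the comments, here is a minimal sketch using the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath. The inline XML is a hypothetical, cut-down sample shaped like the question's input, not the full file.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-level sample, shaped like the question's input.
root = ET.fromstring(
    '<doc>'
    '<pid id="231"><label key="">Electronics</label>'
    '<cid id="122"><label key="">TV</label></cid></pid>'
    '<pid id="7764"><label key="">Music</label></pid>'
    '</doc>'
)

# 'pid/label' matches only <label> elements directly under a top-level <pid>.
top_level = [el.text for el in root.findall('pid/label')]

# './/label' matches <label> elements at any depth, in document order.
all_labels = [el.text for el in root.findall('.//label')]

print(top_level)
print(all_labels)
```

Paths are evaluated relative to the element `findall` is called on, so with `root` being `<doc>` the search has to start from its `<pid>` children rather than from `doc` itself.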

Answers

2

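We will use three parts: find all the elements in the order they appear, get the depth of each element, and build the breadcrumbs from the depth and the order.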

from lxml import etree 

# xml_str is assumed to hold the XML document from the question as a string 
xml = etree.fromstring(xml_str) 
elems = xml.xpath('//label')  # XPath expression to find all <label ...> elements 

# counts the number of ancestors between the element and the root 
def get_depth(element): 
    depth = 0 
    parent = element.getparent() 
    while parent is not None: 
        depth += 1 
        parent = parent.getparent() 
    return depth 

# build up the bread crumbs by tracking the depth 
# when a new element is entered, it replaces the value in the list 
# at that level and drops all values to the right 
def reduce_by_depth(element_list): 
    crumbs = [] 
    # fixed-size buffer: assumes no label sits more than 9 levels below the root 
    elem_crumb = [''] * 10 
    for elem in element_list: 
        depth = get_depth(elem) 
        elem_crumb[depth] = elem.text 
        elem_crumb[depth+1:] = [''] * (10 - depth - 1) 
        # join all the non-empty strings to get the breadcrumb 
        crumbs.append('/'.join([e for e in elem_crumb if e])) 
    return crumbs 

reduce_by_depth(elems) 

# output: 
['Electronics', 
'Electronics/TV', 
'Electronics/Computers', 
'Electronics/Computers/Lenovo', 
'Electronics/Computers/IBM', 
'Electronics/Computers/Mac', 
'Music', 
'Music/Play', 
'Music/Play/PQR', 
'Music/iTunes', 
'Music/iTunes/XYZ', 
'Music/iTunes/ABC'] 
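If lxml is not available, one possible alternative (a sketch, not the method used in the answer above) is to recurse over the tree with the standard library and carry the path down as an argument, which avoids both `getparent()` and the fixed-size list. The helper names `walk` and `breadcrumbs` and the trimmed `SAMPLE` document are made up for this illustration; the sketch assumes every `<pid>` and `<cid>` has a direct `<label>` child, as in the question's input.

```python
import xml.etree.ElementTree as ET

# Hypothetical trimmed-down sample with the same shape as the question's input.
SAMPLE = """
<doc>
  <pid id="231">
    <label key="">Electronics</label>
    <cid id="122"><label key="">TV</label></cid>
    <cid id="123">
      <label key="">Computers</label>
      <cid id="12433"><label key="">Lenovo</label></cid>
      <cid id="12434"><label key="">IBM</label></cid>
    </cid>
  </pid>
</doc>
"""

def walk(node, prefix=()):
    """Yield one 'a/b/c' path per <pid>/<cid> node, parents before children."""
    label = node.find('label')           # direct child <label>, or None
    if label is None or not label.text:
        return                           # skip nodes without a usable label
    path = prefix + (label.text,)
    yield '/'.join(path)
    for child in node.findall('cid'):    # direct <cid> children only
        yield from walk(child, path)

def breadcrumbs(xml_text):
    root = ET.fromstring(xml_text)
    return [p for pid in root.findall('pid') for p in walk(pid)]

paths = breadcrumbs(SAMPLE)
print('\n'.join(paths))
```

Like the answer's version, this lists intermediate nodes such as `Electronics/Computers` as well as the leaves. To keep only leaf paths, `walk` could yield the joined path only when `node.findall('cid')` is empty.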

I got the tree structure in the format mentioned. Thanks a lot... the breadcrumb logic is really nice :) – Debaditya