2012-06-22 21 views

Parsing an XML block with lxml

Consider the following XML:

<language>en-US</language> 
<provider>VenturesLLC</provider> 
<video> 
    <original_spoken_locale>en-US</original_spoken_locale> 
    <vendor_offer_code>TEST_VENDOR</vendor_offer_code> 
    <release_date>2011-01-15</release_date> 
    <title>Moving Forward</title> 
    <vendor_id>ASDF_ING_2012</vendor_id> 
</video> 

I want to retrieve the entire <video> block. However, when I do this:

>>> f=open('metadata.xml') 
>>> contents=f.read() 
>>> node=etree.fromstring(contents) 
>>> node.xpath("//*[local-name()='video']")[0].text 
'\n 

Note that if I do something like node.xpath("//*[local-name()='original_spoken_locale']")[0].text, I get the correct value, 'en-US'. How can I pull out the complete text of the block, so that I get:

text = """  
<video> 
    <original_spoken_locale>en-US</original_spoken_locale> 
    <vendor_offer_code>TEST_VENDOR</vendor_offer_code> 
    <release_date>2011-01-15</release_date> 
    <title>Moving Forward</title> 
    <vendor_id>ASDF_ING_2012</vendor_id> 
</video>""" 

Answers


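The .text call didn't work because your video node has no text of its own, only child nodes; .text returns just the whitespace that precedes the first child. You need to serialize the node and its children to a string using tostring: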

In [1]: from lxml import etree 

In [2]: xml = '''<xml> 
    ...: <language>en-US</language> 
    ...: <provider>VenturesLLC</provider> 
    ...: <video> 
    ...:  <original_spoken_locale>en-US</original_spoken_locale> 
    ...:  <vendor_offer_code>TEST_VENDOR</vendor_offer_code> 
    ...:  <release_date>2011-01-15</release_date> 
    ...:  <title>Moving Forward</title> 
    ...:  <vendor_id>ASDF_ING_2012</vendor_id> 
    ...: </video></xml>''' 

In [3]: tree = etree.fromstring(xml) 

In [4]: vid = tree.xpath('//video')[0] 

In [5]: etree.tostring(vid, pretty_print=True, encoding='unicode') 
Out[5]: '<video>\n <original_spoken_locale>en-US</original_spoken_locale>\n <vendor_offer_code>TEST_VENDOR</vendor_offer_code>\n <release_date>2011-01-15</release_date>\n <title>Moving Forward</title>\n <vendor_id>ASDF_ING_2012</vendor_id>\n</video>\n' 

In [6]: print(_) 
<video> 
    <original_spoken_locale>en-US</original_spoken_locale> 
    <vendor_offer_code>TEST_VENDOR</vendor_offer_code> 
    <release_date>2011-01-15</release_date> 
    <title>Moving Forward</title> 
    <vendor_id>ASDF_ING_2012</vendor_id> 
</video> 

You can use 'node.text_content()' to get all the text under a node as a single string, or 'node.itertext()' to iterate over the contents of each text node separately. – spiralx
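
A short sketch of the text-only route this comment describes, for the case where only the values are wanted and not the markup. Note that `text_content()` is defined on `lxml.html` elements rather than on plain `lxml.etree` elements, so this example sticks to `itertext()`, which both provide. The sample XML is a trimmed copy of the one in the question.

```python
from lxml import etree

video = etree.fromstring(
    "<video>\n"
    "    <release_date>2011-01-15</release_date>\n"
    "    <title>Moving Forward</title>\n"
    "</video>"
)

# itertext() yields every text node in document order, including the
# whitespace between elements, so blank pieces are filtered out here.
pieces = [t.strip() for t in video.itertext() if t.strip()]
print(pieces)

# Joining the raw pieces gives all text under the node as one string.
all_text = "".join(video.itertext())
```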