2016-07-29 150 views

lxml removes unwrapped text inside a tag

Here is my Python code:

from lxml import etree


some_xml_data = "<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>" 
root = etree.fromstring(some_xml_data) 
[c] = root.xpath('//span') 
print(etree.tostring(root))  # b'<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>' (output as expected)

# But if I make some changes:
for e in c.iterchildren("*"):
    if e.tag == 'div':
        e.getparent().remove(e)

print(etree.tostring(root))  # b'<span>text1</span>' (text2 and text3 are removed! How do I prevent this deletion?)

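It looks like after I make some changes to the lxml tree (removing some tags), lxml also removes some of the unwrapped text! How can I prevent lxml from doing this and keep the unwrapped text?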

Answer

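The text that follows a node is called its tail, and lxml removes it together with the node. The tails can be preserved by appending them to the parent's text before removing the node. Here is an example: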

In [1]: from lxml import html 

In [2]: s = "<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>" 
    ...: 

In [3]: tree = html.fromstring(s) 

In [4]: for node in tree.iterchildren("div"):
    ...:     if node.tail:
    ...:         node.getparent().text += node.tail
    ...:     node.getparent().remove(node)
    ...:

In [5]: html.tostring(tree) 
Out[5]: b'<span>text1text2text3</span>' 

I used lxml.html because the markup looks more like HTML than XML. You can simply call iterchildren("div") to avoid the extra check on the tag.
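
The loop above assumes that the parent already has leading text and that the removed node is preceded only by text. If the parent's text is None, the += raises a TypeError, and if another element comes before the removed node, the tail ends up in the wrong place. Below is a minimal sketch of a more defensive variant. The helper name remove_keep_tail and the extra <b> element in the sample input are made up for illustration and are not part of lxml or of the original question.

```python
from lxml import etree


def remove_keep_tail(node):
    """Remove node from its parent but keep the text that follows it (node.tail).

    Hypothetical helper for illustration; not an lxml API.
    """
    parent = node.getparent()
    if parent is None:
        raise ValueError("cannot remove the root element")
    if node.tail:
        previous = node.getprevious()
        if previous is not None:
            # A sibling precedes the node: the tail belongs after that sibling.
            previous.tail = (previous.tail or "") + node.tail
        else:
            # No preceding sibling: the tail joins the parent's leading text.
            parent.text = (parent.text or "") + node.tail
    parent.remove(node)


# Sample input with an extra <b> element before the first text run (assumption).
root = etree.fromstring(
    "<span><b>keep</b>text1<div>ddd</div>text2<div>ddd</div>text3</span>"
)

# Snapshot the children first so the tree is not modified while iterating.
for div in list(root.iterchildren("div")):
    remove_keep_tail(div)

result = etree.tostring(root)
print(result)
```

If the tree was parsed with lxml.html rather than lxml.etree, the elements should also offer a drop_tree() method, which removes the element and joins its tail to the previous element or the parent, so a helper like this may not be needed there.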