Python lxml: remove the parent element before outputting HTML (when using fragment_fromstring)

I am using lxml to parse some HTML fragments (from RSS feeds), and to do this reliably I pass create_parent='div'. When I later output the HTML, I do not want the parent div included, because in my HTML layout it ends up as a redundant div inside a div.

The code right now is:

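import lxml.html
from lxml.html import fragment_fromstring

# Wrap the fragment in a <div> so that several top-level nodes parse cleanly.
html = fragment_fromstring(html_string, create_parent='div')

# Strip the class and id attributes from every element that has them.
for tag in html.xpath('//*[@class]'):
    tag.attrib.pop('class')
for tag in html.xpath('//*[@id]'):
    tag.attrib.pop('id')

return lxml.html.tostring(html)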

TL;DR: How do I remove the wrapping div when outputting?

Source

2013-06-29 Alexander Kuzmin

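This might be the answer: "remove the wrapping div" by stepping over it and passing on the child node: `lxml.etree.tostring(html_doc.xpath('*')[0])`. Warning: untested code. I have only been using Python lxml for 15 years. Whoever tests this code should write up the answer. – Phlip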

Extract the child elements.

return '\n'.join(lxml.html.tostring(x) for x in html.iterchildren())

Source

2013-06-29 15:08:01 falsetru

But this does not extract the text nodes. –

@Hemant_Negi, do you want something like `html.text_content().strip()`? – falsetru

I want everything inside the element. It can contain HTML nodes as well as text. I think that just returns the text. –
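
One way to get "everything inside the element", as asked in the comment above, is to emit the wrapper's own leading text before serializing its children. This is a minimal sketch under that assumption; the name inner_html is hypothetical, and the leading text is re-escaped with the standard library's html.escape because element.text holds the unescaped string.

```python
from html import escape

from lxml.html import fragment_fromstring, tostring


def inner_html(element):
    """Return the markup inside `element`, excluding its own tags.

    Hypothetical helper; a sketch, not an lxml API.
    """
    # Text that appears before the first child lives on element.text.
    parts = [escape(element.text, quote=False)] if element.text else []
    # tostring includes each child's tail text by default, so text that
    # follows a child element is preserved as well.
    parts.extend(tostring(child, encoding='unicode') for child in element)
    return ''.join(parts)


wrapper = fragment_fromstring('hello <b>world</b> &amp; more', create_parent='div')
result = inner_html(wrapper)
```

Joining with an empty string, not a newline, avoids adding whitespace that was not in the input.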
