Get the entire content inside a tag, including HTML tags

import lxml.html as PARSER

data = """<TextFormat>06</TextFormat>
<Text><![CDATA[<html><body><p>Ducdame was John Cowper Powys<p>other text</p></p></body></html>]]></Text>"""
root = PARSER.fromstring(data)

# The HTML parser lowercases tag names, so <Text> is matched as 'text'.
for ele in root.iter():
    if ele.tag == 'text':
        print(ele.text_content())

This is what I get now -> Ducdame was John Cowper Powysother text

But I need the entire content inside the "Text" tag. This is the result I expect:

<![CDATA[<html><body><p>Ducdame was John Cowper Powys<p>other text</p></p></body></html>]]>

I have tried lxml and BeautifulSoup but did not get the result I expect. I really need help with this.

Thanks

2014-02-18 Tanveer Alam

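This does not work because your data is not encoded correctly. You cannot use a string containing XML syntax characters as text inside XML. Encode < and > as &lt; and &gt;, etc., and it will work. – Michael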
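A minimal sketch of the escaping suggested in the comment above, using only the standard library (xml.sax.saxutils.escape and xml.etree.ElementTree rather than lxml). The variable names are illustrative and not from the original post. It shows that escaped markup survives a round trip through an XML parser as a plain string.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

inner_html = ("<html><body><p>Ducdame was John Cowper Powys"
              "<p>other text</p></p></body></html>")

# escape() replaces &, < and > with &amp;, &lt; and &gt;.
escaped = escape(inner_html)
document = "<Text>" + escaped + "</Text>"

# The XML parser decodes the entities again, so the element text
# is the original markup as a plain string.
element = ET.fromstring(document)
recovered = element.text

print(escaped)
print(recovered == inner_html)
```

Whether escaping is an option depends on who produces the file; if the input arrives with CDATA sections already in place, it has to be parsed as it is.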

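Actually, this input comes from an .onx file, but I do not know how I should parse it, so I tried the lxml library. This is exactly the input I get from my input file. –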

Here is an example using lxml. To find the right tag, use XPath, here .//text:

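from lxml import html
from lxml import etree

text = """<TextFormat>06</TextFormat>
<Text><![CDATA[<html><body><p>Ducdame was John Cowper Powys<p>other text</p></p></body> </html>]]></Text>"""

tree = html.fromstring(text)
# lxml's HTML parser lowercases tag names, so <Text> is found as 'text'.
tags = tree.xpath('.//text')

text_tag = tags[-1]
# encoding='unicode' returns a str rather than bytes on Python 3.
print(etree.tostring(text_tag, encoding='unicode'))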

Output:

'<text><p>Ducdame was John Cowper Powys</p><p>other text</p></text>'

If you also need the CDATA wrapper, you may find the following post useful: How to output CDATA using ElementTree
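As a rough alternative sketch, the standard library's xml.etree.ElementTree (swapped in here for lxml.html) returns the content of a CDATA section as the element's text, with the inner markup left untouched. The <root> wrapper and the variable names are assumptions added for illustration; the wrapper is needed because the fragment has two top-level elements and is not well-formed XML on its own.

```python
import xml.etree.ElementTree as ET

data = """<TextFormat>06</TextFormat>
<Text><![CDATA[<html><body><p>Ducdame was John Cowper Powys<p>other text</p></p></body></html>]]></Text>"""

# Wrap the fragment in a hypothetical <root> element so it parses as XML.
root = ET.fromstring("<root>" + data + "</root>")

# XML tag names are case-sensitive, so the lookup uses 'Text', not 'text'.
text_element = root.find("Text")

# The CDATA content comes back as a plain string with the markup intact.
inner_markup = text_element.text

# Re-add the CDATA markers by hand; this assumes the content itself
# never contains the sequence ']]>'.
cdata_string = "<![CDATA[" + inner_markup + "]]>"
print(cdata_string)
```

An XML parser is used here because an HTML parser interprets the markup inside the CDATA section instead of preserving it as text.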

2014-02-18 12:19:48 Jon

Sir, if possible, could you tell me how to get just the CDATA for this example? –

The example below works with the minidom module.

import xml.dom.minidom

data = """<Text><![CDATA[<html><body><p>Ducdame was John Cowper Powys<p>other text</p></p></body></html>]]></Text>"""

doc = xml.dom.minidom.parseString(data)
text_element = doc.childNodes[0]          # the <Text> element
cdata_node = text_element.childNodes[0]   # the CDATA section inside it
print(cdata_node.toxml())

2014-02-18 11:31:18 jorispilot

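Thank you sir, this is exactly what I expected. But how should I iterate over the "Text" tags? Suppose my file has two tags: <TextFormat>02</TextFormat> <Text><![CDATA[<html><body><p>Ducdame was John Cowper Powys<p>other text</p></p></body></html>]]></Text>. How would I then reach the "Text" tag? –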
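One possible way to handle several "Text" elements with minidom is sketched below. It assumes the fragment can be wrapped in a single hypothetical <root> element so that it parses as XML; the sample data and variable names are placeholders, not taken from the actual .onx file.

```python
import xml.dom.minidom

# Placeholder input with two <Text> elements (illustrative data only).
data = """<TextFormat>06</TextFormat>
<Text><![CDATA[<p>first</p>]]></Text>
<TextFormat>02</TextFormat>
<Text><![CDATA[<p>second</p>]]></Text>"""

# A document needs exactly one top-level element, hence the wrapper.
doc = xml.dom.minidom.parseString("<root>" + data + "</root>")

cdata_sections = []
# getElementsByTagName returns every <Text> element in document order.
for text_element in doc.getElementsByTagName("Text"):
    for node in text_element.childNodes:
        # Keep only CDATA children and skip any whitespace text nodes.
        if node.nodeType == node.CDATA_SECTION_NODE:
            cdata_sections.append(node.toxml())

for section in cdata_sections:
    print(section)
```

Checking nodeType avoids relying on childNodes[0], which would break if a "Text" element held whitespace before its CDATA section.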
