2015-12-16 44 views

ElementTree: text mixed with tags

Imagine the following text:

<description> 
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>. 
</description> 

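How would I go about parsing this with the etree interface? Given the description tag, the .text attribute returns only the first words ("the thing"), and the .getchildren() method returns the <b> elements, but not the rest of the text.

Many thanks!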

Answers

Use .text_content(). A working sample using lxml.html:

from lxml.html import fromstring 

data = """ 
<description> 
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>. 
</description> 
""" 

tree = fromstring(data) 

print(tree.xpath("//description")[0].text_content().strip()) 

Prints:

the thing stuff is very important for various reasons, notably other things. 
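
Since the question asks about the etree interface, here is a minimal sketch of the same idea using only the standard-library xml.etree.ElementTree, assuming the input is well-formed XML. It shows why .text stops early (it holds only the text before the first child) and uses itertext() to collect every piece. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

data = """
<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>
"""

root = ET.fromstring(data)

# .text holds only the text that precedes the first child element
leading_text = root.text.strip()

# itertext() yields the text of the element and all its descendants,
# including the text after each child, in document order
full_text = "".join(root.itertext()).strip()

print(leading_text)
print(full_text)
```

Unlike lxml.html, the standard-library parser is strict, so this only works when the markup is valid XML.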

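One thing I forgot to specify, sorry. My ideal parsed version would contain a list of sections: [normal("the thing"), bold("stuff"), normal("....")]. Is that possible with the lxml.html library?

Assuming you only have text nodes and b elements inside the description: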

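for item in tree.xpath("//description/*|//description/text()"):
    # text() results are str subclasses; anything else is a <b> element
    # (on Python 2, check against basestring instead of str)
    print([item.strip(), 'normal'] if isinstance(item, str) else [item.text, 'bold'])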

Prints:

['the thing', 'normal'] 
['stuff', 'bold'] 
['is very important for various reasons, notably', 'normal'] 
['other things', 'bold'] 
['.', 'normal'] 
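
The same split can also be sketched with the standard-library ElementTree by walking .text and .tail directly, without XPath. This is only a sketch under the same assumption as above (every child of description is a b element); split_segments is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET


def split_segments(elem):
    """Return [text, style] pairs for an element whose children are all <b> (hypothetical helper)."""
    segments = []
    # Text before the first child lives in elem.text
    if elem.text and elem.text.strip():
        segments.append([elem.text.strip(), "normal"])
    for child in elem:
        # itertext() flattens the child's content, including any nested markup
        segments.append(["".join(child.itertext()).strip(), "bold"])
        # Text after the child's closing tag lives in child.tail
        if child.tail and child.tail.strip():
            segments.append([child.tail.strip(), "normal"])
    return segments


data = (
    "<description>the thing <b>stuff</b> is very important for various "
    "reasons, notably <b>other things</b>.</description>"
)
root = ET.fromstring(data)
segments = split_segments(root)
for segment in segments:
    print(segment)
```

Whitespace-only pieces are skipped, so the result has the same five entries as the XPath version.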
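
One thing I forgot to specify, sorry. My ideal parsed version would contain a list of sections: [normal("the thing"), bold("stuff"), normal("....")]. Is that possible with the lxml.html library? –

@DanielLovasko sure, updated. – alecxe

Wow, pretty cool. Thanks! @alecxe –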