How do I parse text from HTML using lxml?

<p> 
    Glassware veteran 
    <strong>Corning </strong> 
    (
    <span class="ticker"> 
     NYSE: 
     <a class="qsAdd qs-source-isssitthv0000001" href="http://caps.fool.com/Ticker/GLW.aspx?source=isssitthv0000001" data-id="203758">GLW</a> 
    </span> 
    <a class="addToWatchListIcon qsAdd qs-source-iwlsitbut0000010" href="http://my.fool.com/watchlist/add?ticker=&source=iwlsitbut0000010" title="Add to My Watchlist"> </a> 
    ) has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback? 
</p>

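I want to get "Glassware veteran" and "has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback?" How do I parse this text from the HTML with lxml?

Using this code: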

tnode = root.xpath("/p")[0]  # xpath() returns a list, so take the first match
content = tnode.text

I only get "Glassware veteran". Why?

Source

2012-12-06 yinyao

Something like this might get you what you want:

>>> tnode = root.xpath('/p')[0]  # xpath() returns a list of matches
>>> content = tnode.xpath('text()')
>>> print(''.join(content))

Glassware veteran 

(


) has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback? 
>>>
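
As for why .text stops at "Glassware veteran": in the ElementTree data model that lxml follows, an element's .text is only the text before its first child, and any text after a child is stored on that child as .tail. Here is a minimal sketch of that, using a simplified, well-formed stand-in for the question's markup (the snippet is an assumption for illustration, not the real page). The import falls back to the standard library, which has the same .text/.tail attributes.

```python
try:
    from lxml import etree  # the library discussed here
except ImportError:
    # Standard-library fallback; same .text/.tail model
    import xml.etree.ElementTree as etree

# Simplified, well-formed stand-in for the <p> element in the question.
snippet = (
    "<p>Glassware veteran <strong>Corning</strong> ("
    "<span>NYSE: <a>GLW</a></span>"
    ") has fallen on hard times lately.</p>"
)
p = etree.fromstring(snippet)

# .text holds only the text that comes before the first child element.
print(repr(p.text))

# Text that follows a child element lives on that child as .tail.
strong = p.find("strong")
span = p.find("span")
print(repr(strong.tail))
print(repr(span.tail))

# itertext() yields every .text and .tail below <p> in document order.
full_text = "".join(p.itertext())
print(full_text)
```

So the rest of the sentence is not lost; it sits in the .tail of the strong and span children.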

If you want all of the text nodes, just use //text() instead of text():

>>> print(' '.join(x.strip() for x in tnode.xpath('//text()')))
Glassware veteran Corning (NYSE: GLW ) has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback?
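
For a self-contained comparison, the sketch below runs both queries against a shortened copy of the question's markup (attributes dropped, sentence trimmed; both are assumptions for brevity). Note that //text() always starts from the document root, so when the paragraph sits inside a larger page, the relative form .//text() is probably what you want. The squash helper is a hypothetical name introduced here to collapse whitespace.

```python
from lxml import html

# Shortened copy of the question's markup (attributes dropped for brevity).
snippet = """
<p>
    Glassware veteran
    <strong>Corning </strong>
    (
    <span class="ticker">
     NYSE:
     <a href="#">GLW</a>
    </span>
    ) has fallen on hard times lately.
</p>
"""
root = html.fromstring(snippet)
p = root.xpath("//p")[0]  # xpath() returns a list; take the first <p>


def squash(parts):
    """Join text pieces and collapse runs of whitespace into single spaces."""
    return " ".join("".join(parts).split())


# text(): only the text nodes that are direct children of <p>.
direct = squash(p.xpath("text()"))

# .//text(): text nodes at any depth below <p>, in document order.
nested = squash(p.xpath(".//text()"))

# text_content() is lxml.html's shortcut for all the text inside an element.
same = squash([p.text_content()])

print(direct)
print(nested)
```

The direct result skips "Corning" and "NYSE: GLW" because those strings belong to child elements, while the nested result keeps them in reading order.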

Source

2012-12-06 15:13:18 larsks

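Thank you very much. But now I have a new problem: I want to get "Glassware veteran Corning (NYSE: GLW) has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback?" Using the code: tnode = root.xpath('/p | /p/strong | /p/a | /p/span') content = tnode.xpath('text()') print ''.join(content) the result is: "Glassware veteran ( ) has fallen on hard times lately. Is it time to give up on the stock, or will Corning have a banana and a comeback? Corning NYSE: GLW". Do you have any ideas? Thanks. – yinyao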

I've updated my answer. – larsks
