2016-09-07 101 views

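Python: lxml XPath for extracting content

The following code extracts the P/E ratio from the Reuters pages linked below. My approach is not robust, though: another stock's page has two fewer rows, which shifts the data, so the hard-coded index points at the wrong cell. How can I work around this? I would like to target the P/E row directly, but I don't know how.

Link 1: http://www.reuters.com/finance/stocks/financialHighlights?symbol=MYEG.KL
Link 2: http://www.reuters.com/finance/stocks/financialHighlights?symbol=ANNJ.KL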

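import requests
from lxml import html

page2 = requests.get('http://www.reuters.com/finance/stocks/financialHighlights?symbol=MYEG.KL')
treea = html.fromstring(page2.content)
# Collect the text of every <td> that has a class attribute
tree4 = treea.xpath('//td[@class]/text()')
# Fragile: index 37 only holds the P/E value when the page has the expected number of rows
PE = tree4[37]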

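This is the part of the page I want to extract. I would like the code to pick out only this row, so that changes elsewhere on the page do not affect it.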

<tr class="stripe">
    <td>P/E Ratio (TTM)</td>
    <td class="data">36.79</td>
    <td class="data">25.99</td>
    <td class="data">21.70</td>
</tr>

Answer


Find the td by its text, then extract the text of its sibling tds:

treea.xpath('//td[contains(.,"P/E Ratio")]/following-sibling::td/text()') 
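To see why this expression does not depend on row position, here is a minimal offline sketch. The two page fragments (`PAGE_A`, `PAGE_B`) and the helper name `pe_cells` are made up for illustration; only the P/E row layout and the XPath come from the question and answer above.

```python
from lxml import html

# Hypothetical fragments: the same kind of P/E row, preceded by a
# different number of other rows (the situation that breaks tree4[37]).
PAGE_A = """
<table>
  <tr><td>Some other row</td><td class="data">1.00</td></tr>
  <tr><td>Another row</td><td class="data">2.00</td></tr>
  <tr class="stripe">
    <td>P/E Ratio (TTM)</td>
    <td class="data">36.79</td>
    <td class="data">25.99</td>
    <td class="data">21.70</td>
  </tr>
</table>
"""

PAGE_B = """
<table>
  <tr class="stripe">
    <td>P/E Ratio (TTM)</td>
    <td class="data">--</td>
    <td class="data">25.49</td>
    <td class="data">17.30</td>
  </tr>
</table>
"""

# Select the label cell by its text, then take the text of the cells after it
PE_XPATH = '//td[contains(., "P/E Ratio")]/following-sibling::td/text()'


def pe_cells(markup):
    """Return the text of the cells that follow the P/E label cell."""
    return html.fromstring(markup).xpath(PE_XPATH)


pe_a = pe_cells(PAGE_A)
pe_b = pe_cells(PAGE_B)
print(pe_a)
print(pe_b)
```

Because the row is selected by its label rather than by a list index, adding or removing unrelated rows leaves the result unchanged.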

This works on both pages, regardless of where the row sits:

In [8]: page2 = requests.get('http://www.reuters.com/finance/stocks/financialHighlights?symbol=MYEG.KL') 

In [9]: treea = html.fromstring(page2.content)  
In [10]: tree4 = treea.xpath('//td[contains(.,"P/E Ratio")]/following-sibling::td/text()') 

In [11]: print(tree4) 
['36.79', '25.99', '21.41'] 

In [12]: page2 = requests.get('http://www.reuters.com/finance/stocks/financialHighlights?symbol=ANNJ.KL') 
In [13]: treea = html.fromstring(page2.content) 

In [14]: tree4 = treea.xpath('//td[contains(.,"P/E Ratio")]/following-sibling::td/text()') 

In [15]: print(tree4) 
['--', '25.49', '17.30']
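The second result contains `'--'` where a value is missing, so converting the strings to numbers needs a guard. A possible sketch follows; `parse_ratio` is a hypothetical helper name, and treating any non-numeric cell as missing is an assumption, not something the page documents.

```python
def parse_ratio(cell):
    """Convert a scraped cell such as '36.79' to float.

    Returns None for anything that is not a number, for example the
    '--' placeholder (assumed here to mean "no value available").
    """
    text = cell.strip().replace(',', '')
    try:
        return float(text)
    except ValueError:
        return None


myeg = [parse_ratio(c) for c in ['36.79', '25.99', '21.41']]
annj = [parse_ratio(c) for c in ['--', '25.49', '17.30']]
print(myeg)
print(annj)
```

Returning `None` instead of raising keeps the list positions aligned, so the caller can still tell which column was empty.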