Why doesn't this XPath work?

I'm trying to get a stock's company name, sector, and industry. I download the HTML from 'https://finance.yahoo.com/q/in?s={}+Industry'.format(sign) and then try to parse it with .xpath() from lxml.html.

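To get the XPath for the data I'm trying to scrape, I open the site in Chrome, right-click the item, click Inspect Element, right-click the highlighted area, and click Copy XPath. This has always worked for me in the past.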

import requests 
from lxml import html 

page_p = 'https://finance.yahoo.com/q/in?s=AAPL+Industry' 
name_p = '//*[@id="yfi_rt_quote_summary"]/div[1]/div/h2/text()' 
sect_p = '//*[@id="yfncsumtab"]/tbody/tr[2]/td[1]/table[2]/tbody/tr/td/table/tbody/tr[1]/td/a/text()' 
indu_p = '//*[@id="yfncsumtab"]/tbody/tr[2]/td[1]/table[2]/tbody/tr/td/table/tbody/tr[2]/td/a/text()' 

page = requests.get(page_p) 
tree = html.fromstring(page.text) 

name = tree.xpath(name_p) 
sect = tree.xpath(sect_p) 
indu = tree.xpath(indu_p) 

print('Name: {}\nSector: {}\nIndustry: {}'.format(name, sect, indu))

Running the code above (using Apple as an example) reproduces the problem and gives this output:

Name: ['Apple Inc. (AAPL)'] 
Sector: [] 
Industry: []

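The download itself isn't the problem, because name is retrieved correctly, but the other two queries return nothing. If I replace their paths with //tr[1]/td/a/text() and //tr[2]/td/a/text(), respectively, I get this: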

Name: ['Apple Inc. (AAPL)'] 
Sector: ['Consumer Goods', 'Industry Summary', 'Company List', 'Appliances', 'Recreational Goods, Other'] 
Industry: ['Electronic Equipment', 'Apple Inc.', 'AAPL', 'News', 'Industry Calendar', 'Home Furnishings & Fixtures', 'Sporting Goods']

Obviously, I could just take the first item from each list to get the data I need.

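What I don't understand is that when I add tbody/ to the beginning (//tbody/tr[#]/td/a/text()), it fails again, even though the Chrome console clearly shows both tr elements as children of a tbody element.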

[Screenshot: Chrome console showing the HTML hierarchy]

Why does this happen?


2015-04-05 spelchekr

Browsers parse HTML and build an element tree from it; in the process, they insert elements that may be missing from the input HTML document.

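In this case, the <tbody> elements are not in the source HTML. Your browser inserts them because, when they are missing, they are implied by the table structure. lxml, however, does not insert them.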
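A minimal sketch of that difference, assuming a small hand-written table as a stand-in for the real page (the `demo` id and the `RAW` string are made up for illustration). It shows that lxml keeps the tree as written in the source, so a path that goes through tbody matches nothing:

```python
from lxml import html

# Hypothetical stand-in for the raw page source: note there is no <tbody>.
RAW = (
    '<html><body><table id="demo">'
    '<tr><td><a>Consumer Goods</a></td></tr>'
    '<tr><td><a>Electronic Equipment</a></td></tr>'
    '</table></body></html>'
)

tree = html.fromstring(RAW)

# Path copied "browser style", with the tbody step the browser would add.
with_tbody = tree.xpath('//*[@id="demo"]/tbody/tr[1]/td/a/text()')

# Same path without the tbody step, matching the source as written.
without_tbody = tree.xpath('//*[@id="demo"]/tr[1]/td/a/text()')

# Count tbody elements in the tree lxml actually built.
tbody_count = len(tree.xpath('//tbody'))

print(with_tbody, without_tbody, tbody_count)
```

The first query comes back empty and the second finds the link text, which mirrors the behaviour described in the question.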

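For that reason, your browser's developer tools are not the best tool for building XPath queries: they show the tree the browser built, not the HTML that was actually served.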
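One way to check what the served HTML really contains, rather than what the browser shows, is to scan the raw text for a given tag. The sketch below uses only the standard library; `TagCollector`, `source_has_tag`, and the sample `raw` string are hypothetical names and data for illustration. In practice `raw` would be something like `page.text` from the question's code.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the name of every start tag seen in the raw markup."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        self.tags.add(tag)


def source_has_tag(raw_html, tag):
    """Return True if the tag literally appears in the raw HTML text."""
    collector = TagCollector()
    collector.feed(raw_html)
    collector.close()
    return tag in collector.tags


# Hypothetical sample standing in for the downloaded page source.
raw = '<table id="yfncsumtab"><tr><td><a>Consumer Goods</a></td></tr></table>'

print(source_has_tag(raw, 'tbody'))
print(source_has_tag(raw, 'tr'))
```

Unlike a browser, `html.parser` does not add implied elements, so a tag it reports is one that is physically present in the source.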

Removing the tbody/ path steps produces the results you are looking for:

>>> sect_p = '//*[@id="yfncsumtab"]/tr[2]/td[1]/table[2]/tr/td/table/tr[1]/td/a/text()' 
>>> indu_p = '//*[@id="yfncsumtab"]/tr[2]/td[1]/table[2]/tr/td/table/tr[2]/td/a/text()' 
>>> tree.xpath(sect_p) 
['Consumer Goods'] 
>>> tree.xpath(indu_p) 
['Electronic Equipment']
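If the same query has to work whether or not the source happens to include tbody, one option is an XPath union that accepts both shapes. This is only a sketch under assumptions: `cell_link_text`, the `demo` id, and both sample documents are invented for illustration, and it is not part of the fix above.

```python
from lxml import html


def cell_link_text(tree, table_id, row):
    """Link text in the given row, with or without a tbody wrapper."""
    # Union of direct rows and rows nested in tbody, indexed in document order.
    query = (
        '(//*[@id="{0}"]/tr | //*[@id="{0}"]/tbody/tr)[{1}]/td/a/text()'
    ).format(table_id, row)
    return tree.xpath(query)


ROWS = (
    '<tr><td><a>Consumer Goods</a></td></tr>'
    '<tr><td><a>Electronic Equipment</a></td></tr>'
)

# Hypothetical documents: one without tbody, one with it written explicitly.
plain = html.fromstring(
    '<html><body><table id="demo">' + ROWS + '</table></body></html>'
)
wrapped = html.fromstring(
    '<html><body><table id="demo"><tbody>' + ROWS
    + '</tbody></table></body></html>'
)

print(cell_link_text(plain, 'demo', 1), cell_link_text(wrapped, 'demo', 2))
```

The union keeps the match anchored to rows of that one table, which avoids the extra hits the question saw with //tr[1].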


2015-04-05 21:53:00
