Unable to parse Google Finance HTML


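I am trying to scrape some stock prices and their variations from Google Finance using python3, but I can't figure out whether the problem is with the page or with my regex. I suspect that the SVG graphics or the many script tags throughout the page are keeping the regex from analyzing the markup correctly.

I tested this regex on several online regex builders/testers and it looks fine. As fine as a regex designed for HTML can be, anyway.

The Google Finance page I am testing this on is https://www.google.com/finance?q=NYSE%3AAAPL and my Python code is as follows: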

import urllib.request 
import re 
page = urllib.request.urlopen('https://www.google.com/finance?q=NYSE%3AAAPL') 
text = page.read().decode('utf-8') 
m = re.search("id=\"price-panel.*>(\d*\d*\d\.\d\d)</span>.*\((-*\d\.\d\d%)\)", text, re.S) 
print(m.groups())

This should extract the stock price and its percentage change. I have also tried python2 + BeautifulSoup, like this:

soup.find(id='price-panel')

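But even a query this simple returns nothing. That in particular is why I think there is something odd about the HTML.

And here is the markup I am targeting: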

<div id="price-panel" class="id-price-panel goog-inline-block"> 
<div> 
<span class="pr"> 
<span class="unchanged" id="ref_22144_l"><span class="unchanged">96.41</span><span></span></span> 
</span> 
<div class="id-price-change nwp goog-inline-block"> 
<span class="ch bld"><span class="down" id="ref_22144_c">-1.13</span> 
<span class="down" id="ref_22144_cp">(-1.16%)</span> 
</span> 
</div> 
</div> 
<div> 
<span class="nwp"> 
Real-time: 
&nbsp; 
<span class="unchanged" id="ref_22144_ltt">3:42PM EDT</span> 
</span> 
<div class="mdata-dis"> 
<span class="dis-large"><nobr>NASDAQ 
real-time data - 
<a href="//www.google.com/help/stock_disclaimer.html#realtime" class="dis-large">Disclaimer</a> 
</nobr></span> 
<div>Currency in USD</div> 
</div> 
</div> 
</div>

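I was wondering whether any of you have run into a similar problem with this page and/or can figure out whether something is wrong with my code; the most relevant bit of the HTML is shown above. Thanks in advance!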
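One possible way to tighten the regex, offered only as a sketch: instead of a greedy `.*` starting at `price-panel`, anchor on the `id` suffixes `_l` (last price) and `_cp` (percent change) that appear in the markup above. This assumes the HTML returned to a script actually matches that markup, which nothing in this thread confirms. `SNIPPET`, `PRICE_RE`, `CHANGE_RE`, and `extract_quote` are hypothetical names used for illustration.

```python
import re

# Trimmed copy of the price-panel markup quoted in the question.
SNIPPET = '''
<div id="price-panel" class="id-price-panel goog-inline-block">
<span class="pr">
<span class="unchanged" id="ref_22144_l"><span class="unchanged">96.41</span><span></span></span>
</span>
<span class="ch bld"><span class="down" id="ref_22144_c">-1.13</span>
<span class="down" id="ref_22144_cp">(-1.16%)</span>
</span>
</div>
'''

# The lazy gap (.*?) stops at the first number after the "_l" element opens.
PRICE_RE = re.compile(r'id="ref_\d+_l">.*?([\d,]+\.\d+)', re.S)
# The percent change sits in parentheses directly inside the "_cp" element.
CHANGE_RE = re.compile(r'id="ref_\d+_cp">\s*\(([-+]?\d+\.\d+%)\)')


def extract_quote(html):
    """Return (price, percent_change) as strings, or None if either is missing."""
    price = PRICE_RE.search(html)
    change = CHANGE_RE.search(html)
    if price is None or change is None:
        return None
    return price.group(1), change.group(1)


print(extract_quote(SNIPPET))
```

Unlike the pattern in the question, `\d+\.\d+` also accepts changes with more than one digit before the decimal point, such as `(-10.25%)`. Returning `None` when nothing matches makes it easier to tell a regex problem apart from a page that simply does not contain the expected markup.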

Source

2014-10-16 Slpk

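FYI: https://www.quandl.com/help/api-for-stock-data I don't know what you need from Google Finance, but you may be able to get it there. – user2023861 2014-10-16 20:50:23

@user2023861 Thanks, I'll check it out. I had searched for other data sources before, but none of them covered all the stocks I hold. I'm trying to get stocks from exchanges other than the NYSE. – Slpk 2014-10-17 14:01:40

You could try a different URL that would be easier to parse, such as: http://www.google.com/finance/info?q=AAPL

The catch is that Google has said that using this API in applications for public consumption is against their terms of service. Maybe there is a way that Google would let you use?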
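A rough sketch of how a response from such an endpoint might be handled. It rests on an assumption this thread does not confirm: that the body is a JSON array preceded by a `//` marker. The payload in `SAMPLE` and the field names `t`, `l`, `c`, and `cp` are hypothetical stand-ins, with values borrowed from the markup in the question. `parse_info_payload` is an illustrative helper, not part of any Google API.

```python
import json


def parse_info_payload(raw):
    """Strip a leading '//' marker, if present, and decode the remaining JSON."""
    body = raw.strip()
    if body.startswith('//'):
        body = body[2:]
    return json.loads(body)


# Hypothetical payload for illustration only; the real format is an assumption.
SAMPLE = '// [ { "t": "AAPL", "l": "96.41", "c": "-1.13", "cp": "-1.16" } ]'

quotes = parse_info_payload(SAMPLE)
for quote in quotes:
    # Print ticker, last price, and percent change for each entry.
    print(quote["t"], quote["l"], quote["cp"] + "%")
```

If the real response differs, only `parse_info_payload` and the field lookups should need to change.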

Source

2014-10-16 21:34:37

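Cool, that's definitely much better than parsing HTML. Another hidden source I found useful is https://www.google.com/finance/getprices?q=AAPL&i=120&p=5d&f=c&df=cpct&auto=1 – Slpk 2014-10-17 14:04:30

I managed to get it working with BeautifulSoup on the link I originally posted.

This is the bit of code I finally used: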

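import urllib2  # Python 2; on Python 3 use urllib.request instead
from bs4 import BeautifulSoup

response = urllib2.urlopen('https://www.google.com/finance?q=NYSE%3AAAPL')
html = response.read()
soup = BeautifulSoup(html, "lxml")
# Price: the innermost span of the first div in the price panel
aaplPrice = soup.find(id='price-panel').div.span.span.text
# Percent change: the text between the parentheses in the second span
aaplVar = soup.find(id='price-panel').div.div.span.find_all('span')[1].string.split('(')[1].split(')')[0]
aapl = aaplPrice + ' ' + aaplVar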

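I couldn't get it to work with BeautifulSoup before because I was actually trying to parse the table on https://www.google.com/finance?q=NYSE%3AAAPL%3BNYSE%3AGOOG rather than the page I posted. Neither of the methods described in my question works on that page.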
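For comparison, here is a dependency-free sketch that uses the standard library's `html.parser` instead of BeautifulSoup and selects elements by `id` suffix rather than by position in the tree. It is a swapped-in technique, not the code used above, and it assumes markup shaped like the price panel quoted in the question. It has not been checked against the multi-stock page. `IdTextCollector`, `VOID_TAGS`, and `HTML` are hypothetical names for illustration.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IdTextCollector(HTMLParser):
    """Collect text inside elements whose id ends with a wanted suffix."""

    def __init__(self, suffixes):
        super().__init__()
        self.suffixes = suffixes
        self.stack = []   # one entry per open tag: matched suffix or None
        self.found = {}   # suffix -> collected text

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        element_id = dict(attrs).get("id") or ""
        match = next((s for s in self.suffixes if element_id.endswith(s)), None)
        self.stack.append(match)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Credit the text to every enclosing element that matched a suffix.
        for suffix in self.stack:
            if suffix is not None:
                self.found[suffix] = self.found.get(suffix, "") + data.strip()


# Condensed version of the markup quoted in the question.
HTML = (
    '<div id="price-panel"><span class="pr">'
    '<span class="unchanged" id="ref_22144_l">'
    '<span class="unchanged">96.41</span><span></span></span>'
    '</span><span class="ch bld"><span class="down" id="ref_22144_c">-1.13</span> '
    '<span class="down" id="ref_22144_cp">(-1.16%)</span></span></div>'
)

collector = IdTextCollector(["_l", "_cp"])
collector.feed(HTML)
collector.close()
price = collector.found.get("_l")
change = (collector.found.get("_cp") or "").strip("()")
print(price, change)
```

Matching on the `_l` and `_cp` suffixes avoids chains like `.div.div.span`, which break as soon as a wrapper element is added or removed. The sketch still assumes well-nested tags, so badly malformed HTML could confuse the stack.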

Source

2014-10-17 14:36:30 Slpk
