Scraping Yahoo Finance Income Statement with Python

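I'm trying to scrape data from income statements on Yahoo Finance using Python. Specifically, let's say I want the most recent figure of Net Income of Apple.

The data is structured in a bunch of nested HTML tables. I am using the requests module to access the page and retrieve the HTML.

I am using BeautifulSoup 4 to sift through the HTML structure, but I can't figure out how to get the figure.

Here is a screenshot of the analysis with Firefox.

My code so far: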

from bs4 import BeautifulSoup
import requests

myurl = "https://finance.yahoo.com/q/is?s=AAPL&annual"
html = requests.get(myurl).content
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning

I tried using

all_strong = soup.find_all("strong")

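and then taking the 17th element, which happens to be the one containing the figure I want, but this seems far from elegant. Something like this: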

all_strong[16].parent.next_sibling 
...

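Of course, the goal is to use BeautifulSoup to search for the name of the figure I need (in this case "Net Income") and then grab the figures themselves from the same row of the HTML table.

I would really appreciate any ideas on how to solve this, keeping in mind that I would like to apply the solution to retrieve a bunch of other data from other Yahoo Finance pages.

SOLUTION / EXPANSION:

The solution by @wilbur below worked, and I expanded on it so that it can get the values for any figure available on any of the financials pages (i.e. Income Statement, Balance Sheet, Cash Flow Statement) for any listed company. My function is as follows: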

import re
import sys

def periodic_figure_values(soup, yahoo_figure):
    values = []
    pattern = re.compile(yahoo_figure)

    title = soup.find("strong", text=pattern)  # works for the figures printed in bold
    if title:
        row = title.parent.parent
    else:
        title = soup.find("td", text=pattern)  # works for any other available figure
        if title:
            row = title.parent
        else:
            sys.exit("Invalid figure '" + yahoo_figure + "' passed.")

    cells = row.find_all("td")[1:]  # exclude the <td> with figure name
    for cell in cells:
        if cell.text.strip() != yahoo_figure:  # needed because some figures are indented
            str_value = cell.text.strip().replace(",", "").replace("(", "-").replace(")", "")
            if str_value == "-":  # a lone dash means no value was reported
                str_value = "0"
            value = int(str_value) * 1000
            values.append(value)

    return values

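The yahoo_figure variable is a string. Obviously it has to be exactly the same as the figure name used on Yahoo Finance. To pass the soup variable, I first use the following function: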

def financials_soup(ticker_symbol, statement="is", quarterly=False):
    if statement in ("is", "bs", "cf"):
        url = "https://finance.yahoo.com/q/" + statement + "?s=" + ticker_symbol
        if not quarterly:
            url += "&annual"
        return BeautifulSoup(requests.get(url).text, "html.parser")

    # sys.exit() never returns, so there is nothing to return here
    sys.exit("Invalid financial statement code '" + statement + "' passed.")

Example usage - I want to get the income tax expenses of Apple Inc. from the last available income statements:

print(periodic_figure_values(financials_soup("AAPL", "is"), "Income Tax Expense"))

Output: [19121000000, 13973000000, 13118000000]
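The cell clean-up inside periodic_figure_values can also be pulled out into a small helper, which makes it easier to test without hitting the site. This is only a sketch: parse_cell and its multiplier parameter are names I made up, and the default of 1000 just mirrors the "* 1000" in the function above (which suggests the page reports figures in thousands).

```python
def parse_cell(text, multiplier=1000):
    """Convert one table cell such as '19,121,000', '(1,234)' or '-' to an int.

    parse_cell is a hypothetical helper, not part of any library.
    Parentheses are treated as a negative number and a lone dash as zero,
    the same way periodic_figure_values handles them.
    """
    cleaned = text.strip().replace(",", "")
    if cleaned == "-":
        return 0
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    number = int(cleaned.strip("()"))
    return -number * multiplier if negative else number * multiplier


income_tax = parse_cell("19,121,000")
loss = parse_cell("(1,234)")
empty = parse_cell("-")
```

Keeping the multiplier as a parameter is a design choice for pages that might report in other units; I have not checked whether any do.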

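You could also get the period end dates from the soup and create a dictionary where the dates are the keys and the figures are the values, but that would make this post too long. So far this seems to work for me, but I always appreciate constructive criticism.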
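For what it's worth, the dictionary idea could look roughly like the sketch below. It assumes the period end dates have already been pulled out of the soup as strings, in the same column order as the values; I am not showing that extraction because I don't want to guess at the markup of the date row. figures_by_period is a made-up name and the dates are placeholders for illustration only.

```python
def figures_by_period(period_ends, values):
    """Pair each period end date with the figure from the same column.

    Hypothetical helper: both lists are assumed to be in the same
    left-to-right column order as the table on the page.
    """
    if len(period_ends) != len(values):
        raise ValueError("expected one value per period end date")
    return dict(zip(period_ends, values))


# Placeholder dates, not scraped; the values are the output shown above.
period_ends = ["2015-09-26", "2014-09-27", "2013-09-28"]
values = [19121000000, 13973000000, 13118000000]

by_period = figures_by_period(period_ends, values)
```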

2016-02-16 JohnGalt

This is made a little more difficult because "Net Income" is enclosed in a <strong> tag, so bear with me, but I think this works:

import re, requests 
from bs4 import BeautifulSoup 

url = 'https://finance.yahoo.com/q/is?s=AAPL&annual' 
r = requests.get(url) 
soup = BeautifulSoup(r.text, 'html.parser') 
pattern = re.compile('Net Income') 

title = soup.find('strong', text=pattern) 
row = title.parent.parent # yes, yes, I know it's not the prettiest 
cells = row.find_all('td')[1:] #exclude the <td> with 'Net Income' 

values = [ c.text.strip() for c in cells ]

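values, in this case, will contain the three table cells in the "Net Income" row (and, I might add, can easily be converted to ints - I just liked that they kept the "," as strings)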

In [10]: values 
Out[10]: [u'53,394,000', u'39,510,000', u'37,037,000']
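If you do want integers, stripping the thousands separators first is enough. A minimal sketch, using the strings from the output above as literal input (net_income is just a name I picked):

```python
# The three cell strings as returned above
values = ['53,394,000', '39,510,000', '37,037,000']

# Remove the ',' separators, then convert each string to an int
net_income = [int(v.replace(',', '')) for v in values]
```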

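When I tested it on Alphabet (GOOG), it didn't work, because I believe they don't display an Income Statement (https://finance.yahoo.com/q/is?s=GOOG&annual), but when I checked Facebook (FB), the values were returned correctly (https://finance.yahoo.com/q/is?s=FB&annual).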
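When the label is not on the page, soup.find returns None, and title.parent then raises an AttributeError. A guarded version might look like the sketch below. find_row_values is a hypothetical helper, and the HTML is a made-up fragment that only imitates the layout described here, so the example runs without a network call. It uses string=, which is the newer name for the text= argument in Beautiful Soup 4.4.0 and later.

```python
import re
from bs4 import BeautifulSoup


def find_row_values(soup, label):
    """Return the cell texts in the row with the bold label, or None if absent."""
    title = soup.find("strong", string=re.compile(label))
    if title is None:  # label not present, e.g. no income statement shown
        return None
    row = title.find_parent("tr")
    return [cell.get_text(strip=True) for cell in row.find_all("td")[1:]]


# Made-up fragment for illustration; the real page is more deeply nested.
sample_html = """
<table>
  <tr><td><strong>Net Income</strong></td><td>53,394,000</td><td>39,510,000</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

found = find_row_values(soup, "Net Income")
missing = find_row_values(soup, "Total Revenue")
```

Returning None leaves it to the caller to decide whether a missing figure is an error or just means the statement is not available for that ticker.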

If you wanted to create a more dynamic script, you could use string formatting to build the URL with whatever ticker symbol you want, like this:

ticker_symbol = 'AAPL' # or 'FB' or any other ticker symbol 
url = 'https://finance.yahoo.com/q/is?s={}&annual'.format(ticker_symbol)

2016-02-16 20:23:02 wpercy

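Thanks a lot. Works fine so far. Now I just need to make it more dynamic, not only with regard to the ticker but also to other financial figures for the same ticker, checking for the most recent figures, and so on. But this is a great start. – JohnGalt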
