Python - Web scraping a data table across multiple URLs

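I'm very new to Python, but I really want to learn it. I was playing around with scraping data from a website and feel like I'm close to a solution. The problem is that it only ever returns the first page, even though the URL in the code changes the page number on every iteration.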

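The website I'm using is http://etfdb.com/etf/SPY/#etf-holdings&sort_name=weight&sort_order=desc&page=1, and the specific data table I'm trying to scrape is SPY Holdings (it says "506 holdings" and then lists Apple, Microsoft, etc.).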

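As you'll notice, the data table has a bunch of pages, and the number varies by ticker symbol: SPY has 34 pages, but it won't always be 34. The table shows the first 15 companies, and when you click 2 (to see the next 15 companies), the page= value in the URL goes up by 1.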
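The page arithmetic described above can be sketched on its own. This is a minimal illustration, assuming 15 rows per page as on the SPY table; `pages_needed` is a hypothetical helper name, not something from the site or from any library.

```python
import math


def pages_needed(num_holdings, per_page=15):
    """Return how many table pages are needed to list every holding.

    per_page=15 is an assumption taken from the description of the table.
    """
    return math.ceil(num_holdings / per_page)


# 506 holdings at 15 per page: 33 full pages plus one partial page
print(pages_needed(506))  # 34
```

Rounding up matters because the last page is usually only partly filled.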

#to break up html 
from bs4 import BeautifulSoup as soup 
from urllib.request import urlopen as uReq 
import csv 
import math 

#goes to url - determines the number of holdings and the number of pages the data table will need to loop through 
my_url = "http://etfdb.com/etf/SPY/#etf-holdings&sort_name=weight&sort_order=desc&page=1"
uClient = uReq(my_url) 
page_html = uClient.read() 
uClient.close() 
page_soup = soup(page_html,"html.parser") 
#scrapes the holdings count from another section of the page (506 for SPY)
num_holdings_text = page_soup.find('span', {'class': 'relative-metric-bubble-data'})
num_holdings = int(num_holdings_text.text)
#because the table shows 15 holdings at a time, this calcs number of pages I'll need to loop through
num_of_loops = math.ceil(num_holdings / 15)
holdings = [] 
for loop in range(1,num_of_loops+1): 
    my_url = "http://etfdb.com/etf/SPY/#etf-holdings&sort_name=weight&sort_order=desc&page=" + str(loop) 
    uClient = uReq(my_url) 
    page_html = uClient.read() 
    uClient.close() 
    page_soup = soup(page_html, "html.parser") 
    table = page_soup.find('table', { 
    'class': 'table mm-mobile-table table-module2 table-default table-striped table-hover table-pagination'}) 
    table_body = table.find('tbody') 
    table_rows = table_body.find_all('tr') 
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text.strip() for i in td]
        holdings.append(row)
        print(row)
print(holdings)

#write all collected rows to a csv file once the loop has finished
with open('etfdatapull2.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter=',')
    a.writerows(holdings)

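Again, the problem I'm running into is that it just keeps returning the first page (for example, it always returns only Apple through GE), even though the link is updated.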
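One check that may help narrow this down: everything after `#` in a URL is a fragment, and HTTP clients do not send the fragment to the server. The sketch below splits the URLs built in the loop with the standard library's `urlsplit`. `request_target` is a hypothetical helper name, and whether this fully explains the site's behavior is an assumption, since the table may be filled in by JavaScript in the browser.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return the path (plus query, if any) that is sent to the server.

    The fragment (the part after '#') is deliberately left out, because
    it stays on the client side.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


base = "http://etfdb.com/etf/SPY/#etf-holdings&sort_name=weight&sort_order=desc&page="
for page in (1, 2, 3):
    url = base + str(page)
    # The fragment changes with each page, but the request target does not
    print(page, request_target(url), urlsplit(url).fragment)
```

If the request target is identical on every iteration, the server has no way to tell which page was asked for, which would be consistent with always getting the first 15 rows back.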

Thanks a lot for your help. Again, I'm very new to this, so please dumb it down as much as possible!

Source

2017-08-30 Steve Butler