2017-01-12

I have a table from which I want to collect all the links, follow each link, and scrape the items inside the td class="horse" elements on each page. In short: scrape the table links, visit each link, and scrape the data there.

The table on the home page that holds all the links has the following markup:

<table border="0" cellspacing="0" cellpadding="0" class="full-calendar"> 
    <tr> 
     <th width="160">&nbsp;</th> 
     <th width="105"><a href="/FreeFields/Calendar.aspx?State=NSW">NSW</a></th> 
     <th width="105"><a href="/FreeFields/Calendar.aspx?State=VIC">VIC</a></th> 
     <th width="105"><a href="/FreeFields/Calendar.aspx?State=QLD">QLD</a></th> 
     <th width="105"><a href="/FreeFields/Calendar.aspx?State=WA">WA</a></th> 
     <th width="105"><a href="/FreeFields/Calendar.aspx?State=SA">SA</a></th> 
     <th width="105"><a href="/FreeFields/Calendar.aspx?State=TAS">TAS</a></th> 
     <th width="105"><a href="/FreeFields/Calendar.aspx?State=ACT">ACT</a></th> 
     <th width="105"><a href="/FreeFields/Calendar.aspx?State=NT">NT</a></th> 
    </tr> 


    <tr class="rows"> 
     <td> 
      <p><span>FRIDAY 13 JAN</span></p> 
     </td> 

       <td> 
        <p> 

          <a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Ballina">Ballina</a><br> 

          <a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Gosford">Gosford</a><br> 

        </p> 
       </td> 

       <td> 
        <p> 

          <a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Ararat">Ararat</a><br> 

          <a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Cranbourne">Cranbourne</a><br> 

        </p> 
       </td> 

       <td> 
        <p> 

          <a href="/FreeFields/Form.aspx?Key=2017Jan13,QLD,Doomben">Doomben</a><br> 

        </p> 
       </td> 

The code I currently have finds the table and prints the link for each page:

from selenium import webdriver
import requests
from bs4 import BeautifulSoup

# path to chromedriver
path_to_chromedriver = '/Users/Kirsty/Downloads/chromedriver'

# ensure the browser is set to Chrome
browser = webdriver.Chrome(executable_path=path_to_chromedriver)

# fetch the Racing Australia home page
url = 'http://www.racingaustralia.horse/'
r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

# find the table & print the link for each page
table = soup.find('table', attrs={"class": "full-calendar"}).find_all('a')
for link in table:
    print(link.get('href'))

I'd like to know if anyone can help me get the code to go through all the links in the table and run the following on each page:

g_data = soup.find_all("td", {"class": "horse"})
for item in g_data:
    print(item.text)

Thanks in advance.


What do you mean by "click the links"? That is, go to each linked page and then scrape all the links there? – Signal


Yes, so the table is made up of data like the following, for example:

FRIDAY 13 JAN

Ballina
Gosford

Ararat
Cranbourne

– Kirsty


@KirstyDent please put any relevant data, like the HTML in your comment above, into the question itself so that future readers can find it more easily. – JeffC

Answer

import requests, bs4, re
from urllib.parse import urljoin

start_url = 'http://www.racingaustralia.horse/'

def make_soup(url):
    # fetch a page and parse it
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_links(url):
    # find every link that points under /FreeFields/
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/FreeFields/"))
    links = [urljoin(start_url, a['href']) for a in a_tags]  # convert relative url to absolute url
    return links

def get_tds(link):
    # print the text of every td with class "horse" on the linked page
    soup = make_soup(link)
    tds = soup.find_all('td', class_="horse")
    if not tds:
        print(link, 'no "horse" td found')
    else:
        for td in tds:
            print(td.text)

if __name__ == '__main__':
    links = get_links(start_url)
    for link in links:
        get_tds(link)
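
Note: the href values in the table are relative paths (they all start with /FreeFields/), so urljoin is used to turn them into absolute URLs before requesting them, and the ^/FreeFields/ regex keeps the crawl limited to those calendar and form pages.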

Output:

http://www.racingaustralia.horse/FreeFields/GroupAndListedRaces.aspx no "horse" td found
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=NSW no "horse" td found
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=VIC no "horse" td found
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=QLD no "horse" td found
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=WA no "horse" td found
.......

WEARETHECHAMPIONS 
STORMY HORIZON 
OUR RED JET 
SAPPER TOM 
MY COUSIN BOB 
ALL TOO HOT 
SAGA DEL MAR 
ZIGZOFF 
SASHAY AWAY 
SO SHE IS 
MILADY DUCHESS 

BS4 + requests should cover what you need.
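
If you would rather drive a real browser, as in your original script, here is a minimal Selenium sketch of the same loop (untested; it reuses the chromedriver path from the question and collects the href values up front, because element references go stale once the browser navigates away, so clicking each link in place would fail):

from selenium import webdriver

# assumes chromedriver at the path used in the question
browser = webdriver.Chrome(executable_path='/Users/Kirsty/Downloads/chromedriver')
browser.get('http://www.racingaustralia.horse/')

# grab absolute urls first; get_attribute('href') resolves relative paths
links = [a.get_attribute('href')
         for a in browser.find_elements_by_css_selector('table.full-calendar a')]

for link in links:
    browser.get(link)
    for td in browser.find_elements_by_css_selector('td.horse'):
        print(td.text)

browser.quit()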


Thank you so much! Will try this now :) – Kirsty