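Web scraping with BeautifulSoup

In my previous post, I wanted to scrape some horse racing data from HKJC. Thanks to Dmitriy Fialkovskiy's help, I got it working by slightly modifying the code he provided. However, when I tried to understand the logic behind it, there was one line I could not explain: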

from bs4 import BeautifulSoup as BS
import requests
import pandas as pd

url_list = [
    'http://www.hkjc.com/english/racing/horse.asp?HorseNo=S217',
    'http://www.hkjc.com/english/racing/horse.asp?HorseNo=A093',
    'http://www.hkjc.com/english/racing/horse.asp?HorseNo=V344',
    'http://www.hkjc.com/english/racing/horse.asp?HorseNo=V077',
    'http://www.hkjc.com/english/racing/horse.asp?HorseNo=P361',
    'http://www.hkjc.com/english/racing/horse.asp?HorseNo=T103',
]

res = []  # placing res outside of loop
for link in url_list:
    r = requests.get(link)
    r.encoding = 'utf-8'

    html_content = r.text
    soup = BS(html_content, 'lxml')

    table = soup.find('table', class_='bigborder')
    if not table:
        continue

    trs = table.find_all('tr')
    if not trs:
        continue  # if trs are not found, then starting next iteration with other link

    headers = trs[0]
    headers_list = []
    for td in headers.find_all('td'):
        headers_list.append(td.text)
    headers_list += ['Season']
    headers_list.insert(19, 'pseudocol1')
    headers_list.insert(20, 'pseudocol2')
    headers_list.insert(21, 'pseudocol3')

    row = []
    season = ''
    for tr in trs[1:]:
        if 'Season' in tr.text:
            season = tr.text
        else:
            tds = tr.find_all('td')
            for td in tds:
                row.append(td.text.strip('\n').strip('\r').strip('\t').strip('"').strip())
            row.append(season.strip())
            res.append(row)
            row = []

res = [i for i in res if i[0] != '']  # outside of loop

df = pd.DataFrame(res, columns=headers_list)  # outside of loop
del df['pseudocol1'], df['pseudocol2'], df['pseudocol3']
del df['VideoReplay']

I don't understand the purpose of the repeated row = [] at the end of the else branch, or why it works. Thanks.

Source

2017-07-06 JAY.Y

As a fun exercise, replace row = [] with row.clear() and watch the magic.

res becomes [[], [], [], ...]. What does that mean?

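The row = [] inside the loop resets the list so that it is empty again. Because the list is created once before the for loop, it would otherwise carry over the elements appended during earlier iterations. Assigning row = [] gives you an empty list again for the next table row.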
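
To make the carry-over concrete, here is a minimal sketch that runs the same loop shape with and without the reset. The table_rows data and the collect helper are hypothetical stand-ins for the scraped tr/td elements, not part of the original code.

```python
# Hypothetical stand-in for the scraped table: each inner list plays
# the role of the cells (<td>) of one table row (<tr>).
table_rows = [['A', 'B'], ['C', 'D'], ['E', 'F']]


def collect(reset):
    """Mimic the question's inner loop, optionally resetting row."""
    res = []
    row = []
    for cells in table_rows:
        for cell in cells:
            row.append(cell)
        res.append(row)  # stores a reference to the list, not a copy
        if reset:
            row = []     # bind the name to a brand-new empty list
    return res


with_reset = collect(reset=True)
without_reset = collect(reset=False)

print(with_reset)     # [['A', 'B'], ['C', 'D'], ['E', 'F']]
print(without_reset)  # the same six-element list, three times
```

Without the reset, every entry of res refers to one and the same list, so each entry ends up showing all the cells appended so far.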

Source

2017-07-06 13:20:15

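It should be added that you are assigning a new empty list to the name, not just clearing the existing one. You could clear it with del row[:], but then your res would be affected too. – corn3lius

Do you mean that without row = [], the result would be res = [['A'], ['A', 'B'], ['A', 'B', 'C']] instead of [['A', 'B', 'C']]?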
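
The difference between rebinding and clearing in place can be checked with a small sketch like the one below. The variable names res_rebind and res_clear are made up for the illustration. It also shows why row.clear() leaves res full of empty lists.

```python
# Rebinding: the name row moves to a new list; the list already
# stored in res_rebind is left untouched.
res_rebind = []
row = ['A', 'B']
res_rebind.append(row)
row = []
print(res_rebind)            # [['A', 'B']]
print(res_rebind[0] is row)  # False

# Clearing in place: row and res_clear[0] are the same object,
# so emptying one empties the other.
res_clear = []
row = ['A', 'B']
res_clear.append(row)
del row[:]                   # row.clear() has the same effect
print(res_clear)             # [[]]
print(res_clear[0] is row)   # True
```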

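The way I see it, if you don't reset row, every res.append(row) just above it stores the earlier results again, and the stored rows keep growing with each iteration.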
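
One way to avoid that repeated storage altogether is to create the row inside the loop, so each iteration starts with its own list and no reset is needed. This is only a sketch of that alternative, not the original author's code; table_rows and season are hypothetical sample values standing in for the scraped cells and the season heading.

```python
# Hypothetical sample data standing in for the text of scraped cells.
table_rows = [[' A\n', 'B\t'], ['C ', '\r\nD']]
season = '16/17 Season'  # hypothetical season label

res = []
for cells in table_rows:
    # A fresh list is built on every iteration, so nothing carries over.
    row = [cell.strip() for cell in cells]
    row.append(season)
    res.append(row)

print(res)  # [['A', 'B', '16/17 Season'], ['C', 'D', '16/17 Season']]
```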

Source

2017-07-06 13:22:05 Fabien
