I am importing boxscore links from the web page below with BeautifulSoup. How can I automate this import?

http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/wnba/teams/pastresults/2012/team665231.html 

This is how I am doing it now. I get the links from the first page:

import re
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/wnba/teams/pastresults/2012/team665231.html'

boxurl = urllib2.urlopen(url).read()
soup = BeautifulSoup(boxurl)

boxscores = soup.findAll('a', href=re.compile('boxscore'))
basepath = "http://www.covers.com"
pages = []   # this grabs the linked boxscore pages
for a in boxscores:
    pages.append(urllib2.urlopen(basepath + a['href']).read())

Then, in a new window, I do this:

import pandas as pd   # needed for the DataFrame below

newsoup = pages[1]    # I am manually changing this index every time

soup = BeautifulSoup(newsoup)
def _unpack(row, kind='td'):
    return [val.text for val in row.findAll(kind)]

tables = soup('table')
linescore = tables[1]
linescore_rows = linescore.findAll('tr')
road_cells = _unpack(linescore_rows[1])   # road team's line score cells, parsed once
home_cells = _unpack(linescore_rows[2])   # home team's line score cells, parsed once
roadteamQ1 = float(road_cells[1])
roadteamQ2 = float(road_cells[2])
roadteamQ3 = float(road_cells[3])
roadteamQ4 = float(road_cells[4])   # add OT columns here if a game went to overtime
roadteamFinal = float(road_cells[-3])
hometeamQ1 = float(home_cells[1])
hometeamQ2 = float(home_cells[2])
hometeamQ3 = float(home_cells[3])
hometeamQ4 = float(home_cells[4])   # add OT columns here if a game went to overtime
hometeamFinal = float(home_cells[-3])

misc_stats = tables[5]
misc_stats_rows = misc_stats.findAll('tr')
roadteam = str(_unpack(misc_stats_rows[0])[0]).strip()
hometeam = str(_unpack(misc_stats_rows[0])[1]).strip()
datefinder = tables[6]
datefinder_rows = datefinder.findAll('tr')

date = str(_unpack(datefinder_rows[0])[0]).strip()
year = 2012
from dateutil.parser import parse
parsedDate = parse(date).replace(year=year)   # the scraped date string has no year, so pin it to 2012
month = parsedDate.month
day = parsedDate.day
modDate = str(day) + str(month) + str(year)
gameid = modDate + roadteam + hometeam 

data = {'roadteam': [roadteam],
        'hometeam': [hometeam],
        'roadQ1': [roadteamQ1],
        'roadQ2': [roadteamQ2],
        'roadQ3': [roadteamQ3],
        'roadQ4': [roadteamQ4],
        'homeQ1': [hometeamQ1],
        'homeQ2': [hometeamQ2],
        'homeQ3': [hometeamQ3],
        'homeQ4': [hometeamQ4]}

globals()[gameid] = pd.DataFrame(data)   # creates a variable named after the game id
df = pd.DataFrame.load('df')             # load the accumulated frame (old pandas pickle API)
df = pd.concat([df, globals()[gameid]])
df.save('df')
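
(As an aside, DataFrame.save and DataFrame.load were the pickle shortcuts of that era of pandas and have since been removed; in current pandas the equivalents are to_pickle and read_pickle:)

df.to_pickle('df.pkl')          # replaces df.save('df')
df = pd.read_pickle('df.pkl')   # replaces pd.DataFrame.load('df')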

How can I automate this so that I don't have to change newsoup = pages[1] by hand, and can scrape the boxscores from all the links on the first URL in one go? I am very new to Python and lack some understanding of the basics.

Why do you have to change it manually? So something like pages[2], pages[3], ...? –

I only know how to import them one at a time. – user2333196

Answer

So in the first code block you collect pages.

Then you have to loop over them in the second piece of code, if I understand correctly:

for page in pages: 
    soup = BeautifulSoup(page) 
    # rest of the code here 
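
Putting it together: below is a minimal sketch of that loop, reusing pages, _unpack, pd and BeautifulSoup from the question (the parse_boxscore name is just illustrative, and the table indices are taken on faith from the question). Collecting one plain dict per game and building the DataFrame once at the end also avoids the globals() trick and the load/concat/save round trip on every game:

def parse_boxscore(page):
    # wraps the parsing code from the question for a single boxscore page
    soup = BeautifulSoup(page)
    tables = soup('table')
    linescore_rows = tables[1].findAll('tr')
    road_cells = _unpack(linescore_rows[1])
    home_cells = _unpack(linescore_rows[2])
    misc_stats_rows = tables[5].findAll('tr')
    return {'roadteam': str(_unpack(misc_stats_rows[0])[0]).strip(),
            'hometeam': str(_unpack(misc_stats_rows[0])[1]).strip(),
            'roadQ1': float(road_cells[1]),
            'roadQ2': float(road_cells[2]),
            'roadQ3': float(road_cells[3]),
            'roadQ4': float(road_cells[4]),
            'homeQ1': float(home_cells[1]),
            'homeQ2': float(home_cells[2]),
            'homeQ3': float(home_cells[3]),
            'homeQ4': float(home_cells[4])}

rows = [parse_boxscore(page) for page in pages]   # one dict per boxscore
df = pd.DataFrame(rows)                           # build the frame once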
I will try that. Do I need to pause? If so, how do I do it? – user2333196

Pause? I don't see why you would need to. But if you want to, you can use 'raw_input('some prompt:')' so that it waits until you hit Enter –
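
If the concern behind "pause" is requesting pages too quickly rather than waiting for a keypress, the usual approach is a short time.sleep between requests. A minimal sketch against the download loop from the question (the one-second delay is an arbitrary choice):

import time

for a in boxscores:
    pages.append(urllib2.urlopen(basepath + a['href']).read())
    time.sleep(1)   # wait a second between requests to go easy on the server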