Parsing an HTML table with BS4

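I've been trying different ways to scrape data from this site (http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=WR&college=) and can't seem to get any of them to work. I've tried playing with the indices, but can't get that to work either. I think I've tried too many things at this point, so if someone could point me in the right direction I'd be very grateful.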

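I want to pull all of the information and export it to a .csv file, but for now I'm just trying to get the name and position to print so I can get started.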

Here is my code:

import urllib2 
from bs4 import BeautifulSoup 
import re 

url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=') 

page = urllib2.urlopen(url).read() 

soup = BeautifulSoup(page) 
table = soup.find('table') 

for row in table.findAll('tr')[0:]: 
    col = row.findAll('tr') 
    name = col[1].string 
    position = col[3].string 
    player = (name, position) 
    print "|".join(player)

Here is the error I'm getting: line 14, in name = col[1].string IndexError: list index out of range.

--UPDATE--

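OK, I've made a little progress. It now lets me go from start to finish, but it needs to know how many rows are in the table. How can I make it run all the way through to the end? Updated code: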

import urllib2 
from bs4 import BeautifulSoup 
import re 

url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=') 

page = urllib2.urlopen(url).read() 

soup = BeautifulSoup(page) 
table = soup.find('table') 


for row in table.findAll('tr')[1:250]: 
    col = row.findAll('td') 
    name = col[1].getText() 
    position = col[3].getText() 
    player = (name, position) 
    print "|".join(player)


2014-02-27 ISuckAtLife

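I've only been at this for about 8 hours. Learning is fun. Thanks for the help, Kevin! The script now includes code to output the scraped data to a csv file. Next up is taking that data and filtering out certain positions....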

Here is my code:

import urllib2 
from bs4 import BeautifulSoup 
import csv 

url = ('http://nflcombineresults.com/nflcombinedata.php?year=2000&pos=&college=') 

page = urllib2.urlopen(url).read() 

soup = BeautifulSoup(page) 
table = soup.find('table') 

f = csv.writer(open("2000scrape.csv", "w")) 
f.writerow(["Name", "Position", "Height", "Weight", "40-yd", "Bench", "Vertical", "Broad", "Shuttle", "3-Cone"]) 
# variable to check length of rows 
x = (len(table.findAll('tr')) - 1) 
# set to run through x 
for row in table.findAll('tr')[1:x]: 
    col = row.findAll('td') 
    name = col[1].getText() 
    position = col[3].getText() 
    height = col[4].getText() 
    weight = col[5].getText() 
    forty = col[7].getText() 
    bench = col[8].getText() 
    vertical = col[9].getText() 
    broad = col[10].getText() 
    shuttle = col[11].getText() 
    threecone = col[12].getText() 
    player = (name, position, height, weight, forty, bench, vertical, broad, shuttle, threecone,) 
    f.writerow(player)
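
For readers on Python 3, the same idea can be sketched roughly as follows. This is only an illustration, not the poster's script: the helper name `table_to_csv`, the made-up 13-cell stand-in table, and the choice to skip rows by cell count (instead of computing `x` and slicing) are all assumptions. The column indices are the ones used in the post above.

```python
import csv
import io

from bs4 import BeautifulSoup

HEADER = ["Name", "Position", "Height", "Weight", "40-yd",
          "Bench", "Vertical", "Broad", "Shuttle", "3-Cone"]
# Cell indices taken from the post above.
COLUMNS = [1, 3, 4, 5, 7, 8, 9, 10, 11, 12]


def table_to_csv(table, out):
    """Write the selected cells of every full-width row to `out` (hypothetical helper)."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    count = 0
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # Skip header rows and any row too short to index safely.
        if len(cells) <= max(COLUMNS):
            continue
        writer.writerow([cells[i].get_text(strip=True) for i in COLUMNS])
        count += 1
    return count


# Made-up stand-in for the real page: one header row and one data row
# whose 13 cells contain "c0" .. "c12".
cells_html = "".join("<td>c%d</td>" % i for i in range(13))
html = "<table><tr><th>header</th></tr><tr>" + cells_html + "</tr></table>"
table = BeautifulSoup(html, "html.parser").find("table")

buffer = io.StringIO()  # in-memory file so the sketch runs without touching disk
written = table_to_csv(table, buffer)
lines = buffer.getvalue().splitlines()
print(lines)
```

To write to disk instead, the same helper could be called inside `with open("2000scrape.csv", "w", newline="") as fh:`, which also closes the file when the block ends.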


2014-02-28 12:50:10 ISuckAtLife

I can't run the script because of firewall permissions, but I believe the problem is on this line:

col = row.findAll('tr')

row is a tr tag, and you are asking BeautifulSoup to find all of the tr tags inside that tr tag. You probably meant to do:

col = row.findAll('td')

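Also, since the actual text isn't directly inside the tds but is tucked away inside nested divs and as, it may be more useful to use the getText method instead of .string: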

name = col[1].getText() 
position = col[3].getText()
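
Both points can be checked offline with a small sketch. The HTML snippet below is a made-up stand-in for one row of the real table (its exact markup is an assumption), and `get_text` / `find_all` are the BS4 spellings of `getText` / `findAll`:

```python
from bs4 import BeautifulSoup

# Hypothetical row: the name cell wraps its text in a nested <div> and <a>.
html = """
<table>
  <tr><td>1999</td><td><div>Adrian <a href="#">Dingle</a></div></td><td>Clemson</td><td>DE</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")

# A <tr> has no <tr> descendants here, so this list is empty and
# indexing it with [1] would raise IndexError.
cells_wrong = row.find_all("tr")

# Asking for <td> gives the actual cells.
cells = row.find_all("td")

name_cell = cells[1]
name_string = name_cell.string    # None: the nested <div> has more than one child
name_text = name_cell.get_text()  # concatenates all nested text
position = cells[3].get_text()

print(len(cells_wrong), len(cells), name_string, name_text, position)
```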


2014-02-27 19:52:04 Kevin

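Ah, that makes sense. Thanks! OK, I made the changes you suggested, and it's definitely making progress in that it prints most of the results on the page. It starts at Adrian Dingle, which isn't the first name in the column, and prints the full list with the position after the |. Then it returns this error: File "nfltest.py", line 14, in name = col[1].getText() IndexError: list index out of range. Again, I've tried playing with the indices and can't seem to get rid of the error. Is it just me, or is this table oddly formatted? – ISuckAtLife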
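
One possible cause of that remaining IndexError, which the thread does not confirm, is that some rows contain fewer td cells than the indices expect (for example a header row built from th cells, or a spacer row). A minimal sketch of guarding against that is below. The table contents are made up, and `extract_players` is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Made-up table: a <th> header row, a one-cell spacer row, and two data rows.
html = """
<table>
  <tr><th>Year</th><th>Name</th><th>College</th><th>POS</th></tr>
  <tr><td>1999</td><td>Adrian Dingle</td><td>Clemson</td><td>DE</td></tr>
  <tr><td colspan="4">spacer</td></tr>
  <tr><td>1999</td><td>Example Player</td><td>Example College</td><td>WR</td></tr>
</table>
"""


def extract_players(table, name_idx=1, pos_idx=3):
    """Return (name, position) tuples, skipping rows that are too short."""
    players = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # Rows without enough <td> cells would raise IndexError, so skip them.
        if len(cells) <= max(name_idx, pos_idx):
            continue
        players.append((cells[name_idx].get_text(strip=True),
                        cells[pos_idx].get_text(strip=True)))
    return players


soup = BeautifulSoup(html, "html.parser")
players = extract_players(soup.find("table"))
for player in players:
    print("|".join(player))
```

Because every row is checked before it is indexed, the loop needs neither a hard-coded row count nor a slice.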
