2010-01-13 185 views
17

BeautifulSoup HTML table parsing

I am trying to parse information (an HTML table) from this site: http://www.511virginia.org/RoadConditions.aspx?j=All&r=1

I am currently using BeautifulSoup, and my code looks like this:

from mechanize import Browser 
from BeautifulSoup import BeautifulSoup 

mech = Browser() 

url = "http://www.511virginia.org/RoadConditions.aspx?j=All&r=1" 
page = mech.open(url) 

html = page.read() 
soup = BeautifulSoup(html) 

table = soup.find("table") 

rows = table.findAll('tr')[3] 

cols = rows.findAll('td') 

roadtype = cols[0].string 
start = cols[1].string 
end = cols[2].string 
condition = cols[3].string 
reason = cols[4].string 
update = cols[5].string 

entry = (roadtype, start, end, condition, reason, update) 

print entry 

The problem is with the start and end columns: they just print as None.

Output:

(u'Rt. 613N (Giles County)', None, None, u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM') 

I know the values end up in the cols list, but the extra link tags seem to be getting in the way. The original HTML looks like this:

<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td> 
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td> 
<td headers="end" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)</a></td> 
<td headers="condition" class="ConditionsCellText">Moderate</td> 
<td headers="reason" class="ConditionsCellText">snow or ice</td> 
<td headers="update" class="ConditionsCellText">01/13/2010 10:50 AM</td> 

So what should be printed is:

(u'Rt. 613N (Giles County)', u'Big Stony Ck Rd; Rt. 635E/W (Giles County)', u'Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)', u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM') 

Any suggestions are appreciated. Thanks in advance for your help.

Thank you very much –

You don't have to use Beautiful Soup for this. You can use the Python 3 HTML parser: https://github.com/schmijos/html-table-parser-python3/blob/master/html_table_parser/parser.py – schmijos
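
For readers who want to try the standard-library route mentioned in this comment, here is a minimal sketch built on `html.parser.HTMLParser`. It is not the code from the linked repository; the class name `TableTextParser` and the shortened `SAMPLE` markup (three cells taken from the question's row) are made up for illustration. The idea is that text inside nested tags such as `<a>` still reaches `handle_data`, so the link wrapper does not hide the cell text.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Hypothetical helper: collect the text of each <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> also arrives here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Shortened copy of the row shown in the question (an assumption for the demo).
SAMPLE = """
<table>
<tr>
<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="condition" class="ConditionsCellText">Moderate</td>
</tr>
</table>
"""

parser = TableTextParser()
parser.feed(SAMPLE)
parser.close()
print(parser.rows[0])
```

This sketch ignores nested tables and `colspan`/`rowspan`, so it only suits simple, flat tables like the one in the question.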

Answers

32
start = cols[1].find('a').string 

or simply

start = cols[1].a.string 

or better

start = str(cols[1].find(text=True)) 

# cols is a list of <td> tags, so take the first text node of each cell 
entry = [str(col.find(text=True)) for col in cols] 
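
For anyone on Python 3, the three options above can be sketched with BeautifulSoup 4 as follows. This assumes the `beautifulsoup4` package is installed (version 4.4 or later, where the `text=` argument is also available under the name `string=`). The `html` snippet is a shortened copy of the row from the question, and the names `via_find`, `via_attr` and `via_text` are made up for the demo.

```python
from bs4 import BeautifulSoup  # assumes beautifulsoup4 >= 4.4 is installed

# Shortened copy of the row from the question (two cells only).
html = """<table><tr>
<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&amp;long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")
cols = soup.find("table").find("tr").find_all("td")

via_find = cols[1].find("a").string          # explicit lookup of the <a> child
via_attr = cols[1].a.string                  # attribute shortcut for the same tag
via_text = str(cols[1].find(string=True))    # first text node at any depth

# The last form works whether or not a cell wraps its text in a link.
entry = [str(col.find(string=True)) for col in cols]
print(entry)
```

The first two forms raise an `AttributeError` on cells that have no `<a>` child, which is one reason to prefer the text-node lookup when cells are mixed.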

I went with the str(cols ...) approach. Thanks. –

You're welcome :) If you found an answer helpful, it would be good to accept it –

I agree, @Stephon Tanner should come back and accept this answer – Neil

2

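I tried to reproduce your error, but the source HTML page has changed.

I did run into a similar problem when trying to reproduce the example here, after changing the suggested URL to a Wikipedia table.

I fixed it by moving to BeautifulSoup4: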

from bs4 import BeautifulSoup 

and changing .string to .get_text():

start = cols[1].get_text() 

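I could not test this with your example (as I said, I could not reproduce the error), but I think it may be useful for people looking for a solution to this problem.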
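
To illustrate the difference between `.string` and `.get_text()` in BeautifulSoup 4, here is a small sketch. It assumes `beautifulsoup4` is installed, and the cell markup is hypothetical (it is not taken from the 511virginia page). In BeautifulSoup 4, `.string` follows a single child tag down to its text, but it returns None as soon as a tag has more than one child, whereas `.get_text()` joins all of the text inside the tag.

```python
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

# Hypothetical cell: a link followed by loose text, so the <td> has two children.
mixed = BeautifulSoup(
    '<td><a href="#">Main St</a> (northbound)</td>', "html.parser"
).td

# Hypothetical cell whose only child is a link, as in the question's HTML.
linked = BeautifulSoup('<td><a href="#">Main St</a></td>', "html.parser").td

mixed_string = mixed.string      # None, because the cell has two children
mixed_text = mixed.get_text()    # all descendant text joined together
linked_string = linked.string    # bs4 descends into the single <a> child

print(mixed_string, repr(mixed_text), repr(linked_string))
```

Passing `strip=True` to `get_text()` also trims surrounding whitespace, which tends to help with table cells that are indented in the source HTML.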