Python BeautifulSoup: looping through table rows
I'm new to BeautifulSoup and Python. I'm sure this is a simple problem, but I can't seem to solve it.
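I want to loop through the rows of an HTML table, grouped by candy type according to the "heading" rows, and get the data under each candy heading. The iterations would yield data like this:
First iteration: candy_type: KitKat, location: Mall 1, planned: 63, actual: 0, diff: 25
Second iteration: candy_type: KitKat, location: Mall 2, planned: 7, actual: 0, diff: 6
... Last iteration: candy_type: Skittles, location: Building 2, planned: 320, actual: 236, diff: 0
Here is the table code: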
<TABLE BORDER="1" WIDTH="100%">
<TR>
<TH COLSPAN=4>Candy</TH>
</TR>
<TR BGCOLOR=#CEE3F6>
<TD COLSPAN=4>
<FONT FACE=Arial>
<center><b>KitKat</b></center>
</FONT>
</TD>
</TR>
<TR BGCOLOR=#336699>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD>
</TR>
<TR>
<TD>Mall 1</TD>
<TD>63</TD>
<TD>0</TD>
<TD>25</TD>
</TR>
<TR>
<TD>Mall 2</TD>
<TD>7</TD>
<TD>0</TD>
<TD>6</TD>
</TR>
<TR BGCOLOR=#CEE3F6>
<TD COLSPAN=4>
<FONT FACE=Arial>
<center><b>OH Henry</b></center>
</FONT>
</TD>
</TR>
<TR BGCOLOR=#336699>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD>
</TR>
<TR>
<TD>Warehouse 1</TD>
<TD>195</TD>
<TD>122</TD>
<TD>30</TD>
</TR>
<TR>
<TD>Warehouse 2</TD>
<TD>96</TD>
<TD>76</TD>
<TD>6</TD>
</TR>
<TR BGCOLOR=#CEE3F6>
<TD COLSPAN=4>
<FONT FACE=Arial>
<center><b>Skittles</b></center>
</FONT>
</TD>
</TR>
<TR BGCOLOR=#336699>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD>
</TR>
<TR>
<TD>Building 1</TD>
<TD>120</TD>
<TD>90</TD>
<TD>5</TD>
</TR>
<TR>
<TD>Building 2</TD>
<TD>320</TD>
<TD>236</TD>
<TD>0</TD>
</TR>
</TABLE>
So I tried:
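from bs4 import BeautifulSoup

# Parse the local HTML file. No parser is named here, so bs4 picks
# the best one installed; the exact tree can differ between parsers.
with open('test.html') as f:
    soup = BeautifulSoup(f)

# Candy-type heading rows are the <tr> elements with bgcolor #CEE3F6
candy_rows = soup.find_all('tr', {'bgcolor': '#CEE3F6'})
for row in candy_rows:
    print(row)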
This prints the three candy-type rows like this:
<tr bgcolor="#CEE3F6">
<td colspan="4">
<font face="Arial">
</font><center><b>KitKat</b></center>
</td>
</tr>
<tr bgcolor="#CEE3F6">
<td colspan="4">
<font face="Arial">
</font><center><b>OH Henry</b></center>
</td>
</tr>
<tr bgcolor="#CEE3F6">
<td colspan="4">
<font face="Arial">
</font><center><b>Skittles</b></center>
</td>
</tr>
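I thought I could group on the candy "heading" rows (that is, the tr elements whose bgcolor is set to #CEE3F6) and iterate from there, but I can't figure out how to get from each heading row to the data rows that follow it.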
Any ideas?
Do you have to use `beautifulsoup`? I would recommend using [`parsel`](https://github.com/scrapy/parsel) – eLRuLL
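
One possible way to do the grouping described in the question is to walk every row once, in document order, and remember the most recent candy heading seen. This is only a sketch, not a definitive solution. It assumes that heading rows always carry bgcolor #CEE3F6, that the repeated column-label rows always carry bgcolor #336699, and that every data row has exactly four cells holding integers in the last three. The names `group_rows`, `CANDY_BG`, `COLUMNS_BG` and `HTML` are made up for the example, and the table is abbreviated to two candy types. The parser is pinned to `html.parser` so the example does not depend on which optional parsers are installed.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the table from the question (two candy types only).
HTML = """
<TABLE BORDER="1" WIDTH="100%">
<TR><TH COLSPAN=4>Candy</TH></TR>
<TR BGCOLOR=#CEE3F6><TD COLSPAN=4><FONT FACE=Arial><center><b>KitKat</b></center></FONT></TD></TR>
<TR BGCOLOR=#336699><TD>LOCATION</TD><TD>PLANNED</TD><TD>ACTUAL</TD><TD>DIFF</TD></TR>
<TR><TD>Mall 1</TD><TD>63</TD><TD>0</TD><TD>25</TD></TR>
<TR><TD>Mall 2</TD><TD>7</TD><TD>0</TD><TD>6</TD></TR>
<TR BGCOLOR=#CEE3F6><TD COLSPAN=4><FONT FACE=Arial><center><b>Skittles</b></center></FONT></TD></TR>
<TR BGCOLOR=#336699><TD>LOCATION</TD><TD>PLANNED</TD><TD>ACTUAL</TD><TD>DIFF</TD></TR>
<TR><TD>Building 1</TD><TD>120</TD><TD>90</TD><TD>5</TD></TR>
<TR><TD>Building 2</TD><TD>320</TD><TD>236</TD><TD>0</TD></TR>
</TABLE>
"""

CANDY_BG = "#CEE3F6"    # assumption: marks rows that name a candy type
COLUMNS_BG = "#336699"  # assumption: marks rows that repeat the column labels


def group_rows(html):
    """Return one dict per data row, tagged with the candy heading above it."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    candy_type = None  # most recent heading seen so far
    for row in soup.find_all("tr"):
        bgcolor = row.get("bgcolor")
        if bgcolor == CANDY_BG:
            # Heading row: remember the candy name for the rows that follow.
            candy_type = row.get_text(strip=True)
            continue
        if bgcolor == COLUMNS_BG or candy_type is None:
            # Column-label row, or a row before the first candy heading.
            continue
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) != 4:
            continue  # not a data row in the expected shape
        location, planned, actual, diff = cells
        records.append({
            "candy_type": candy_type,
            "location": location,
            "planned": int(planned),
            "actual": int(actual),
            "diff": int(diff),
        })
    return records


for record in group_rows(HTML):
    print(record)
```

Each dict corresponds to one of the iterations listed in the question: the first record is KitKat at Mall 1, the last is Skittles at Building 2. Keeping a running `candy_type` avoids having to navigate from each heading row to its siblings, at the cost of relying on the bgcolor values to tell the row kinds apart.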