2017-02-24 178 views

Python BeautifulSoup: looping through table rows

I'm new to BeautifulSoup and Python, and I'm sure this is a simple problem, but I can't seem to solve it.

I want to loop through the rows of an HTML table and group them by candy type, based on the "heading" rows. My table looks like this: [screenshot of the table]

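I want to loop over the table and get the data under each candy heading, so that the iterations yield data like this:

First loop iteration: candy_type: KitKat, location: Mall 1, planned: 63, actual: 0, diff: 25

Second iteration: candy_type: KitKat, location: Mall 2, planned: 7, actual: 0, diff: 6

... Last iteration: candy_type: Skittles, location: Building 2, planned: 320, actual: 236, diff: 0

Here is the table code: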

<TABLE BORDER="1" WIDTH="100%"> 
    <TR> 
     <TH COLSPAN=4>Candy</TH> 
    </TR> 
    <TR BGCOLOR=#CEE3F6> 
     <TD COLSPAN=4> 
     <FONT FACE=Arial> 
      <center><b>KitKat</b></center> 
     </FONT> 
     </TD> 
    </TR> 
    <TR BGCOLOR=#336699> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD> 
    </TR> 
    <TR> 
     <TD>Mall 1</TD> 
     <TD>63</TD> 
     <TD>0</TD> 
     <TD>25</TD> 
    </TR> 
    <TR> 
     <TD>Mall 2</TD> 
     <TD>7</TD> 
     <TD>0</TD> 
     <TD>6</TD> 
    </TR> 
    <TR BGCOLOR=#CEE3F6> 
     <TD COLSPAN=4> 
     <FONT FACE=Arial> 
      <center><b>OH Henry</b></center> 
     </FONT> 
     </TD> 
    </TR> 
    <TR BGCOLOR=#336699> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD> 
    </TR> 
    <TR> 
     <TD>Warehouse 1</TD> 
     <TD>195</TD> 
     <TD>122</TD> 
     <TD>30</TD> 
    </TR> 
    <TR> 
     <TD>Warehouse 2</TD> 
     <TD>96</TD> 
     <TD>76</TD> 
     <TD>6</TD> 
    </TR> 
    <TR BGCOLOR=#CEE3F6> 
     <TD COLSPAN=4> 
     <FONT FACE=Arial> 
      <center><b>Skittles</b></center> 
     </FONT> 
     </TD> 
    </TR> 
    <TR BGCOLOR=#336699> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD> 
     <TD><FONT COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD> 
    </TR> 
    <TR> 
     <TD>Building 1</TD> 
     <TD>120</TD> 
     <TD>90</TD> 
     <TD>5</TD> 
    </TR> 
    <TR> 
     <TD>Building 2</TD> 
     <TD>320</TD> 
     <TD>236</TD> 
     <TD>0</TD> 
    </TR> 
</TABLE> 

So I tried:

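from bs4 import BeautifulSoup

# Read the saved page from disk (Python 3; the original post used Python 2's urllib.urlopen)
with open('test.html') as f:
    # No parser is named here, so bs4 picks the best one that is installed
    soup = BeautifulSoup(f.read())

# The candy heading rows are the <tr> elements with this background color
candytype = soup.find_all('tr', {"bgcolor": "#CEE3F6"})
for row in candytype:
    print(row)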

This prints the three candy types, like this:

<tr bgcolor="#CEE3F6"> 
<td colspan="4"> 
<font face="Arial"> 
</font><center><b>KitKat</b></center> 
</td> 
</tr> 
<tr bgcolor="#CEE3F6"> 
<td colspan="4"> 
<font face="Arial"> 
</font><center><b>OH Henry</b></center> 
</td> 
</tr> 
<tr bgcolor="#CEE3F6"> 
<td colspan="4"> 
<font face="Arial"> 
</font><center><b>Skittles</b></center> 
</td> 
</tr> 

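I thought I could group by the candy "headings" (i.e. the tr elements whose bgcolor is set to #CEE3F6) and then iterate on that basis, but I can't figure out how to get from each heading to the data rows beneath it.

Any ideas?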

Do you have to use `beautifulsoup`? I would recommend [`parsel`](https://github.com/scrapy/parsel). – eLRuLL

Answers


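Find all the rows, then loop through them. When you find a row that contains a candy name (identified by the row's color), keep the name. Then walk through that row's following siblings. Skip the first one, which is the column-header row, and capture the text of the td elements in the rows after it. When you come across a row that holds a different candy's name (again identified by the row's color), you know you have gone past the last sibling that belongs to the current candy.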

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('justTable.htm').read(), 'lxml')
>>> trs = soup.find_all('tr')
>>> for tr in trs:
...     # A heading row carries the candy name and has this background color
...     if tr.get('bgcolor') == '#CEE3F6':
...         candy = tr.text.strip()
...         first = True
...         for sibs in tr.find_next_siblings():
...             if first:
...                 # The first sibling is the LOCATION/PLANNED/ACTUAL/DIFF row
...                 first = False
...                 continue
...             if sibs.get('bgcolor') == '#CEE3F6':
...                 # Next candy heading reached, so this group is finished
...                 break
...             [candy] + sibs.text.strip().split('\n')
... 
['KitKat', 'Mall 1', '63', '0', '25'] 
['KitKat', 'Mall 2', '7', '0', '6'] 
['OH Henry', 'Warehouse 1', '195', '122', '30'] 
['OH Henry', 'Warehouse 2', '96', '76', '6'] 
['Skittles', 'Building 1', '120', '90', '5'] 
['Skittles', 'Building 2', '320', '236', '0']
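
If you want records shaped like the ones in the question (candy_type, location, planned, actual, diff) instead of lists echoed by the interpreter, the same sibling walk can be wrapped in a generator. This is only a sketch under a few assumptions: the function name `iter_candy_rows` and the constant `HEADING_COLOR` are made up for illustration, the embedded HTML is a trimmed copy of the question's table, every data row is assumed to have exactly four cells, and the three numeric columns are assumed to hold integers. It uses the standard-library `html.parser` backend so that lxml is not required.

```python
from bs4 import BeautifulSoup

# bgcolor that marks a candy-name row in the question's table
HEADING_COLOR = "#CEE3F6"

# Trimmed copy of the question's table, kept short for illustration
HTML = """
<TABLE BORDER="1" WIDTH="100%">
  <TR><TH COLSPAN=4>Candy</TH></TR>
  <TR BGCOLOR=#CEE3F6><TD COLSPAN=4><FONT FACE=Arial><center><b>KitKat</b></center></FONT></TD></TR>
  <TR BGCOLOR=#336699><TD>LOCATION</TD><TD>PLANNED</TD><TD>ACTUAL</TD><TD>DIFF</TD></TR>
  <TR><TD>Mall 1</TD><TD>63</TD><TD>0</TD><TD>25</TD></TR>
  <TR><TD>Mall 2</TD><TD>7</TD><TD>0</TD><TD>6</TD></TR>
  <TR BGCOLOR=#CEE3F6><TD COLSPAN=4><FONT FACE=Arial><center><b>Skittles</b></center></FONT></TD></TR>
  <TR BGCOLOR=#336699><TD>LOCATION</TD><TD>PLANNED</TD><TD>ACTUAL</TD><TD>DIFF</TD></TR>
  <TR><TD>Building 2</TD><TD>320</TD><TD>236</TD><TD>0</TD></TR>
</TABLE>
"""


def iter_candy_rows(html):
    """Yield one dict per data row, tagged with the candy heading above it."""
    soup = BeautifulSoup(html, "html.parser")
    for heading in soup.find_all("tr", bgcolor=HEADING_COLOR):
        candy = heading.get_text(strip=True)
        # Rows after the heading: the column-header row first, then data rows
        siblings = heading.find_next_siblings("tr")
        for row in siblings[1:]:  # skip the LOCATION/PLANNED/... row
            if row.get("bgcolor") == HEADING_COLOR:
                break  # reached the next candy heading
            # Assumption: every data row has exactly four <td> cells
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            location, planned, actual, diff = cells
            yield {
                "candy_type": candy,
                "location": location,
                # Assumption: the numeric columns always contain integers
                "planned": int(planned),
                "actual": int(actual),
                "diff": int(diff),
            }


if __name__ == "__main__":
    for record in iter_candy_rows(HTML):
        print(record)
```

Reading each cell with `get_text(strip=True)` rather than splitting the row text on newlines means stray whitespace between the td tags should not leak into the values.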