2012-01-04 230 views

Answers

31

You can also use findAll to get a list of all the rows, then use slice syntax to access just the elements you need:

rows = soup.findAll('tr')[4::5] 
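
To see what that slice picks out, here is a minimal sketch that uses a plain Python list as a stand-in for the result of findAll. The `rows` values are made up for the illustration, and the snippet is written for Python 3:

```python
# Stand-in for the list returned by soup.findAll('tr'): ten row labels.
rows = ["row%d" % n for n in range(1, 11)]

# Start at index 4 (the fifth item) and take every fifth item after it.
every_fifth = rows[4::5]
print(every_fifth)  # ['row5', 'row10']
```

The start index is 4 rather than 5 because list indices are zero-based.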
+0

This is very clean. Note that the findAll method returns an array, so this works fine. – JasTonAChair 2015-11-06 02:51:00

1

As a general solution, you can convert the table to a nested list and iterate over it...

import BeautifulSoup  # BeautifulSoup 3 (Python 2)

def listify(table):
    """Convert an html table to a nested list"""
    result = []
    rows = table.findAll('tr')
    for row in rows:
        # One inner list per table row
        result.append([])
        cols = row.findAll('td')
        for col in cols:
            # Join all text nodes inside the cell into a single string
            strings = [_string.encode('utf8') for _string in col.findAll(text=True)]
            text = ''.join(strings)
            result[-1].append(text)
    return result

if __name__=="__main__":
    # Build a small table with one column and ten rows, then parse it into a list
    htstring = """<table> <tr> <td>foo1</td> </tr> <tr> <td>foo2</td> </tr> <tr> <td>foo3</td> </tr> <tr> <td>foo4</td> </tr> <tr> <td>foo5</td> </tr> <tr> <td>foo6</td> </tr> <tr> <td>foo7</td> </tr> <tr> <td>foo8</td> </tr> <tr> <td>foo9</td> </tr> <tr> <td>foo10</td> </tr></table>"""
    soup = BeautifulSoup.BeautifulSoup(htstring)
    for idx, ii in enumerate(listify(soup)):
        # Skip every row whose 1-based position is not a multiple of 5
        if ((idx+1)%5>0):
            continue
        print ii

Running it gives...

$ python testme.py
['foo5']
['foo10']
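
For anyone on Python 3, here is a rough sketch of the same nested-list idea written against the bs4 package. It is a port under my own assumptions, not the answer's original code: `find_all` and `get_text` are the bs4 spellings, and the ten-row table is generated just for the example.

```python
from bs4 import BeautifulSoup

def listify(table):
    """Convert an HTML table to a nested list of cell strings."""
    return [
        [cell.get_text() for cell in row.find_all("td")]
        for row in table.find_all("tr")
    ]

# Made-up table with one column and ten rows.
html = "<table>" + "".join(
    "<tr><td>foo%d</td></tr>" % n for n in range(1, 11)
) + "</table>"
soup = BeautifulSoup(html, "html.parser")

# Keep the 5th, 10th, ... rows of the nested list.
every_fifth = listify(soup)[4::5]
print(every_fifth)  # [['foo5'], ['foo10']]
```

Once the table is a plain nested list, picking every fifth row is an ordinary list operation and no longer depends on the parser.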
1

Another option, if you'd prefer the raw HTML...

import BeautifulSoup  # BeautifulSoup 3 (Python 2)

# Build a small table with one column and ten rows, then parse it into a list
htstring = """<table> <tr> <td>foo1</td> </tr> <tr> <td>foo2</td> </tr> <tr> <td>foo3</td> </tr> <tr> <td>foo4</td> </tr> <tr> <td>foo5</td> </tr> <tr> <td>foo6</td> </tr> <tr> <td>foo7</td> </tr> <tr> <td>foo8</td> </tr> <tr> <td>foo9</td> </tr> <tr> <td>foo10</td> </tr></table>"""
soup = BeautifulSoup.BeautifulSoup(htstring)
# Keep every <tr> whose 1-based position is a multiple of 5
result = [html_tr for idx, html_tr in enumerate(soup.findAll('tr'))
    if (idx+1)%5==0]
print result

Running it gives...

$ python testme.py
[<tr> <td>foo5</td> </tr>, <tr> <td>foo10</td> </tr>]
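
A small variation on the filter, sketched with a plain list standing in for the parsed rows (the `rows` strings are placeholders, and this is Python 3): passing `start=1` to `enumerate` gives 1-based positions directly, so the `idx+1` adjustment goes away.

```python
# Hypothetical stand-in for soup.findAll('tr'): ten placeholder rows.
rows = ["<tr>foo%d</tr>" % n for n in range(1, 11)]

# start=1 makes pos the 1-based row number.
every_fifth = [row for pos, row in enumerate(rows, start=1) if pos % 5 == 0]
print(every_fifth)  # ['<tr>foo5</tr>', '<tr>foo10</tr>']
```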
1

This is easy to do with select in Beautiful Soup if you know which row numbers to select. (Note: this is in BS4)

row = 5
while True:
    # select() returns a list of matches; it is empty once row passes the last row
    element = soup.select('tr:nth-of-type(' + str(row) + ')')
    if len(element) > 0:
        # element holds your desired row element(s), do what you want with it
        row += 5
    else:
        break
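
If your bs4 release understands the `An+B` form of `:nth-of-type` (recent releases should, though older ones may accept only a plain integer), the loop can probably be collapsed into a single selector. This is a sketch under that assumption, with a made-up ten-row table:

```python
from bs4 import BeautifulSoup

# Made-up table with one column and ten rows.
html = "<table>" + "".join(
    "<tr><td>foo%d</td></tr>" % n for n in range(1, 11)
) + "</table>"
soup = BeautifulSoup(html, "html.parser")

# 5n matches the 5th, 10th, 15th, ... <tr> among its siblings.
rows = soup.select("tr:nth-of-type(5n)")
texts = [row.get_text() for row in rows]
print(texts)  # ['foo5', 'foo10']
```

Keep in mind that `:nth-of-type` counts siblings under the same parent, so a page with several tables will return every fifth row of each one.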