Parsing <tr> tags with BeautifulSoup, having trouble extracting values

I have some HTML that looks like this:

<tr> 
    <td>some text</td> 
    <td>some other text</td> 
    <td>some <b>problematic</b> other <br /> text</td> 
</tr>

And some Python that tries to grab the values inside the tag and print each inner value:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

soup = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES)
for row in soup.findAll('tr'):
    print repr(row)  # this prints the whole 'tr' element text just fine.
    for col in row.contents:
        print col.string

So the full HTML of the row prints correctly, but the "col" loop prints None for the last element:

some text 
some other text 
None

I'm not very familiar with BeautifulSoup or Python, but it seems that the inner tags in the last element are causing the parsing problem?

Thanks

Source

2013-03-13 user291701

You can upgrade to BeautifulSoup version 4 and use .stripped_strings:

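from bs4 import BeautifulSoup

# Naming the parser explicitly avoids bs4's "no parser specified" warning.
soup = BeautifulSoup(data, 'html.parser')
for row in soup.find_all('tr'):
    print('\n'.join(row.stripped_strings))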
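Note that calling .stripped_strings on the row prints every text fragment on its own line, so the third cell is split into several lines. If you want one value per cell instead, a minimal sketch (assuming BeautifulSoup 4 with the built-in 'html.parser'; the names HTML and cell_texts are made up for illustration) could join the fragments of each <td>:

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
HTML = """
<tr>
    <td>some text</td>
    <td>some other text</td>
    <td>some <b>problematic</b> other <br /> text</td>
</tr>
"""


def cell_texts(html):
    """Return one list per <tr>, holding each <td>'s text as a single string."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.find_all("tr"):
        # Join each cell's stripped fragments with a space so that nested
        # tags such as <b> and <br /> do not split the cell into pieces.
        rows.append([" ".join(td.stripped_strings) for td in row.find_all("td")])
    return rows


if __name__ == "__main__":
    for row in cell_texts(HTML):
        for value in row:
            print(value)
```

Joining with a space is a choice made here for readability; use another separator if the line break implied by <br /> matters to you.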

In BeautifulSoup 3, you need to search for all contained text nodes instead:

for row in soup.findAll('tr'):
    # text=True matches every text node under the row; skip whitespace-only ones.
    print '\n'.join(el.strip() for el in row.findAll(text=True) if el.strip())
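As for why the original loop printed None: .string only returns a value when a tag has exactly one child, and the last cell has several (text, a <b> tag, more text, a <br /> tag, more text). The sketch below shows this with BeautifulSoup 4 and 'html.parser', which is an assumption on my part since the question uses version 3; the variable names are illustrative only:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<tr><td>some text</td>"
    "<td>some <b>problematic</b> other <br /> text</td></tr>",
    "html.parser",
)
simple, mixed = soup.find_all("td")

# A tag with a single text child exposes that text through .string.
simple_string = simple.string

# A tag with several children has no single string, so .string is None.
mixed_string = mixed.string

# get_text() gathers all nested text; here the fragments are stripped
# and joined with a space.
mixed_text = mixed.get_text(" ", strip=True)

print(repr(simple_string))
print(repr(mixed_string))
print(repr(mixed_text))
```

So the parser is not failing on the inner tags; the cell just has to be read with something that walks all of its text nodes.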

Source

2013-03-13 16:13:49
