Removing tags from HTML parsed with BeautifulSoup

I'm new to Python, and I'm using BeautifulSoup to parse a website and then extract data. I have the following code:

for line in raw_data: #raw_data is the parsed html separated into smaller blocks 
    d = {} 
    d['name'] = line.find('div', {'class':'torrentname'}).find('a') 
    print d['name'] 

This prints:

<a href="/ubuntu-9-10-desktop-i386-t3144211.html"> 
<strong class="red">Ubuntu</strong> 9.10 desktop (i386)</a>

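Normally I would be able to extract 'Ubuntu 9.10 desktop (i386)' by writing

d['name'] = line.find('div', {'class':'torrentname'}).find('a').string

but this returns None because of the <strong> HTML tag. Is there a way to extract the <strong> tag and then use .string, or is there a better approach? I tried using BeautifulSoup's extract() function, but I couldn't get it to work.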

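Edit: I just realized that my solution doesn't work if there are two sets of <strong> tags, because the whitespace between the two words gets dropped. How can I fix this?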

Source

2010-08-27 FlowofSoul

Related: http://stackoverflow.com/questions/598817/python-html-removal/599080#599080 – jfs 2011-01-09 22:06:19

Use the ".text" attribute:

d['name'] = line.find('div', {'class':'torrentname'}).find('a').text

Or join the results of findAll(text=True):

anchor = line.find('div', {'class':'torrentname'}).find('a') 
d['name'] = ''.join(anchor.findAll(text=True))
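
For anyone trying this today, here is a minimal sketch of the same idea. It assumes BeautifulSoup 4 (the bs4 package) with the built-in "html.parser" backend, where find_all(string=True) is the current spelling of findAll(text=True). The HTML string is rebuilt from the output shown in the question and written on one line, so treat it as an illustration rather than the real page.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

# Markup reconstructed from the question; the wrapping <div> is assumed.
html = (
    '<div class="torrentname">'
    '<a href="/ubuntu-9-10-desktop-i386-t3144211.html">'
    '<strong class="red">Ubuntu</strong> 9.10 desktop (i386)</a>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
anchor = soup.find("div", {"class": "torrentname"}).find("a")

# .string only works when a tag has exactly one child. Here <a> has two
# (the <strong> tag and a trailing text node), so .string is None.
assert anchor.string is None

# find_all(string=True) returns every text node below the anchor,
# in document order; joining them gives the visible text.
pieces = anchor.find_all(string=True)
name = "".join(pieces)
print(name)
```

In BeautifulSoup 4, anchor.get_text() should give the same string, since it also concatenates the text nodes without stripping them.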

Source

2010-08-29 03:54:02

This doesn't work. It doesn't keep the space in something like this: Ubuntu Linux. It comes out as UbuntuLinux. – FlowofSoul 2010-08-29 04:24:05

I've updated the answer with an additional option. – 2010-08-29 05:29:17

Thanks a lot, that works great! Could you explain how the second line of code works? – FlowofSoul 2010-08-29 15:29:33
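
On how the join line works: findAll(text=True) returns only the text nodes under the anchor and skips the tags themselves, and ''.join(...) glues those strings back together with nothing in between. Whitespace that sits between two tags is itself a text node, so it survives the join. The sketch below imitates that idea using only the standard library's html.parser. It is not BeautifulSoup's implementation; TextCollector and strip_tags are hypothetical names, and the two-tag markup is an assumed example of the case described in the question's edit.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records every text node and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        # Called once for each run of text between tags, whitespace included.
        self.pieces.append(data)


def strip_tags(html):
    """Return the text of an HTML fragment with all tags removed."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.pieces)


one = strip_tags(
    '<a href="/ubuntu-9-10-desktop-i386-t3144211.html">'
    '<strong class="red">Ubuntu</strong> 9.10 desktop (i386)</a>'
)
# Assumed shape of the two-tag case: a single space between the tags.
two = strip_tags('<a><strong>Ubuntu</strong> <strong>Linux</strong></a>')
print(one)
print(two)
```

Because each piece is joined unchanged, the space between the two <strong> tags is kept. Stripping each piece before joining is what would run the words together.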
