BeautifulSoup: parsing part of a page

I want to parse a fragment of an HTML page, say:

my_string = """ 
<p>Some text. Some text. Some text. Some text. Some text. Some text. 
    <a href="#">Link1</a> 
    <a href="#">Link2</a> 
</p> 
<img src="image.png" /> 
<p>One more paragraph</p> 
"""

I pass this string to BeautifulSoup:

soup = BeautifulSoup(my_string) 
# add rel="nofollow" to <a> tags 
# return comment to the template

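But while parsing, BeautifulSoup adds <html>, <head> and <body> tags (when the lxml or html5lib parser is used), and I don't need these tags. The only way I have found so far to avoid them is to use html.parser.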
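A minimal sketch of the html.parser workaround, assuming the bs4 package (BeautifulSoup 4) is installed. The helper name add_nofollow is made up for illustration and the sample string is a shortened copy of the one above. It parses the fragment, sets rel="nofollow" on every link, and serializes the result with no wrapper tags.

```python
from bs4 import BeautifulSoup


def add_nofollow(fragment):
    """Return `fragment` with rel="nofollow" set on every <a> tag.

    Hypothetical helper, not from the original post. It relies on
    html.parser, which does not wrap a fragment in <html>/<body>.
    """
    soup = BeautifulSoup(fragment, "html.parser")
    for link in soup.find_all("a"):
        link["rel"] = "nofollow"
    return str(soup)


my_string = """
<p>Some text.
    <a href="#">Link1</a>
    <a href="#">Link2</a>
</p>
<img src="image.png" />
<p>One more paragraph</p>
"""

comment = add_nofollow(my_string)
print(comment)
```

The returned string can be handed to the template as-is, since it contains only the markup that was passed in plus the new attribute.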

I wonder whether there is a way to get rid of the redundant tags while still using lxml, the fastest parser.

UPDATE

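My question was originally asked incorrectly. I have now removed the <div> wrapper from my example, because ordinary users do not use this tag. For this reason, we cannot use the .extract() method to get rid of the <html>, <head> and <body> tags.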

Source

2012-06-30 Vlad T.

Have you tried using MinimalSoup instead of BeautifulSoup? (Same library, different constructor.) It should be less strict about this kind of thing.

I tried it, but I don't understand how it works.

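I was able to solve the problem using the .contents property:

# The wrapper name is hypothetical; the original snippet showed only the body.
def body_contents_as_string(soup):
    try:
        # Serialize only the children of <body>, skipping the added wrapper tags.
        children = soup.body.contents
        string = ''
        for child in children:
            string += str(child)
        return string
    except AttributeError:
        # soup.body is None when the parser added no <body> tag.
        return str(soup)

I thought that ''.join(soup.body.contents) would be a neater way to turn the list into a string, but it does not work, because the list holds Tag objects rather than strings, and I get:

TypeError: sequence item 0: expected string, Tag found

Source

2012-07-11 22:39:52

lxml will always add these tags, but you can use Tag.extract() to pull your <div> tag out from inside them:

comment = soup.body.div.extract()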
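A small sketch of what extract() does, assuming the bs4 package is installed. The wrapped markup below is written by hand to mimic the tags lxml would add around a <div>-wrapped comment, so the example runs with html.parser alone; the sample content is illustrative.

```python
from bs4 import BeautifulSoup

# Assumption: this markup mimics what lxml would build around a comment
# wrapped in <div>; it is written out by hand so the sketch runs with
# html.parser alone.
wrapped = "<html><body><div><p>Some text.</p><p>One more paragraph</p></div></body></html>"

soup = BeautifulSoup(wrapped, "html.parser")

# extract() detaches the <div> from the tree and returns it as a Tag.
comment = soup.body.div.extract()

print(str(comment))
print(soup.body.div)  # the <div> is no longer inside <body>
```

This only helps while the input is known to have a single wrapping element, which is the limitation the UPDATE in the question points out.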

Source

2012-07-01 15:19:50

Use:

soup.body.renderContents()
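The call above and the failed ''.join(...) attempt in the other answer aim at the same thing: serializing the children of <body> without the wrapper. A hedged sketch, assuming BeautifulSoup 4, where decode_contents() returns the inner markup as a str; the helper name inner_html and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def inner_html(soup):
    """Serialize the children of <body>, or of the whole soup if there is none.

    Hypothetical helper, not from the original answers.
    """
    container = soup.body if soup.body is not None else soup
    return container.decode_contents()


# Hand-written wrapper tags stand in for what lxml or html5lib would add.
soup = BeautifulSoup(
    '<html><body><p>Hi <a href="#">Link1</a></p><img src="image.png"/></body></html>',
    "html.parser",
)

print(inner_html(soup))

# The join from the other answer works once each child is converted with str().
joined = "".join(str(child) for child in soup.body.contents)
print(joined)
```

The fallback to the soup itself covers parsers such as html.parser that add no <body>, so the same helper can be used whichever parser is chosen.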

Source

2012-12-05 09:22:00
