BeautifulSoup: findAll('table') returns all tables, but also the text between them

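I want to isolate a section of a web page, but unfortunately it is not contained in any tag that I can pull out.

The closest I have been able to get is to grab the entire body of the page and then try to remove the tables (the only part I do not want).

The code I am using: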

storyText = soup.body 
toRemove = storyText.findAll('table') 
for each in toRemove: 
    print(each)

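The biggest problem right now is that the removal line from the documentation returns the tables together with the text between them, even though that text is not inside them.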

So what I have ended up with is:

# Isolate body 
findBody = soup.body 
new = str(findBody) 
# Section off the text from the tables before it. 
sec = new.split('</table>') 
# Select story area 
newStory = sec[3] 
# Section off the text from the tables after it. 
newSec = newStory.split('<table') 
# Select the story area, this the area that we want. 
story = newSec[0]

I am still looking for an answer, because it seems there should be a cleaner way to handle a page like this:

<body> 
<table> 
    table stuff 
</table> 
    Text, not in tags </br> #This is what I want. 
<table> 
    table stuff 
</table> 
</body>

For now, the string-splitting code in my question is the workaround I am using.


2013-09-22 DasSnipez

So on the example page you are trying to get all of the text? – Serial

To start with, yes. – DasSnipez

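Your code works fine on my Mac. Which version are you using? I used Beautiful Soup 4.

(Beautiful Soup 3 is not recommended because it is no longer being developed; see http://www.crummy.com/software/BeautifulSoup/bs4/doc/ )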
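If it is unclear which version is installed, a quick check along these lines may help. This is only a sketch: it assumes Beautiful Soup 4 is importable as `bs4` (Beautiful Soup 3 was imported as `BeautifulSoup` instead), and the `version` variable is just an illustrative name.

```python
import bs4

# Beautiful Soup 4 lives in the `bs4` package and exposes its version string.
version = bs4.__version__
print("Beautiful Soup version:", version)
```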

Here is my code:

from bs4 import BeautifulSoup 

contents = '''<body> 
<table> 
    table stuff1 
</table> 
    Text, not in tags </br> #This is what I want. 
<table> 
    table stuff2 
</table> 
</body>''' 

soup = BeautifulSoup(contents) 

storyText = soup.body 
toRemove = storyText.findAll('table') 
for each in toRemove: 
    print(each) 
    each.extract() 

print('----result-------------') 
print(soup)

This produces the following output:

<table> 
    table stuff1 
</table> 
<table> 
    table stuff2 
</table> 
----result------------- 
<body> 

    Text, not in tags #This is what I want. 

</body>
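An alternative that leaves the tree untouched is to collect only the strings that are direct children of `<body>`. The following is a minimal sketch, not part of the original answer. It assumes the wanted text sits directly under `<body>`, as in the sample, and it uses a simplified copy of that sample with the parser named explicitly; `loose_text` and `story` are illustrative names.

```python
from bs4 import BeautifulSoup, NavigableString

# Simplified copy of the sample page (assumption: the text is a direct child of <body>).
html = """<body>
<table>
    table stuff1
</table>
    Text, not in tags
<table>
    table stuff2
</table>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the non-empty strings that are direct children of <body>.
# Text nested inside a <table> belongs to that table, so it is skipped.
loose_text = [
    node.strip()
    for node in soup.body.children
    if isinstance(node, NavigableString) and node.strip()
]
story = " ".join(loose_text)
print(story)
```

If the parser nests the wanted text inside a table (for example because of malformed markup), this approach will miss it in the same way that removing the tables does.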


2013-09-22 05:28:42 lancif

I am using BS4, and I tried the code that works for you, but it does not do what I need: it removes the tables *and* the text I want from the soup. – DasSnipez

Hmm, this may be caused by the default HTML parser. Change line 13 of my code to: soup = BeautifulSoup(contents, 'html5lib') You can install html5lib with any one of these commands: $ apt-get install python-html5lib, $ easy_install html5lib, or $ pip install html5lib – lancif
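To see whether the parser makes a difference, one option is to run the same removal under each available parser and compare what text survives. This is a hedged sketch: the `broken` markup is invented for illustration (a table that is never closed), since the asker's real page is not shown, and the `results` dictionary is an illustrative name. Parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical malformed markup: the first <table> is never closed.
broken = (
    "<body><table><tr><td>cell</td></tr>"
    "loose text"
    "<table><tr><td>x</td></tr></table></body>"
)

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(broken, parser)
    except FeatureNotFound:
        # The requested parser library is not installed; skip it.
        print(parser, "is not installed")
        continue
    # Remove every table, then record whatever text is left.
    for table in soup.find_all("table"):
        table.extract()
    results[parser] = soup.get_text()
    print(parser, "->", repr(results[parser]))
```

Each parser repairs invalid markup in its own way, so text that one parser places inside a table may end up outside it with another. That could explain why the same removal code keeps the text on one machine and drops it on another.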
