2009-12-01 87 views
2

Can I merge two 'findAll' search blocks into one?

Edit: I am looking for approaches other than merging the loops as Yacoby did.

for tag in soup.findAll(['script', 'form']): 
    tag.extract() 

for tag in soup.findAll(id="footer"): 
    tag.extract() 

Can I also merge several blocks like these into one:

for tag in soup.findAll(id="footer"): 
    tag.extract() 

for tag in soup.findAll(id="content"): 
    tag.extract() 

for tag in soup.findAll(id="links"): 
    tag.extract() 

Or is there some lambda expression that lets me check whether the id is in a list, or any other simpler approach?

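Also, how can I find tags by their class attribute, given that class is a reserved keyword in Python?

Edit: this part is solved by soup.findAll(attrs={'class': 'noprint'}). The attempt below is a syntax error: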

for tag in soup.findAll(class="noprint"): 
    tag.extract() 
+0

You will get better results if you ask only one question per post. – hop 2009-12-01 10:03:26

Answers

7

You can pass a function to .findAll() like this:

soup.findAll(lambda tag: tag.name in ['script', 'form'] or tag.get('id') == "footer") 

But you are probably better off building a list of tags first and then iterating over it:

tags = soup.findAll(['script', 'form']) 
tags.extend(soup.findAll(id="footer")) 

for tag in tags: 
    tag.extract() 

If you want to filter on several ids, you can use:

for tag in soup.findAll(lambda tag: tag.has_key('id') and 
            tag['id'] in ['footer', 'content', 'links']): 
    tag.extract() 

A more targeted approach is to pass a lambda as the id argument:

for tag in soup.findAll(id=lambda value: value in ['footer', 'content', 'links']): 
    tag.extract() 
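
For readers on the newer bs4 package (an assumption; the code in this thread targets BeautifulSoup 3), the name check and the id check can probably be folded into one named filter function. This is only a sketch: the sample HTML, the two sets, and the helper name `is_unwanted` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for this sketch.
HTML = """
<html><body>
<script>var x = 1;</script>
<form action="/search"><input name="q"></form>
<div id="content"><p>Keep me</p></div>
<div id="links"><a href="/a">A</a></div>
<div id="footer">Footer text</div>
<p id="intro">Intro</p>
</body></html>
"""

UNWANTED_NAMES = {"script", "form"}
UNWANTED_IDS = {"footer", "links"}


def is_unwanted(tag):
    """Return True for tags to strip, matched either by name or by id."""
    # tag.get() returns None when the attribute is absent, so no KeyError.
    return tag.name in UNWANTED_NAMES or tag.get("id") in UNWANTED_IDS


soup = BeautifulSoup(HTML, "html.parser")

# find_all() builds the full result list first, so extracting while
# looping over it does not disturb the iteration.
removed = [tag.extract() for tag in soup.find_all(is_unwanted)]
removed_names = [tag.name for tag in removed]
remaining_ids = [tag["id"] for tag in soup.find_all(id=True)]
```

Keeping the names and ids in two sets means adding another id is a one-word change rather than another loop.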
+0

I get an error: SyntaxError: invalid syntax – 2009-12-01 10:47:08

+0

SyntaxError? Strange... you should have gotten a TypeError. – hop 2009-12-01 10:51:27

+0

Fixed the TypeError in the soup.findAll call. – hop 2009-12-01 11:03:08

4

I don't know whether BeautifulSoup offers a more elegant way to do this, but you can merge the two loops like so:

for tag in soup.findAll(['script', 'form']) + soup.findAll(id="footer"): 
    tag.extract() 

You can find tags by class like this (Documentation):

for tag in soup.findAll(attrs={'class': 'noprint'}): 
    tag.extract() 
+0

It works well, but chaining many searches as ... + ... + ... + ... does not look clean. Is there a better way? – 2009-12-01 10:33:30
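
One possible way around the long `... + ...` chain, assuming the newer bs4 package rather than the BeautifulSoup 3 used in this thread, is a single CSS selector with comma-separated parts passed to `select()`. The sample HTML and the `SELECTOR` string below are invented for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for this sketch.
HTML = """
<div id="content"><p>Body text</p><p class="noprint">Screen only</p></div>
<script>track();</script>
<form><input name="q"></form>
<div id="footer">Footer</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# One comma-separated selector stands in for findAll(...) + findAll(...) + ...
SELECTOR = "script, form, #footer, .noprint"

for tag in soup.select(SELECTOR):
    tag.decompose()  # like extract(), but also destroys the removed subtree

remaining_text = soup.get_text(" ", strip=True)
```

Each new tag name, id, or class then becomes one more entry in the selector string instead of one more search.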

0

The answer to the second part of your question is in the documentation:

Searching by CSS class

The attrs argument would be a pretty obscure feature were it not for one thing: CSS. It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, class, is also a Python reserved word.

You could search by CSS class with soup.find("tagName", { "class" : "cssClass" }), but that's a lot of code for such a common operation. Instead, you can pass a string for attrs instead of a dictionary. The string will be used to restrict the CSS class.

from BeautifulSoup import BeautifulSoup 
soup = BeautifulSoup("""Bob's <b>Bold</b> Barbeque Sauce now available in 
        <b class="hickory">Hickory</b> and <b class="lime">Lime</a>""") 

soup.find("b", { "class" : "lime" }) 
# <b class="lime">Lime</b> 

soup.find("b", "hickory") 
# <b class="hickory">Hickory</b> 
0
With bs4 you can pass class_ to find_all to filter on class values, for example links = soup.find_all('a', class_='external'):

from bs4 import BeautifulSoup 
from urllib.request import urlopen 

with urlopen('http://www.espncricinfo.com/') as f: 
    raw_data = f.read() 

soup = BeautifulSoup(raw_data, 'lxml') 
# print(soup) 
links = soup.find_all('a', class_='external') 
for link in links: 
    print(link) 
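
To try the class filter without a network request, here is a small offline sketch that applies it to the original goal of stripping `noprint` elements. It assumes bs4 with the built-in `html.parser`; the sample HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for this sketch.
HTML = """
<p class="noprint">Hidden when printing</p>
<p class="note noprint">Also hidden</p>
<p class="note">Visible</p>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Both spellings avoid the reserved word `class` as a keyword argument.
by_keyword = soup.find_all(class_="noprint")
by_attrs = soup.find_all(attrs={"class": "noprint"})
same_result = by_keyword == by_attrs

# A tag matches if "noprint" is any one of its classes.
for tag in by_keyword:
    tag.extract()

remaining = [p.get_text() for p in soup.find_all("p")]
```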