Scraping a web page in Python

I want to scrape all ~62,000 names from this petition using Python. I am trying to use the beautifulsoup4 library.

However, it is just not working.

Here is my code so far:

import urllib2, re
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())

divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs
[]

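What am I doing wrong? Also, I would like to somehow move on to the next page and add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.

Source

2013-07-26 cevn

What does 'list' contain? Also, please don't use the variable name 'list', as it shadows the Python builtin of the same name. Scrapy would make crawling every page trivial, but it involves using/learning the Scrapy framework. – dm03514

Just to note: 1) it doesn't look like the site's AUP allows this, and 2) even if you did a simple loop over next page, next page, next page and so on, you would probably end up being blocked, because you are going to be making a hell of a lot of requests... Why not just email them and ask whether the information you want is available? –

It doesn't contain anything. I'll update that, then. I'll try emailing them now, but I'd still like to try this problem. – cevn

In most cases it is very inconsiderate to simply scrape a site. You put a fairly large load on the site in a short amount of time, slowing down legitimate users' requests. Not to mention stealing all of their data.

Consider an alternative approach, such as asking (politely) for a dump of the data (as mentioned above).

Or, if you absolutely do need to scrape:

Space out your requests using a timer
Scrape smartly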
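The first point can be sketched in a few lines. This is only an illustration, written for Python 3 rather than the urllib2-era Python used elsewhere in this thread; the `Throttle` class and its parameter names are made up for the example, and a sensible interval depends on the site.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait().

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without sleeping
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        """Sleep just long enough that calls are at least min_interval apart.

        Returns the number of seconds slept.
        """
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay
```

Calling `throttle.wait()` immediately before each request keeps the requests at least `min_interval` seconds apart, however long the parsing in between takes.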

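I took a quick look at that page, and it appears to me that they use AJAX to request the signatures. Why not simply copy their AJAX request? It will most likely be some sort of REST call. By doing this you lessen the load on their server, because you only request the data you need. It will also be easier to actually process the data, because it comes in a nice format.

Re-edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.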
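A check like this can be automated with the standard library's `urllib.robotparser` (Python 3). The robots.txt text below is a made-up sample modelled on the rule described above, not a copy of the site's real file, and `is_allowed` is a helper name invented for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, modelled on the /xml/ rule mentioned above.
# The site's real file may contain other rules.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /xml/
"""


def is_allowed(robots_text, url, user_agent="*"):
    """Return True if robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)
```

In a real script the parser would normally be pointed at the live file with `set_url()` and `read()`, and `is_allowed` would be consulted before every request.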

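Source

2013-07-26 16:37:18 Blaine

I would be happy to use another approach, but I don't know how. Could you help me make a request to wherever the signatures are located? I sent an email, to no avail. – cevn

What do you mean by "not working"? An empty list, or an error?

If you are getting an empty list, it is because the class "name_location" does not exist in the document. Also check out the bs4 documentation on findAll.

Source

2013-07-26 16:30:53

It is an empty list. That's strange, because the class appears to exist when I inspect the element in Chrome, but not when I view the source, now that you mention it. – cevn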
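That difference between "inspect element" and "view source" is what you would expect if the signatures are filled in by a script after the page loads: a parser fed the raw response never sees them. A rough way to check this is to count the class in the raw HTML. The sketch below uses only Python 3's standard `html.parser`; both HTML snippets are invented for illustration and are not taken from the petition site.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute includes a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html_text, class_name):
    """Return how many elements in html_text carry class_name."""
    counter = ClassCounter(class_name)
    counter.feed(html_text)
    counter.close()
    return counter.count


# Hypothetical markup: what a server might send before any script has run...
STATIC_HTML = '<div id="signatures"></div><script src="sigs.js"></script>'
# ...and what the browser's DOM might contain after the script has run.
RENDERED_HTML = (
    '<div id="signatures">'
    '<div class="name_location">Jane Doe, Exampleville</div>'
    '<div class="name_location">John Roe, Sampletown</div>'
    '</div>'
)
```

If `count_class` returns zero for the downloaded page while the browser's inspector shows the elements, the data is arriving through a separate request.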

You could try something like this:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')

# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')

results = []
while True:
    # Read the web page in XML mode (the "xml" parser requires lxml)
    soup = BeautifulSoup(html.read(), "xml")

    for s in soup.find_all("signature"):
        # Scrape the names from the XML, skipping incomplete signatures
        try:
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
        except (AttributeError, IndexError):
            continue
        # Build a unicode string so non-ASCII names do not raise an error
        results.append(u"{} {}".format(firstname, lastname))

    # Find the link to the previous (older) page of signatures
    prev = soup.find("prev_signature")

    # Check if another page of results exists - if not, break from the loop
    if prev is None:
        break

    # Get the URL of that page
    url = prev.contents[0]

    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))

# Print the results
print("\n")
print("====================")
print("= Printing Results =")
print("====================\n")
print(results)

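Be warned, though: there is a lot of data to go through, and I am not sure whether this is against the site's terms of service, so you will need to check that.

Source

2013-07-26 18:55:17 Hayden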
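Since a full run touches a lot of pages, it may help to try the parsing step offline first. The sketch below does that with Python 3's standard `xml.etree.ElementTree` instead of bs4. The element names are borrowed from the code above, but the sample document, the wrapper element, and the `parse_page` helper are assumptions made for illustration; the real feed may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page. Only the element names signature, firstname,
# lastname and prev_signature come from the code above; the rest is invented.
SAMPLE_XML = """\
<signatures>
  <signature><firstname>Jane</firstname><lastname>Doe</lastname></signature>
  <signature><firstname>José</firstname><lastname>Núñez</lastname></signature>
  <signature><firstname>Anonymous</firstname></signature>
  <prev_signature>http://example.com/signatures/00/00/00/04.xml</prev_signature>
</signatures>
"""


def parse_page(xml_text):
    """Return (names, prev_url) for one page of signatures.

    Signatures missing either name are skipped; prev_url is None when
    the page does not link to an older page.
    """
    root = ET.fromstring(xml_text)
    names = []
    for sig in root.iter("signature"):
        first = sig.findtext("firstname")
        last = sig.findtext("lastname")
        if first is None or last is None:
            continue
        names.append("{} {}".format(first.strip(), last.strip()))
    prev_url = root.findtext(".//prev_signature")
    return names, (prev_url.strip() if prev_url else None)
```

Keeping the parsing in a function that takes text and returns data means it can be checked against a saved page without sending any requests.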
