Newbie here. I wrote a simple script using urllib2 that crawls Billboard.com and scrapes the number-one song and artist for every week from 1958 to 2013. The problem is that it is very slow - it takes hours to finish.

I would like to know where the bottleneck is, and whether there is a way to use urllib2 more efficiently or whether I need a more sophisticated tool.

import re 
import urllib2 
array = [] 
url = 'http://www.billboard.com/charts/1958-08-09/hot-100' 
date = "" 
while date != '2013-07-13': 
    response = urllib2.urlopen(url) 
    htmlText = response.read() 
    date = re.findall('\d\d\d\d-\d\d-\d\d',url)[0] 
    song = re.findall('<h1>.*</h1>', htmlText)[0] 
    song = song[4:-5] 
    artist = re.findall('/artist.*</a>', htmlText)[1] 
    artist = re.findall('>.*<', artist)[0] 
    artist = artist[1:-1] 
    nextWeek = re.findall('href.*>Next', htmlText)[0] 
    nextWeek = nextWeek[5:-5] 
    array.append([date, song, artist]) 
    url = 'http://www.billboard.com' + nextWeek 
print array 

[Scrapy](https://scrapy.readthedocs.org/) would do this much better; it is the tool for the job, of course. Let me know if you can use it - I will write you a sample spider. – alecxe


Improvements would include not using urllib2, not using regular expressions to parse HTML, and using multiple threads to perform your I/O. – roippi
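To make the "don't parse HTML with regex" point concrete, here is a minimal sketch using lxml.html on a single chart page. The XPath expressions are assumptions about the page markup of the time (mirroring the selectors used in the Scrapy answer below), not something taken from the original post:

import urllib2
import lxml.html

# fetch one chart page and parse it into an element tree
html = urllib2.urlopen('http://www.billboard.com/charts/1958-08-09/hot-100').read()
doc = lxml.html.fromstring(html)

# XPath queries replace the brittle regexes; these expressions are assumptions
# about the chart page structure
date = doc.xpath('//span[@class="chart_date"]/text()')[0].strip()
song = doc.xpath('//div[@class="listing chart_listing"]//h1/text()')[0].strip()
artist = doc.xpath('//div[@class="listing chart_listing"]//p[@class="chart_info"]/a/text()')[0].strip()
print date, song, artist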


I sincerely doubt 'urllib2' has anything to do with any efficiency problem. All it does is send the request and pull down the response; 99.99% of that time is network time, and there is no other way to improve it. The problem is that (a) your parsing code may be slow, (b) you may be doing a lot of duplicate or unnecessary downloads, (c) you need to parallelize the downloads (which you can do with 'urllib2'), (d) you need a faster network connection, or (e) billboard.com is throttling you. – abarnert
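For point (c), a minimal sketch of parallelizing the downloads while still using urllib2, via a thread pool (multiprocessing.dummy provides a thread-backed Pool with the same API as multiprocessing.Pool; the two URLs here are only illustrative):

import urllib2
from multiprocessing.dummy import Pool  # thread pool, same API as multiprocessing.Pool

def fetch(url):
    # each call runs in a worker thread, so the network waits overlap
    return url, urllib2.urlopen(url).read()

urls = ['http://www.billboard.com/charts/1958-08-09/hot-100',
        'http://www.billboard.com/charts/1958-08-16/hot-100']

pool = Pool(8)  # up to 8 downloads in flight at once
pages = pool.map(fetch, urls)
pool.close()
pool.join()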

Answers


Here is a solution using Scrapy. Take a look at the overview and you will see that it is the tool designed for exactly this kind of task:

  • it is fast (based on Twisted)
  • easy to use and understand
  • built-in extraction mechanism based on XPath (you can use BeautifulSoup or lxml too, though)
  • built-in support for pipelining the extracted items into a database, XML, JSON, whatever
  • and lots of other features

Here is a working spider that extracts everything you asked for (it took 15 minutes on my rather old laptop):

import datetime
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class BillBoardItem(Item):
    date = Field()
    song = Field()
    artist = Field()


BASE_URL = "http://www.billboard.com/charts/%s/hot-100"


class BillBoardSpider(BaseSpider):
    name = "billboard_spider"
    allowed_domains = ["billboard.com"]

    def __init__(self):
        # generate one chart URL per week, from 1958-08-09 up to the end of 2012
        date = datetime.date(year=1958, month=8, day=9)

        self.start_urls = []
        while True:
            if date.year >= 2013:
                break

            self.start_urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
            date += datetime.timedelta(days=7)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        date = hxs.select('//span[@class="chart_date"]/text()').extract()[0]

        # each chart entry is an <article> inside the chart listing
        songs = hxs.select('//div[@class="listing chart_listing"]/article')
        for song in songs:
            item = BillBoardItem()
            item['date'] = date
            try:
                item['song'] = song.select('.//header/h1/text()').extract()[0]
                item['artist'] = song.select('.//header/p[@class="chart_info"]/a/text()').extract()[0]
            except IndexError:
                # skip entries that are missing a song title or an artist link
                continue

            yield item

Save it into billboard.py and run it via scrapy runspider billboard.py -o output.json. Then, in output.json you will see:

... 
{"date": "September 20, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"} 
{"date": "September 20, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"} 
{"date": "September 20, 1958", "artist": "The Elegants", "song": "Little Star"} 
{"date": "September 20, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"} 
{"date": "September 20, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"} 
{"date": "September 20, 1958", "artist": "Poni-Tails", "song": "Born Too Late"} 
{"date": "September 20, 1958", "artist": "The Olympics", "song": "Western Movies"} 
{"date": "September 20, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"} 
{"date": "September 20, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"} 
{"date": "September 27, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"} 
{"date": "September 27, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"} 
{"date": "September 27, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"} 
{"date": "September 27, 1958", "artist": "The Elegants", "song": "Little Star"} 
{"date": "September 27, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"} 
{"date": "September 27, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"} 
{"date": "September 27, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"} 
... 

Also, take a look at grequests as an alternative tool.
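For reference, a minimal sketch of what the grequests approach could look like (the URL list is only illustrative; grequests sends the requests concurrently on top of gevent):

import grequests

urls = ['http://www.billboard.com/charts/1958-08-09/hot-100',
        'http://www.billboard.com/charts/1958-08-16/hot-100']

# build unsent requests, then fire them concurrently (at most 10 at a time)
pending = (grequests.get(u) for u in urls)
for response in grequests.map(pending, size=10):
    print response.url, response.status_code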

Hope that helps.


@Ben321 consider accepting the answer if it deserves it, thanks. – alecxe


Your bottleneck is almost certainly fetching the data from the website. Every network request has latency, which blocks everything else from happening in the meantime. You should look into splitting your requests across multiple threads so that several requests can be in flight at once. Basically, your performance is I/O-bound, not CPU-bound.

Here is a simple solution built from scratch, so that you can see how a crawler typically works. In the long run, using something like Scrapy is probably best, but I find it always helps to start with something simple and explicit.

import threading 
import Queue 
import time 
import datetime 
import urllib2 
import re 

class Crawler(threading.Thread):
    def __init__(self, thread_id, queue):
        threading.Thread.__init__(self)
        self.thread_id = thread_id
        self.queue = queue

        # let's use threading events to tell the thread when to exit
        self.stop_request = threading.Event()

    # this is the function which will run when the thread is started
    def run(self):
        print 'Hello from thread %d! Starting crawling...' % self.thread_id

        while not self.stop_request.isSet():
            # main crawl loop

            try:
                # attempt to get a url target from the queue
                url = self.queue.get_nowait()
            except Queue.Empty:
                # if there's nothing on the queue, sleep and continue
                time.sleep(0.01)
                continue

            # we got a url, so let's scrape it!
            response = urllib2.urlopen(url)  # might want to consider adding a timeout here
            htmlText = response.read()

            # scraping with regex blows.
            # consider using xpath after parsing the html using the lxml.html module
            song = re.findall('<h1>.*</h1>', htmlText)[0]
            song = song[4:-5]
            artist = re.findall('/artist.*</a>', htmlText)[1]
            artist = re.findall('>.*<', artist)[0]
            artist = artist[1:-1]

            print 'thread %d found artist: %s' % (self.thread_id, artist)

    # we're overriding the default join function for the thread so
    # that we can make sure it stops
    def join(self, timeout=None):
        self.stop_request.set()
        super(Crawler, self).join(timeout)

if __name__ == '__main__':
    # how many threads do you want? more is faster, but too many
    # might get your IP blocked or even bring down the site (DoS attack)
    n_threads = 10

    # use a standard queue object (thread-safe) for communication
    queue = Queue.Queue()

    # create our threads
    threads = []
    for i in range(n_threads):
        threads.append(Crawler(i, queue))

    # generate the urls and fill the queue
    url_template = 'http://www.billboard.com/charts/%s/hot-100'
    start_date = datetime.datetime(year=1958, month=8, day=9)
    end_date = datetime.datetime(year=1959, month=9, day=5)
    delta = datetime.timedelta(weeks=1)

    week = 0
    date = start_date + delta*week
    while date <= end_date:
        url = url_template % date.strftime('%Y-%m-%d')
        queue.put(url)
        week += 1
        date = start_date + delta*week

    # start crawling!
    for t in threads:
        t.start()

    # wait until the queue is empty
    while not queue.empty():
        time.sleep(0.01)

    # kill the threads
    for t in threads:
        t.join()

It might be worth explaining the difference between using concurrency to improve CPU performance (parallelism) and using concurrency to improve data throughput or responsiveness (the kind of concurrency you are doing here) more thoroughly, so the OP gets a deeper understanding of why this works. – Wes


Very helpful and well explained, Brendan, thanks! –


Excellent answer @BrendanWood. Concurrency with a queue is definitely the way to do it. With 50 concurrent threads (the limit when testing on my home computer/network), it took about 10 minutes. Awesome! – w00tw00t111


Option 1: use threads to make "simultaneous" requests to the server.

Option 2: distribute the work across multiple machines; the best solution for that is to use Storm.


What about async IO? – dpn
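For Python 2 at the time, "async IO" usually meant gevent or Twisted. A minimal sketch of the gevent style, under the assumption that gevent is installed (the URLs are only illustrative):

from gevent import monkey
monkey.patch_all()  # patch the socket module so blocking urllib2 calls yield to other greenlets

import gevent
import urllib2

def fetch(url):
    # each fetch runs in its own greenlet; the network waits overlap
    return urllib2.urlopen(url).read()

urls = ['http://www.billboard.com/charts/1958-08-09/hot-100',
        'http://www.billboard.com/charts/1958-08-16/hot-100']

jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs)
pages = [job.value for job in jobs]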