Newbie here. I wrote a simple script using urllib2 that crawls Billboard.com and scrapes the number-one song and artist for every week from 1958 to 2013. The problem is that it is very slow - it takes hours to finish.

I would like to know where the bottleneck is, and whether there is a way to use urllib2 more efficiently or whether I need a more sophisticated tool.

import re 
import urllib2 
array = [] 
url = 'http://www.billboard.com/charts/1958-08-09/hot-100' 
date = "" 
while date != '2013-07-13': 
    response = urllib2.urlopen(url) 
    htmlText = response.read() 
    date = re.findall('\d\d\d\d-\d\d-\d\d',url)[0] 
    song = re.findall('<h1>.*</h1>', htmlText)[0] 
    song = song[4:-5] 
    artist = re.findall('/artist.*</a>', htmlText)[1] 
    artist = re.findall('>.*<', artist)[0] 
    artist = artist[1:-1] 
    nextWeek = re.findall('href.*>Next', htmlText)[0] 
    nextWeek = nextWeek[5:-5] 
    array.append([date, song, artist]) 
    url = 'http://www.billboard.com' + nextWeek 
print array 

[Scrapy](https://scrapy.readthedocs.org/) would do this much better; it is the tool for the job, of course. Let me know if you can use it - I will write you a sample spider. – alecxe


Improvements would include not using urllib2, not using regular expressions to parse HTML, and using multiple threads to perform your I/O. – roippi
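To make the "don't parse HTML with regex" point concrete, here is a minimal sketch using lxml.html on a single chart page. The XPath expressions are assumptions about the page markup of the time (mirroring the selectors used in the Scrapy answer below), not something taken from the original post:

import urllib2
import lxml.html

# fetch one chart page and parse it into an element tree
html = urllib2.urlopen('http://www.billboard.com/charts/1958-08-09/hot-100').read()
doc = lxml.html.fromstring(html)

# XPath queries replace the brittle regexes; these expressions are assumptions
# about the chart page structure
date = doc.xpath('//span[@class="chart_date"]/text()')[0].strip()
song = doc.xpath('//div[@class="listing chart_listing"]//h1/text()')[0].strip()
artist = doc.xpath('//div[@class="listing chart_listing"]//p[@class="chart_info"]/a/text()')[0].strip()
print date, song, artist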


I sincerely doubt 'urllib2' has anything to do with any efficiency problem. All it does is send the request and pull down the response; 99.99% of that time is network time, and there is no other way to improve it. The problem is that (a) your parsing code may be slow, (b) you may be doing a lot of duplicate or unnecessary downloads, (c) you need to parallelize the downloads (which you can do with 'urllib2'), (d) you need a faster network connection, or (e) billboard.com is throttling you. – abarnert
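For point (c), a minimal sketch of parallelizing the downloads while still using urllib2, via a thread pool (multiprocessing.dummy provides a thread-backed Pool with the same API as multiprocessing.Pool; the two URLs here are only illustrative):

import urllib2
from multiprocessing.dummy import Pool  # thread pool, same API as multiprocessing.Pool

def fetch(url):
    # each call runs in a worker thread, so the network waits overlap
    return url, urllib2.urlopen(url).read()

urls = ['http://www.billboard.com/charts/1958-08-09/hot-100',
        'http://www.billboard.com/charts/1958-08-16/hot-100']

pool = Pool(8)  # up to 8 downloads in flight at once
pages = pool.map(fetch, urls)
pool.close()
pool.join()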

Answers


Here is a solution using Scrapy. Take a look at the overview and you will see that it is the tool designed for exactly this kind of task:

  • it is fast (based on Twisted)
  • easy to use and understand
  • built-in extraction mechanism based on XPath (you can use BeautifulSoup or lxml too, though)
  • built-in support for pipelining the extracted items into a database, XML, JSON, whatever
  • and lots of other features

Here is a working spider that extracts everything you asked for (it took 15 minutes on my rather old laptop):

import datetime
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class BillBoardItem(Item):
    date = Field()
    song = Field()
    artist = Field()


BASE_URL = "http://www.billboard.com/charts/%s/hot-100"


class BillBoardSpider(BaseSpider):
    name = "billboard_spider"
    allowed_domains = ["billboard.com"]

    def __init__(self):
        # generate one chart URL per week, from 1958-08-09 up to the end of 2012
        date = datetime.date(year=1958, month=8, day=9)

        self.start_urls = []
        while True:
            if date.year >= 2013:
                break

            self.start_urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
            date += datetime.timedelta(days=7)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        date = hxs.select('//span[@class="chart_date"]/text()').extract()[0]

        # each chart entry is an <article> inside the chart listing
        songs = hxs.select('//div[@class="listing chart_listing"]/article')
        for song in songs:
            item = BillBoardItem()
            item['date'] = date
            try:
                item['song'] = song.select('.//header/h1/text()').extract()[0]
                item['artist'] = song.select('.//header/p[@class="chart_info"]/a/text()').extract()[0]
            except IndexError:
                # skip entries that are missing a song title or an artist link
                continue

            yield item

Save it into billboard.py and run it via scrapy runspider billboard.py -o output.json. Then, in output.json you will see:

... 
{"date": "September 20, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"} 
{"date": "September 20, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"} 
{"date": "September 20, 1958", "artist": "The Elegants", "song": "Little Star"} 
{"date": "September 20, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"} 
{"date": "September 20, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"} 
{"date": "September 20, 1958", "artist": "Poni-Tails", "song": "Born Too Late"} 
{"date": "September 20, 1958", "artist": "The Olympics", "song": "Western Movies"} 
{"date": "September 20, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"} 
{"date": "September 20, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"} 
{"date": "September 27, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"} 
{"date": "September 27, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"} 
{"date": "September 27, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"} 
{"date": "September 27, 1958", "artist": "The Elegants", "song": "Little Star"} 
{"date": "September 27, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"} 
{"date": "September 27, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"} 
{"date": "September 27, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"} 
... 

Also, take a look at grequests as an alternative tool.
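For reference, a minimal sketch of what the grequests approach could look like (the URL list is only illustrative; grequests sends the requests concurrently on top of gevent):

import grequests

urls = ['http://www.billboard.com/charts/1958-08-09/hot-100',
        'http://www.billboard.com/charts/1958-08-16/hot-100']

# build unsent requests, then fire them concurrently (at most 10 at a time)
pending = (grequests.get(u) for u in urls)
for response in grequests.map(pending, size=10):
    print response.url, response.status_code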

Hope that helps.


@Ben321 consider accepting the answer if it deserves it, thanks. – alecxe


Your bottleneck is almost certainly fetching the data from the website. Every network request has latency, which blocks everything else from happening in the meantime. You should look into splitting your requests across multiple threads so that several requests can be in flight at once. Basically, your performance is I/O-bound, not CPU-bound.

Here is a simple solution built from scratch, so that you can see how a crawler typically works. In the long run, using something like Scrapy is probably best, but I find it always helps to start with something simple and explicit.

import threading 
import Queue 
import time 
import datetime 
import urllib2 
import re 

class Crawler(threading.Thread):
    def __init__(self, thread_id, queue):
        threading.Thread.__init__(self)
        self.thread_id = thread_id
        self.queue = queue

        # let's use threading events to tell the thread when to exit
        self.stop_request = threading.Event()

    # this is the function which will run when the thread is started
    def run(self):
        print 'Hello from thread %d! Starting crawling...' % self.thread_id

        while not self.stop_request.isSet():
            # main crawl loop

            try:
                # attempt to get a url target from the queue
                url = self.queue.get_nowait()
            except Queue.Empty:
                # if there's nothing on the queue, sleep and continue
                time.sleep(0.01)
                continue

            # we got a url, so let's scrape it!
            response = urllib2.urlopen(url)  # might want to consider adding a timeout here
            htmlText = response.read()

            # scraping with regex blows.
            # consider using xpath after parsing the html using the lxml.html module
            song = re.findall('<h1>.*</h1>', htmlText)[0]
            song = song[4:-5]
            artist = re.findall('/artist.*</a>', htmlText)[1]
            artist = re.findall('>.*<', artist)[0]
            artist = artist[1:-1]

            print 'thread %d found artist: %s' % (self.thread_id, artist)

    # we're overriding the default join function for the thread so
    # that we can make sure it stops
    def join(self, timeout=None):
        self.stop_request.set()
        super(Crawler, self).join(timeout)

if __name__ == '__main__':
    # how many threads do you want? more is faster, but too many
    # might get your IP blocked or even bring down the site (DoS attack)
    n_threads = 10

    # use a standard queue object (thread-safe) for communication
    queue = Queue.Queue()

    # create our threads
    threads = []
    for i in range(n_threads):
        threads.append(Crawler(i, queue))

    # generate the urls and fill the queue
    url_template = 'http://www.billboard.com/charts/%s/hot-100'
    start_date = datetime.datetime(year=1958, month=8, day=9)
    end_date = datetime.datetime(year=1959, month=9, day=5)
    delta = datetime.timedelta(weeks=1)

    week = 0
    date = start_date + delta*week
    while date <= end_date:
        url = url_template % date.strftime('%Y-%m-%d')
        queue.put(url)
        week += 1
        date = start_date + delta*week

    # start crawling!
    for t in threads:
        t.start()

    # wait until the queue is empty
    while not queue.empty():
        time.sleep(0.01)

    # kill the threads
    for t in threads:
        t.join()

It might be worth explaining the difference between using concurrency to improve CPU performance (parallelism) and using concurrency to improve data throughput or responsiveness (the kind of concurrency you are doing here) more thoroughly, so the OP gets a deeper understanding of why this works. – Wes


Very helpful and well explained, Brendan, thanks! –


Excellent answer @BrendanWood. Concurrency with a queue is definitely the way to do it. With 50 concurrent threads (the limit when testing on my home computer/network), it took about 10 minutes. Awesome! – w00tw00t111


Option 1: use threads to make "simultaneous" requests to the server.

Option 2: distribute the work across multiple machines; the best solution for that is to use Storm.


What about async IO? – dpn
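For Python 2 at the time, "async IO" usually meant gevent or Twisted. A minimal sketch of the gevent style, under the assumption that gevent is installed (the URLs are only illustrative):

from gevent import monkey
monkey.patch_all()  # patch the socket module so blocking urllib2 calls yield to other greenlets

import gevent
import urllib2

def fetch(url):
    # each fetch runs in its own greenlet; the network waits overlap
    return urllib2.urlopen(url).read()

urls = ['http://www.billboard.com/charts/1958-08-09/hot-100',
        'http://www.billboard.com/charts/1958-08-16/hot-100']

jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs)
pages = [job.value for job in jobs]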