Here's a solution using Scrapy. Take a look at the overview and you'll see it's a tool designed for exactly this kind of task:
- it's fast (based on Twisted)
- easy to use and understand
- has a built-in XPath-based extraction mechanism (though you can use bs (BeautifulSoup) or lxml too)
- has built-in support for pipelining the extracted items to a database, XML, JSON, whatever
- and many more features
Here's a working spider that extracts everything you asked for (15 minutes of work for me, on an old laptop):
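As a quick illustration of the XPath idea outside Scrapy, the stdlib's xml.etree supports a limited XPath subset; this is just a sketch against a tiny made-up snippet, not Scrapy's selector API:

```python
import xml.etree.ElementTree as ET

# a tiny stand-in for one chart entry, shaped like the markup the spider targets
snippet = """
<article>
  <header>
    <h1>Bird Dog</h1>
    <p class="chart_info"><a>The Everly Brothers</a></p>
  </header>
</article>
"""

root = ET.fromstring(snippet)
song = root.find('.//header/h1').text
artist = root.find('.//p[@class="chart_info"]/a').text
print(song, '-', artist)  # Bird Dog - The Everly Brothers
```

Scrapy's selectors do the same kind of querying, but against real (often malformed) HTML responses.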
import datetime

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class BillBoardItem(Item):
    date = Field()
    song = Field()
    artist = Field()


BASE_URL = "http://www.billboard.com/charts/%s/hot-100"


class BillBoardSpider(BaseSpider):
    name = "billboard_spider"
    allowed_domains = ["billboard.com"]

    def __init__(self):
        # one URL per weekly chart, from the first Hot 100 through the end of 2012
        date = datetime.date(year=1958, month=8, day=9)
        self.start_urls = []
        while date.year < 2013:
            self.start_urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
            date += datetime.timedelta(days=7)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        date = hxs.select('//span[@class="chart_date"]/text()').extract()[0]
        songs = hxs.select('//div[@class="listing chart_listing"]/article')
        for song in songs:
            item = BillBoardItem()
            item['date'] = date
            try:
                item['song'] = song.select('.//header/h1/text()').extract()[0]
                item['artist'] = song.select('.//header/p[@class="chart_info"]/a/text()').extract()[0]
            except IndexError:  # skip entries missing a title or artist
                continue
            yield item
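The `__init__` above just precomputes one URL per weekly chart; the same loop works on its own outside Scrapy:

```python
import datetime

BASE_URL = "http://www.billboard.com/charts/%s/hot-100"

# one URL per week, from the first Hot 100 chart through the end of 2012
start_urls = []
date = datetime.date(1958, 8, 9)
while date.year < 2013:
    start_urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
    date += datetime.timedelta(days=7)

print(start_urls[0])  # http://www.billboard.com/charts/1958-08-09/hot-100
```

Scrapy then schedules and downloads all of these URLs concurrently for you.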
Save it as billboard.py and run it via scrapy runspider billboard.py -o output.json. Then, in output.json, you'll see:
...
{"date": "September 20, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"}
{"date": "September 20, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"}
{"date": "September 20, 1958", "artist": "The Elegants", "song": "Little Star"}
{"date": "September 20, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"}
{"date": "September 20, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"}
{"date": "September 20, 1958", "artist": "Poni-Tails", "song": "Born Too Late"}
{"date": "September 20, 1958", "artist": "The Olympics", "song": "Western Movies"}
{"date": "September 20, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"}
{"date": "September 20, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"}
{"date": "September 27, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"}
{"date": "September 27, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"}
{"date": "September 27, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"}
{"date": "September 27, 1958", "artist": "The Elegants", "song": "Little Star"}
{"date": "September 27, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"}
{"date": "September 27, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"}
{"date": "September 27, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"}
...
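Each line of that output is a standalone JSON object, so it can be loaded with the stdlib json module; a minimal sketch, using an inline sample in place of the real output.json:

```python
import json

# two sample lines in the same shape as the scraped output above
sample = '''{"date": "September 20, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"}
{"date": "September 20, 1958", "artist": "The Elegants", "song": "Little Star"}'''

# parse one JSON object per non-empty line
records = [json.loads(line) for line in sample.splitlines() if line.strip()]
print(records[0]['song'])  # Bird Dog
```

For a real file you'd iterate over open('output.json') the same way (note: depending on your Scrapy version, -o may write a single JSON array instead of one object per line, in which case a single json.load call is what you want).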
Also, take a look at grequests as an alternative tool.
Hope that helps.
[Scrapy](https://scrapy.readthedocs.org/) would perform much better; it's the right tool for the job, for sure. Let me know if you'd like one and I'll write you a sample spider. – alecxe
Improvements would include not using urllib2, not using regular expressions to parse HTML, and using multiple threads for your I/O. – roippi
I sincerely doubt 'urllib2' has anything to do with any efficiency problems. All it does is send the request and pull down the response; 99.99% of that time is network time, and there's no other way to improve it. The real issues are that (a) your parsing code may be slow, (b) you may be doing lots of duplicate or unnecessary downloads, (c) you may need to download in parallel (which you can do with 'urllib2'), (d) you may need a faster network connection, or (e) billboard.com is throttling you. – abarnert
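To illustrate point (c), parallel downloads can be done with a plain thread pool; a sketch using a stand-in fetch function in place of real urllib2/network calls:

```python
from concurrent.futures import ThreadPoolExecutor

URLS = [
    "http://www.billboard.com/charts/1958-08-09/hot-100",
    "http://www.billboard.com/charts/1958-08-16/hot-100",
    "http://www.billboard.com/charts/1958-08-23/hot-100",
]

def fetch(url):
    # stand-in for a real download (e.g. urllib.request.urlopen(url).read());
    # the point is only that each call can run on its own thread
    return "page for %s" % url

with ThreadPoolExecutor(max_workers=3) as pool:
    # map preserves input order even though calls run concurrently
    pages = list(pool.map(fetch, URLS))

print(len(pages))  # 3
```

Since downloading is I/O-bound, threads overlap the network waits; Scrapy achieves the same overlap with Twisted's event loop instead of threads.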