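Scraping a web page in Python
I am completely new to web scraping, but I really want to learn how to do it in Python. I have a basic understanding of Python.
I am having trouble understanding some code that scrapes a web page, because I cannot find good documentation for the modules it uses.
The code scrapes data about some movies from this web page.
I get stuck after the comment "selection in pattern follows the rules of CSS".
I would like to understand the logic behind that code, or find good documentation for the modules involved. Are there any topics I need to learn first?
The code is as follows: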
import requests
from pattern import web
from BeautifulSoup import BeautifulSoup
url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2012'
r = requests.get(url)
print r.url
url = 'http://www.imdb.com/search/title'
params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1950,2012')
r = requests.get(url, params=params)
print r.url # notice it constructs the full url for you
#selection in pattern follows the rules of CSS
dom = web.Element(r.text)
for movie in dom.by_tag('td.title'):
    title = movie.by_tag('a')[0].content
    genres = movie.by_tag('span.genre')[0].by_tag('a')
    genres = [g.content for g in genres]
    runtime = movie.by_tag('span.runtime')[0].content
    rating = movie.by_tag('span.value')[0].content
    print title, genres, runtime, rating
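
To make the step I am stuck on concrete: as far as I can tell from the comment, a string such as 'td.title' is read the way a CSS selector is, meaning "every td element whose class attribute contains title". The sketch below is only my attempt to reproduce that idea with the Python 3 standard library, not how pattern itself is implemented. The names select_text, SelectorCollector and SAMPLE_HTML are hypothetical, and the HTML snippet is made up for illustration rather than copied from the real page.

```python
from html.parser import HTMLParser

# Made-up HTML that only mimics the shape of a results table (an assumption).
SAMPLE_HTML = '''
<table>
  <tr><td class="number">1.</td><td class="title"><a href="/title/a">First Movie</a> <span class="genre"><a>Drama</a> | <a>Crime</a></span></td></tr>
  <tr><td class="title"><a href="/title/b">Second Movie</a></td></tr>
</table>
'''


class SelectorCollector(HTMLParser):
    """Collect the text inside every element matching a 'tag' or 'tag.class' selector."""

    def __init__(self, selector):
        super().__init__()
        # 'td.title' -> tag 'td', class 'title'; plain 'td' -> class ''
        self.tag, _, self.css_class = selector.partition('.')
        self.depth = 0     # greater than 0 while inside a matching element
        self.matches = []  # accumulated text of each matching element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already inside a match: count nested same-name tags so the
            # matching element is closed by the right end tag.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get('class') or '').split()
        if tag == self.tag and (not self.css_class or self.css_class in classes):
            self.depth = 1
            self.matches.append('')

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1] += data


def select_text(html, selector):
    """Return the whitespace-normalized text of each element matching selector."""
    parser = SelectorCollector(selector)
    parser.feed(html)
    parser.close()
    return [' '.join(text.split()) for text in parser.matches]


print(select_text(SAMPLE_HTML, 'td.title'))    # cells with class "title" only
print(select_text(SAMPLE_HTML, 'span.genre'))  # the genre span inside a title cell
print(select_text(SAMPLE_HTML, 'td'))          # every td, regardless of class
```

If this reading is right, then in the loop above dom.by_tag('td.title') yields one element per movie, and each later by_tag call searches only inside that element, which is why [0] picks the first link or span within the current movie's cell.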