Scraping web pages in Python

I'm completely new to web scraping, but I really want to learn how to do it in Python. I have a basic understanding of Python.

I can't understand the code below, which scrapes a web page, because I can't find good documentation for the modules it uses.

The code scrapes data about some movies from this web page.

I get stuck after the comment "selection in pattern follows the rules of CSS".

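I'd like to understand the logic behind the code, or find good documentation for these modules. Is there any topic I need to learn first?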

The code is as follows:

import requests 
from pattern import web 
from BeautifulSoup import BeautifulSoup 

url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2012' 
r = requests.get(url) 
print r.url 

url = 'http://www.imdb.com/search/title' 
params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1950,2012') 
r = requests.get(url, params=params) 
print r.url # notice it constructs the full url for you 

#selection in pattern follows the rules of CSS 

dom = web.Element(r.text) 
for movie in dom.by_tag('td.title'):  
    title = movie.by_tag('a')[0].content 
    genres = movie.by_tag('span.genre')[0].by_tag('a') 
    genres = [g.content for g in genres] 
    runtime = movie.by_tag('span.runtime')[0].content 
    rating = movie.by_tag('span.value')[0].content 
    print title, genres, runtime, rating

2014-01-12 CreamStat

Here is the documentation for BeautifulSoup, which is an HTML and XML parser.

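The comment

selection in pattern follows the rules of CSS

means that strings such as 'td.title' and 'span.runtime' are CSS selectors that help find the data you are looking for. For example, td.title searches for <td> elements that have the attribute class="title".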
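To make the selector idea concrete, here is a minimal sketch of how a string like 'td.title' breaks down into a tag name and a class name. The helper names `parse_simple_selector` and `matches` are made up for this illustration and are not part of pattern or BeautifulSoup. It covers only the 'tag' and 'tag.class' forms that appear in the question, not full CSS.

```python
def parse_simple_selector(selector):
    """Split a selector such as 'td.title' into (tag, class_name).

    Hypothetical helper: handles only the 'tag' and 'tag.class' forms.
    """
    tag, _, class_name = selector.partition('.')
    return tag, class_name or None


def matches(tag, attrs, selector):
    """Return True if an element with this tag and attribute dict matches."""
    wanted_tag, wanted_class = parse_simple_selector(selector)
    if tag.lower() != wanted_tag.lower():
        return False
    if wanted_class is None:
        return True
    # The class attribute may hold several space-separated class names.
    return wanted_class in attrs.get('class', '').split()


print(parse_simple_selector('td.title'))                      # ('td', 'title')
print(matches('td', {'class': 'title'}, 'td.title'))          # True
print(matches('td', {'class': 'number'}, 'td.title'))         # False
print(matches('span', {'class': 'runtime'}, 'span.runtime'))  # True
```

So the part before the dot names the element and the part after it names the class. Learning the basics of HTML and CSS selectors first should make the rest of the code much easier to follow.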

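The code loops through the HTML elements in the body of the web page and uses CSS selectors to extract each movie's title, genres, runtime, and rating.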
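The same loop can be sketched with only the standard library on a small, made-up snippet. Everything here is an assumption for illustration: the `SAMPLE` markup and its values are invented to mimic the structure the question's code expects, `extract_movies` is a hypothetical name, and `xml.etree.ElementTree` with XPath-style paths stands in for pattern's CSS selection. Real pages are rarely well-formed XML, so an HTML parser such as BeautifulSoup is the better tool in practice.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup that mimics the structure the question's code expects.
SAMPLE = """
<table>
  <tr>
    <td class="number">1.</td>
    <td class="title">
      <a href="/title/example1/">Example Movie</a>
      <span class="value">8.5</span>
      <span class="genre"><a>Drama</a> | <a>Crime</a></span>
      <span class="runtime">120 mins.</span>
    </td>
  </tr>
</table>
"""


def extract_movies(markup):
    """Return a list of (title, genres, runtime, rating) tuples."""
    root = ET.fromstring(markup)
    movies = []
    # ".//td[@class='title']" plays the role of the CSS selector 'td.title'.
    # Unlike CSS, it matches only when the class attribute is exactly 'title'.
    for cell in root.findall(".//td[@class='title']"):
        title = cell.find(".//a").text  # first <a> inside the cell
        genres = [a.text for a in cell.findall(".//span[@class='genre']/a")]
        runtime = cell.find(".//span[@class='runtime']").text
        rating = cell.find(".//span[@class='value']").text
        movies.append((title, genres, runtime, rating))
    return movies


for movie in extract_movies(SAMPLE):
    print(movie)
```

Each step mirrors a line in the question's loop: find every title cell, take its first link as the title, collect the links inside the genre span, then read the runtime and rating spans.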

2014-01-12 04:17:00 haferje
