2015-06-03 35 views
0

Scraping URLs and titles from nested anchor tags

This is my first scraper written with Scrapy.

I want to scrape the video URLs and titles from the site https://www.google.co.in/trends/hotvideos#hvsm=0.

import scrapy
from scrapy.item import Item, Field


class CraigslistItem(Item):
    title = Field()
    link = Field()


class DmozSpider(scrapy.Spider):
    name = "google"
    allowed_domains = ["google.co.in"]
    start_urls = [
        "https://www.google.co.in/trends/hotvideos#hvsm=0"
    ]

    def parse(self, response):
        # Select each video container, then read the nested anchor's text and href.
        items = []
        for sel in response.xpath("//span[@class='single-video-image-container']"):
            item = CraigslistItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            items.append(item)
            print(items)

Any pointers on what I am doing wrong would be much appreciated.

+0

You can't get it this way, because the video list is loaded by a 'POST' request. Try Scrapy's [form-request](http://doc.scrapy.org/en/latest/topics/request-response.html#request-usage-examples) – Jithin

+0

@Jithin: Thanks, but I really don't follow. Could you please elaborate? – nlper

+1

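Although you are requesting this [url](https://www.google.co.in/trends/hotvideos#hvsm=0) to get the video list, internally an AJAX POST request is fired, and the video list shown on that page arrives in the response to that request. – Jithin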
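One detail that may help here: the `#hvsm=0` part of the start URL is a fragment. Fragments stay on the client and are not sent to the server, so the HTML returned for the initial GET cannot depend on it. The sketch below uses only the Python standard library to show what the server sees for the page URL, and what a form-encoded POST body could look like. The form field names are copied from the answer's `formdata`; whether the page's script really maps the fragment to the `hvsm` field is an assumption.

```python
from urllib.parse import urlsplit, urlencode

page_url = "https://www.google.co.in/trends/hotvideos#hvsm=0"
parts = urlsplit(page_url)

# The path is what the initial GET asks for; the fragment never leaves the client.
print(parts.path)      # /trends/hotvideos
print(parts.fragment)  # hvsm=0

# Form fields as used in the answer's FormRequest, encoded the way an
# application/x-www-form-urlencoded POST body is built.
formdata = {"hvd": "", "geo": "IN", "mob": "0", "hvsm": "0"}
post_body = urlencode(formdata)
print(post_body)       # hvd=&geo=IN&mob=0&hvsm=0
```

Watching the network tab of the browser's developer tools while the page loads is the usual way to find such a request and the fields it sends.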

Answers

2

You can do it with the help of Scrapy's FormRequest.

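import json

import scrapy
from scrapy.http import FormRequest
from scrapy.item import Item, Field


class CraigslistItem(Item):
    # Same item as in the question, repeated so this snippet stands alone.
    title = Field()
    link = Field()


class DmozSpider(scrapy.Spider):
    name = "google"
    allowed_domains = ["google.co.in"]
    start_urls = [
        "https://www.google.co.in/trends/hotvideos#hvsm=0"
    ]

    def parse(self, response):
        # The page fills its list through an AJAX POST, so send that POST directly.
        url = 'https://www.google.co.in/trends/hotvideos/hotItems'
        formdata = {'hvd': '', 'geo': 'IN', 'mob': '0', 'hvsm': '0'}
        yield FormRequest(url=url, formdata=formdata, callback=self.parse_data)

    def parse_data(self, response):
        # The endpoint answers with JSON; the videos are under 'videoList'.
        json_response = json.loads(response.text)
        videos = json_response.get('videoList') or []
        for video in videos:
            item = CraigslistItem()
            item['title'] = video.get('title')
            item['link'] = video.get('url')
            yield item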
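The parsing step in `parse_data` can be tried offline, without Scrapy or a network connection. This is only a sketch: the sample payload is invented for illustration, and only the `videoList`, `title`, and `url` keys come from the code above. The real response may carry more fields or a different shape. `extract_videos` is a hypothetical helper name, not part of Scrapy.

```python
import json

# Hypothetical payload shaped like the keys that parse_data reads.
sample_body = b'{"videoList": [{"title": "Example video", "url": "https://example.com/watch?v=1"}]}'


def extract_videos(body):
    """Return a list of {'title', 'link'} dicts from a JSON response body."""
    payload = json.loads(body)
    # Fall back to an empty list when the key is missing or null.
    videos = payload.get("videoList") or []
    return [{"title": v.get("title"), "link": v.get("url")} for v in videos]


rows = extract_videos(sample_body)
print(rows)
```

Keeping the JSON handling in a plain function like this makes it easy to check against a saved response before wiring it into the spider callback.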
+0

Thanks a lot, but I don't understand the code. Could you add an explanation? Also, why did the URL change to "https://www.google.co.in/trends/hotvideos/hotItems"? – nlper