
Why can't my Scrapy spider scrape anything? Since I'm new to Scrapy, I don't know where the problem is; it may well be something very easy to fix. I hope to find a solution. Thanks in advance.

I'm using Ubuntu 14.04 and Python 3.4.

My spider:

```python
class EnActressSpider(scrapy.Spider): 
    name = "en_name" 
    allowed_domains = ["www.r18.com/", "r18.com/"] 
    start_urls = ["http://www.r18.com/videos/vod/movies/actress/letter=a/sort=popular/page=1",] 


def parse(self, response): 
    for sel in response.xpath('//*[@id="contents"]/div[2]/section/div[3]/ul/li'): 
     item = En_Actress() 
     item['image_urls'] = sel.xpath('a/p/img/@src').extract() 
     name_link = sel.xpath('a/@href').extract() 
     request = scrapy.Request(name_link, callback = self.parse_item, dont_filter=True) 
     request.meta['item'] = item 
     yield request 

    next_page = response.css("#contents > div.main > section > div.cmn-sec-item01.pb00 > div > ol > li.next > a::attr('href')") 
    if next_page: 
     url = response.urljoin(next_page[0].extract()) 
     yield scrapy.Request(url, self.parse, dont_filter=True) 



def parse_item(self, response): 
    item = reponse.meta['item'] 
    name = response.xpath('//*[@id="contents"]/div[1]/ul/li[5]/span/text()') 
    item['name'] = name[0].encode('utf-8') 
    yield item 

```

LOG:

```
{'downloader/request_bytes': 988, 
'downloader/request_count': 2, 
'downloader/request_method_count/GET': 2, 
'downloader/response_bytes': 48547, 
'downloader/response_count': 2, 
'downloader/response_status_count/200': 1, 
'downloader/response_status_count/301': 1, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 7, 25, 6, 46, 36, 940936), 
'log_count/DEBUG': 1, 
'log_count/INFO': 1, 
'response_received_count': 1, 
'scheduler/dequeued': 2, 
'scheduler/dequeued/memory': 2, 
'scheduler/enqueued': 2, 
'scheduler/enqueued/memory': 2, 
'spider_exceptions/TypeError': 1, 
'start_time': datetime.datetime(2016, 7, 25, 6, 46, 35, 908281)} 

```
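The telling line in these stats is `spider_exceptions/TypeError: 1`. In the spider above, `sel.xpath('a/@href').extract()` returns a *list* of strings, but `scrapy.Request` expects its `url` argument to be a single string, so building the request raises a `TypeError` and the callback never runs. A minimal sketch of the mismatch (the URL here is a placeholder, not taken from the site):

```python
import scrapy

# What sel.xpath('a/@href').extract() returns: a list, not a string.
links = ['http://www.r18.com/example-detail-page/']  # hypothetical URL

# scrapy.Request(links)           # raises TypeError: Request url must be str
request = scrapy.Request(links[0])  # a single string is accepted
# .extract_first() returns that first string directly, avoiding the list entirely
```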

Any help is much appreciated.


Can you provide a link to the site you're scraping, or rather, what URL does the `parse()` method receive? Or just post the entire contents of the spider file. – Granitosaurus


[Link](http://www.r18.com/videos/vod/movies/actress/letter=a/sort=popular/page=1) Also, I've edited my question. Thank you, Granitosaurus. – Jin

Answer


There seem to be a few syntax errors. I've cleaned it up, and it appears to work fine here. Another edit I made was removing the `dont_filter` argument from the `Request` objects, since you don't want to crawl duplicates. I also adjusted `allowed_domains`, because it was filtering out some content. In the future, please post the whole log.

```python
import scrapy


class EnActressSpider(scrapy.Spider):
    name = "en_name"
    allowed_domains = ["r18.com"]
    start_urls = ["http://www.r18.com/videos/vod/movies/actress/letter=a/sort=popular/page=1"]

    def parse(self, response):
        for sel in response.xpath('//*[@id="contents"]/div[2]/section/div[3]/ul/li'):
            item = dict()
            item['image_urls'] = sel.xpath('a/p/img/@src').extract()
            name_link = sel.xpath('a/@href').extract_first()
            request = scrapy.Request(name_link, callback=self.parse_item)
            request.meta['item'] = item
            yield request

        next_page = response.css(
            "#contents > div.main > section > div.cmn-sec-item01.pb00 > "
            "div > ol > li.next > a::attr('href')").extract_first()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url, self.parse)

    def parse_item(self, response):
        item = response.meta['item']
        name = response.xpath('//*[@id="contents"]/div[1]/ul/li[5]/span/text()').extract_first()
        item['name'] = name.encode('utf-8')
        yield item
```
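Two follow-up notes on the fixes above. `allowed_domains` should contain bare domain names: Scrapy's OffsiteMiddleware matches each entry against the request hostname, so a value with a trailing slash such as `"www.r18.com/"` never matches and followed links get dropped as offsite. And since the asker is on Python 3, `name.encode('utf-8')` stores `bytes` in the item; assigning the string as-is is usually what you want. A sketch of both points (not part of the original answer):

```python
# Domains only -- no scheme, no path, no trailing slash.
allowed_domains = ["r18.com"]   # matches r18.com and any subdomain, e.g. www.r18.com

# These entries never match a hostname, so followed links are filtered as offsite
# (visible as "Filtered offsite request" debug messages in the log):
# allowed_domains = ["www.r18.com/", "r18.com/"]

# In parse_item(), on Python 3, prefer storing text rather than bytes:
# item['name'] = name           # str, instead of name.encode('utf-8')
```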