I have a simple piece of Scrapy code in which the spider yields request objects:

def start_requests(self): 
    # build the request and carry extra data along in its meta dict
    request = scrapy.Request(url, callback=self.parse_response) 
    request.meta['some_useful_params'] = some_useful_params 
    yield request 

def parse_response(self, response): 
    some_useful_params = response.meta['some_useful_params'] 
    do_parsing_stuff() 
    if some_condition: 
        # follow up with another request, passing the same meta data on
        next_request = scrapy.Request(otherurl, callback=self.parse_response) 
        next_request.meta['some_useful_params'] = some_useful_params 
        yield next_request 
    else: 
        yield items 

The above program works fine for me, but I need to change it so that it first checks whether the HTML for a page already exists (a cached copy); if it does, it should use that HTML instead of sending a request to the website.

Here is the code now:

def start_requests(self): 
    if html_exist: 
        # build a Response from the cached HTML instead of downloading the page
        request = scrapy.Request(url) 
        request.meta['some_useful_params'] = some_useful_params 
        response = scrapy.http.Response(url, body=cached_html, request=request) 
        # the line below does NOT run parse_response
        self.parse_response(response) 
    else: 
        request = scrapy.Request(url, callback=self.parse_response) 
        request.meta['some_useful_params'] = some_useful_params 
        yield request 

def parse_response(self, response): 
    some_useful_params = response.meta['some_useful_params'] 
    do_parsing_stuff() 
    if some_condition: 
        if html_exist: 
            request = scrapy.Request(otherurl) 
            request.meta['some_useful_params'] = some_useful_params 
            presponse = scrapy.http.Response(otherurl, body=cached_html, request=request) 
            # the line below does NOT run parse_response
            self.parse_response(presponse) 
        else: 
            presponse = scrapy.Request(otherurl, callback=self.parse_response) 
            presponse.meta['some_useful_params'] = some_useful_params 
            yield presponse 
    else: 
        yield items 

The problem I'm facing with this second version is that when the HTML already exists, the call to the parse_response method never happens.

I don't fully understand the reason, but I think it has something to do with Python generators. How can I fix this?

Answer

You have to yield the items/requests that the method produces, not just call it:

for item_or_request in self.parse_response(response): 
    yield item_or_request 
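
Applied to the start_requests above, the fix looks roughly like this (a minimal sketch reusing the question's names html_exist, cached_html, url and some_useful_params):

def start_requests(self): 
    if html_exist: 
        request = scrapy.Request(url) 
        request.meta['some_useful_params'] = some_useful_params 
        response = scrapy.http.Response(url, body=cached_html, request=request) 
        # parse_response is a generator function: calling it only creates a
        # generator object, so its body runs only when something iterates it;
        # re-yield everything it produces so Scrapy receives the results
        for item_or_request in self.parse_response(response): 
            yield item_or_request 
    else: 
        request = scrapy.Request(url, callback=self.parse_response) 
        request.meta['some_useful_params'] = some_useful_params 
        yield request 

The same pattern applies to the cached branch inside parse_response.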

Is there any way I can attach the HTML (a cached copy) to the request object instead of making a request to the website? – sagar


You can use ['HttpCacheMiddleware'](https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpcache) – eLRuLL
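
For completeness, Scrapy's built-in HTTP cache is switched on through project settings rather than spider code. A minimal sketch of the relevant settings.py entries (the values shown are assumptions to tune for your project):

# settings.py -- enable Scrapy's built-in HTTP cache
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'             # cache lives under the project's .scrapy/ directory
HTTPCACHE_EXPIRATION_SECS = 0           # 0 means cached responses never expire
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

With this enabled, the downloader middleware serves previously fetched pages from disk, so the spider does not need its own html_exist check.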