Scrapy: yield a Response object
I have a simple piece of code in Scrapy:
    def start_requests(self):
        # url and some_useful_params are placeholders defined elsewhere
        request = scrapy.Request(url, callback=self.parse_response)
        request.meta['some_useful_params'] = some_useful_params
        yield request

    def parse_response(self, response):
        some_useful_params = response.meta['some_useful_params']
        do_parsing_stuff()
        if some_condition:
            # follow up with another request, carrying the params along
            next_request = scrapy.Request(otherurl, callback=self.parse_response)
            next_request.meta['some_useful_params'] = some_useful_params
            yield next_request
        else:
            yield items
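The program above works fine for me, but I need to change it so that it checks whether the HTML for the page already exists and, if so, uses that HTML instead of sending a request to the website.
The code now looks like this: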
    def start_requests(self):
        if html_exist:
            request = scrapy.Request(url)
            request.meta['some_useful_params'] = some_useful_params
            response = scrapy.http.Response(url, body=cached_html, request=request)
            # the line below doesn't call the method parse_response
            self.parse_response(response)
        else:
            request = scrapy.Request(url, callback=self.parse_response)
            request.meta['some_useful_params'] = some_useful_params
            yield request

    def parse_response(self, response):
        some_useful_params = response.meta['some_useful_params']
        do_parsing_stuff()
        if some_condition:
            if html_exist:
                request = scrapy.Request(url)
                request.meta['some_useful_params'] = some_useful_params
                presponse = scrapy.http.Response(url, body=cached_html, request=request)
                # the line below doesn't call the method parse_response
                self.parse_response(presponse)
            else:
                next_request = scrapy.Request(otherurl, callback=self.parse_response)
                next_request.meta['some_useful_params'] = some_useful_params
                yield next_request
        else:
            yield items
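The problem I am facing is that in the second version, when the HTML exists, the call to the parse_response method never happens.
I don't fully understand why, but I think it has something to do with Python generators. How can I solve this?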
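For what it's worth, the symptom matches how Python generators behave: calling a function that contains `yield` only creates a generator object, and its body does not run until something iterates over it. Here is a minimal sketch in plain Python, with no Scrapy involved. The names `calls`, `start_requests_broken`, `start_requests_fixed` and the `("request", page)` tuple are hypothetical stand-ins, not Scrapy API.

```python
calls = []  # records every time the body of parse_response actually runs


def parse_response(html):
    # Generator function: nothing in here executes until it is iterated.
    calls.append(html)
    yield {"html": html}


def start_requests_broken(pages, cache):
    for page in pages:
        if page in cache:
            # Builds a generator object and throws it away; the body never runs.
            parse_response(cache[page])
        else:
            yield ("request", page)  # stand-in for a real request object


def start_requests_fixed(pages, cache):
    for page in pages:
        if page in cache:
            # Delegates to the inner generator and re-yields its results.
            yield from parse_response(cache[page])
        else:
            yield ("request", page)
```

In the spider, the analogous change would presumably be `yield from self.parse_response(response)` in place of the bare call. One caveat I have not verified: `parse_response` can yield items, and not every Scrapy version may accept items coming out of `start_requests`.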
Is there any way I can pass the HTML (the cached copy) to a request object instead of making a request to the website? – sagar
You can use [`HttpCacheMiddleware`](https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpcache) – eLRuLL
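As a rough illustration of that suggestion, enabling the built-in HTTP cache might look like the following in the project's `settings.py`. This is only a sketch: the values are illustrative rather than taken from the question, and it assumes the filesystem storage backend is acceptable.

```python
# settings.py (sketch): let Scrapy serve repeated requests from its own cache
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"    # directory name for the stored responses
HTTPCACHE_EXPIRATION_SECS = 0  # 0 means cached responses never expire
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
```

With something like this in place, the spider can keep yielding ordinary `scrapy.Request` objects as in the first version, and the middleware decides whether to hit the network.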