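Capturing HTTP status codes with a Scrapy spider
I am new to Scrapy. I am writing a spider that checks a list of URLs for their server status codes and, where appropriate, the URLs they redirect to. Importantly, if there is a chain of redirects, I need to know the status code and URL at each hop. I am using response.meta['redirect_urls'] to capture the URLs, but I am not sure how to capture the status codes: there doesn't seem to be a response meta key for them.
I realise I may need to write some custom middleware to expose these values, but I am not clear on how to record the status code for each hop, or how to access these values from the spider. I have had a look but can't find an example of anyone doing this. If anyone can point me in the right direction, it would be much appreciated.
For example,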
items = []
item = RedirectItem()
item['url'] = response.url
item['redirected_urls'] = response.meta['redirect_urls']
item['status_codes'] = None  # ???? (how can these be captured?)
items.append(item)
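EDIT: Based on feedback from warawauk and some really proactive help from the guys on the IRC channel (#scrapy on freenode), I have managed to do this. I believe it is a bit hacky, so any comments on how to improve it are welcome:
(1) Disable the default middleware in the settings and add your own: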
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
    'myproject.middlewares.CustomRedirectMiddleware': 100,
}
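The dotted path in step (1) uses the old scrapy.contrib namespace. In Scrapy 1.0 and later the built-in middleware lives under scrapy.downloadermiddlewares, so on a newer install the equivalent setting would probably look like the sketch below. myproject is the same placeholder project name as above, and the priority of 100 is simply carried over from this post rather than a recommended value.

```python
# settings.py sketch for Scrapy >= 1.0 (assumption: same project layout as above)
DOWNLOADER_MIDDLEWARES = {
    # Switch off the built-in redirect middleware...
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    # ...and enable the custom subclass in its place.
    'myproject.middlewares.CustomRedirectMiddleware': 100,
}
```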
(2) Create your CustomRedirectMiddleware in your middlewares.py. It inherits from the main RedirectMiddleware class and captures the redirects:
from urlparse import urljoin  # Python 2; on Python 3 use urllib.parse.urljoin

from scrapy.contrib.downloadermiddleware.redirect import RedirectMiddleware
from scrapy.http import HtmlResponse
from scrapy.utils.response import get_meta_refresh


class CustomRedirectMiddleware(RedirectMiddleware):
    """Handle redirection of requests based on response status and meta-refresh html tag"""

    def process_response(self, request, response, spider):
        # Record the status code of every response in the chain, including the final one
        request.meta.setdefault('redirect_status', []).append(response.status)
        if 'dont_redirect' in request.meta:
            return response
        # HEAD requests keep their method when following a redirect
        if request.method.upper() == 'HEAD':
            if response.status in [301, 302, 303, 307] and 'Location' in response.headers:
                redirected_url = urljoin(request.url, response.headers['location'])
                redirected = request.replace(url=redirected_url)
                return self._redirect(redirected, request, spider, response.status)
            else:
                return response
        # 302/303: follow the redirect with a GET request
        if response.status in [302, 303] and 'Location' in response.headers:
            redirected_url = urljoin(request.url, response.headers['location'])
            redirected = self._redirect_request_using_get(request, redirected_url)
            return self._redirect(redirected, request, spider, response.status)
        # 301/307: repeat the original request against the new URL
        if response.status in [301, 307] and 'Location' in response.headers:
            redirected_url = urljoin(request.url, response.headers['location'])
            redirected = request.replace(url=redirected_url)
            return self._redirect(redirected, request, spider, response.status)
        # HTML meta-refresh redirects
        if isinstance(response, HtmlResponse):
            interval, url = get_meta_refresh(response)
            if url and interval < self.max_metarefresh_delay:
                redirected = self._redirect_request_using_get(request, url)
                return self._redirect(redirected, request, spider, 'meta refresh')
        return response
(3) You can now access the list of redirect status codes from your spider with:
request.meta['redirect_status']
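To get back to the original goal of knowing the URL and status code at each hop, the two meta lists can be zipped together. This is only a sketch under some assumptions: pair_hops is a made-up helper name, the meta dict below is hand-written sample data rather than real crawl output, and it relies on the middleware above having appended exactly one status per response (including the final one), which may not hold if another middleware such as a retry middleware re-processes a response. In a spider callback, response.meta is a shortcut for response.request.meta, so you would pass response.meta and response.url.

```python
def pair_hops(meta, final_url):
    """Pair each URL in the redirect chain with the status code recorded for it.

    meta      -- a dict shaped like response.meta
    final_url -- the URL of the final response (response.url in a callback)
    """
    # redirect_urls holds every URL that redirected; the final URL is not in it.
    urls = list(meta.get('redirect_urls', [])) + [final_url]
    # redirect_status is the list filled in by the custom middleware.
    statuses = list(meta.get('redirect_status', []))
    return list(zip(urls, statuses))


# Hypothetical meta contents for a two-hop chain, for illustration only.
example_meta = {
    'redirect_urls': ['http://example.com/a', 'http://example.com/b'],
    'redirect_status': [301, 302, 200],
}
hops = pair_hops(example_meta, 'http://example.com/c')
for url, status in hops:
    print(status, url)
```

In the question's item, item['status_codes'] could then be filled from response.meta['redirect_status'] directly, or the item could store the paired list instead.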
You should post your solution as an answer. – raben