2012-06-11

Capture HTTP status codes with a Scrapy spider

I'm new to Scrapy. I'm writing a spider to check a list of URLs for their server status codes and, where appropriate, the URLs they redirect to. Importantly, if there is a chain of redirects, I need to know the status code and URL at every hop. I'm using response.meta['redirect_urls'] to capture the URLs, but I'm not sure how to capture the status codes; there doesn't seem to be a response meta key for them.

I realise I probably need to write some custom middleware to expose these values, but I'm not entirely clear how to log the status code of every hop, nor how to access those values from the spider. I've looked around but couldn't find any examples of anyone doing this. If anyone can point me in the right direction it would be much appreciated.

For example:

items = []
item = RedirectItem()
item['url'] = response.url
item['redirected_urls'] = response.meta['redirect_urls']
item['status_codes'] = # ????
items.append(item)

EDIT: Based on feedback from warawauk, and some really proactive help from the guys in the IRC channel (#scrapy on freenode), I've managed to do this. I believe it's a bit hacky, so any comments for improvement are welcome:

(1) Disable the default RedirectMiddleware in your settings and add your own:

DOWNLOADER_MIDDLEWARES = { 
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None, 
    'myproject.middlewares.CustomRedirectMiddleware': 100, 
} 

(2) Create your CustomRedirectMiddleware in middlewares.py. It inherits from the main RedirectMiddleware class and captures the redirects:

from urlparse import urljoin

from scrapy.contrib.downloadermiddleware.redirect import RedirectMiddleware
from scrapy.http import HtmlResponse
from scrapy.utils.response import get_meta_refresh

class CustomRedirectMiddleware(RedirectMiddleware):
    """Handle redirection of requests based on response status and meta-refresh html tag"""

    def process_response(self, request, response, spider):
        # Record the status code of every hop
        request.meta.setdefault('redirect_status', []).append(response.status)
        if 'dont_redirect' in request.meta:
            return response

        if request.method.upper() == 'HEAD':
            if response.status in [301, 302, 303, 307] and 'Location' in response.headers:
                redirected_url = urljoin(request.url, response.headers['location'])
                redirected = request.replace(url=redirected_url)
                return self._redirect(redirected, request, spider, response.status)
            else:
                return response

        if response.status in [302, 303] and 'Location' in response.headers:
            redirected_url = urljoin(request.url, response.headers['location'])
            redirected = self._redirect_request_using_get(request, redirected_url)
            return self._redirect(redirected, request, spider, response.status)

        if response.status in [301, 307] and 'Location' in response.headers:
            redirected_url = urljoin(request.url, response.headers['location'])
            redirected = request.replace(url=redirected_url)
            return self._redirect(redirected, request, spider, response.status)

        if isinstance(response, HtmlResponse):
            interval, url = get_meta_refresh(response)
            if url and interval < self.max_metarefresh_delay:
                redirected = self._redirect_request_using_get(request, url)
                return self._redirect(redirected, request, spider, 'meta refresh')

        return response
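The key line is the setdefault/append at the top of process_response. Its effect can be illustrated with plain dicts, since Scrapy carries the previous request's meta forward onto each redirected request (the statuses below are made-up sample values, not from the original post):

```python
# Plain-dict sketch of how the middleware's setdefault/append line
# collects one status code per hop. The meta dict of each redirected
# request is simulated here with a shallow dict() copy.

def record_status(meta, status):
    # same pattern as in process_response above
    meta.setdefault('redirect_status', []).append(status)
    return meta

meta = {}                        # meta of the initial request
record_status(meta, 301)         # first hop responds 301
meta = dict(meta)                # meta carried onto the redirected request
record_status(meta, 302)         # second hop responds 302
record_status(meta, 200)         # final response
print(meta['redirect_status'])   # → [301, 302, 200]
```

Because setdefault only creates the list once, every later hop appends to the same list rather than overwriting it.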

(3) Now you can access the list of redirect status codes in your spider with:

request.meta['redirect_status'] 

You should post your solution as an answer – raben

Answers


I think it is available as

response.status 

See http://doc.scrapy.org/en/0.14/topics/request-response.html#scrapy.http.Response


Thanks for the response lindelof. My difficulty is that the typical use of response.status gives you the status of the final response, after all the redirects. I need response.status for every hop, and I'm not clear how to capture all of them. Does that make sense? – reportingmonkey


You could just append the status codes the same way ['redirect_urls'] is appended to –

Oh I see, I misunderstood. Then I think you need to subclass 'scrapy.contrib.spidermiddleware.SpiderMiddleware' as described at http://doc.scrapy.org/en/0.14/topics/spider-middleware.html#writing-your-own-spider-middleware and override 'process_spider_input' to append the intermediate status codes, say to 'response.meta['status_codes']', which should be initialized to an empty list. But I haven't tried it. – lindelof


response.meta['redirect_urls'] is populated by RedirectMiddleware. Your spider callbacks will never receive the intermediate responses, only the one after the last redirect.

If you want to control the process, subclass RedirectMiddleware, disable the original one and enable yours. Then you can control the redirect process, including tracking the response statuses.

Here is the original implementation (scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware):

class RedirectMiddleware(object):
    """Handle redirection of requests based on response status and meta-refresh html tag"""

    def _redirect(self, redirected, request, spider, reason):
        ...
        redirected.meta['redirect_urls'] = request.meta.get('redirect_urls', []) + \
            [request.url]

As you can see, the _redirect method, which is called from several places, creates meta['redirect_urls'].

And in the process_response method, return self._redirect(redirected, request, spider, response.status) is called, which means the original response is not passed to the spider.
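The accumulation that _redirect performs can be seen in miniature with plain dicts; a rough sketch of the pattern, using made-up sample URLs:

```python
# Plain-dict sketch of the redirect_urls accumulation done by _redirect:
# each redirected request's meta gets the previous list plus the URL
# it was redirected from.

def build_redirect_meta(request_meta, request_url):
    redirected_meta = dict(request_meta)
    # mirrors: redirected.meta['redirect_urls'] =
    #          request.meta.get('redirect_urls', []) + [request.url]
    redirected_meta['redirect_urls'] = request_meta.get('redirect_urls', []) + [request_url]
    return redirected_meta

meta = {}
meta = build_redirect_meta(meta, 'http://example.com/start')
meta = build_redirect_meta(meta, 'http://example.com/hop1')
print(meta['redirect_urls'])
# → ['http://example.com/start', 'http://example.com/hop1']
```

Each hop's URL is added as the redirect is followed, which is why the final response delivered to the spider carries the whole chain.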


Thanks warwaruk, that makes sense. I'm looking at the RedirectMiddleware and I think I can reverse-engineer that part. I think I'm still missing something here, though: the class references request.meta.get('redirect_urls'), so I assumed the values are passed with each request. That also makes sense, but I can't find where that actually happens. I'll edit my original post to see if I can clarify where I'm struggling – reportingmonkey


@user1449163, this middleware is what creates meta['redirect_urls'] – see the update to the answer – warvariuc


A KISS solution: I think it's best to add the strict minimum of code needed to capture the new redirect field, and let RedirectMiddleware do the rest:

from scrapy.contrib.downloadermiddleware.redirect import RedirectMiddleware

class CustomRedirectMiddleware(RedirectMiddleware):
    """Handle redirection of requests based on response status and meta-refresh html tag"""

    def process_response(self, request, response, spider):
        # Get the redirect status codes
        request.meta.setdefault('redirect_status', []).append(response.status)
        response = super(CustomRedirectMiddleware, self).process_response(request, response, spider)
        return response

Then, in your BaseSpider subclass, you can access redirect_status with the following:

def parse(self, response):
    item = ScrapyGoogleindexItem()
    item['redirections'] = response.meta.get('redirect_times', 0)
    item['redirect_status'] = response.meta['redirect_status']
    return item
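With both redirect_urls and redirect_status in meta, the hops can be paired up inside the callback. A sketch with made-up sample values (note that redirect_status, as collected above, also records the status of the final response, so it is one entry longer than redirect_urls):

```python
# Hypothetical meta values for a chain start -> hop1 -> final page;
# these are illustrative, not values from the original post.
redirect_urls = ['http://example.com/start', 'http://example.com/hop1']
redirect_status = [301, 302, 200]   # last entry: status of the final response

# zip pairs each redirected-from URL with the status that caused the hop;
# the extra final status is dropped by zip's truncation.
hops = list(zip(redirect_urls, redirect_status))
print(hops)                 # → [('http://example.com/start', 301), ('http://example.com/hop1', 302)]
print(redirect_status[-1])  # → 200
```

This gives exactly what the question asked for: the status code and URL at every hop, plus the final status separately.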