2012-11-22 52 views

Unable to follow links using Scrapy

I am unable to follow the link and get the values back.

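I tried the code below. It crawls the first link, but the request for the second link is never followed, so the callback function is never called.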

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from scrapy.http.request import Request 


class ScrapyOrgSpider(BaseSpider): 
    name = "scrapy" 
    allowed_domains = ["example.com"] 
    start_urls = ["http://www.example.com/abcd"] 


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        res1 = Request("http://www.example.com/follow", self.a_1)
        print res1

    def a_1(self, response1):
        hxs2 = HtmlXPathSelector(response1)
        print hxs2.select("//a[@class='channel-link']").extract()[0]
        return response1

Answers

You forgot to return the request from the parse() method. Try this code:

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from scrapy.http.request import Request 


class ScrapyOrgSpider(BaseSpider): 
    name = "example.com" 
    allowed_domains = ["example.com"] 
    start_urls = ["http://www.example.com/abcd"] 

    def parse(self, response):
        self.log('@@ Original response: %s' % response)
        req = Request("http://www.example.com/follow", callback=self.a_1)
        self.log('@@ Next request: %s' % req)
        return req

    def a_1(self, response):
        hxs = HtmlXPathSelector(response)
        self.log('@@ extraction: %s' %
                 hxs.select("//a[@class='channel-link']").extract())

Log output:

2012-11-22 12:20:06-0600 [scrapy] INFO: Scrapy 0.17.0 started (bot: oneoff) 
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState 
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Enabled item pipelines: 
2012-11-22 12:20:06-0600 [example.com] INFO: Spider opened 
2012-11-22 12:20:06-0600 [example.com] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023 
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080 
2012-11-22 12:20:07-0600 [example.com] DEBUG: Redirecting (302) to <GET http://www.iana.org/domains/example/> from <GET http://www.example.com/abcd> 
2012-11-22 12:20:07-0600 [example.com] DEBUG: Crawled (200) <GET http://www.iana.org/domains/example/> (referer: None) 
2012-11-22 12:20:07-0600 [example.com] DEBUG: @@ Original response: <200 http://www.iana.org/domains/example/> 
2012-11-22 12:20:07-0600 [example.com] DEBUG: @@ Next request: <GET http://www.example.com/follow> 
2012-11-22 12:20:07-0600 [example.com] DEBUG: Redirecting (302) to <GET http://www.iana.org/domains/example/> from <GET http://www.example.com/follow> 
2012-11-22 12:20:08-0600 [example.com] DEBUG: Crawled (200) <GET http://www.iana.org/domains/example/> (referer: http://www.iana.org/domains/example/) 
2012-11-22 12:20:08-0600 [example.com] DEBUG: @@ extraction: [] 
2012-11-22 12:20:08-0600 [example.com] INFO: Closing spider (finished) 
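
Note that the `@@ extraction` line shows an empty list. Both requests were redirected to http://www.iana.org/domains/example/, which evidently has no `a` element with class `channel-link`. With an empty result, the `.extract()[0]` used in the question would raise an IndexError, so it may be worth guarding that lookup. Below is a minimal sketch of the idea. It uses the standard library's `xml.etree.ElementTree` instead of Scrapy's selectors so it runs on its own; the markup strings and the `first_channel_link` helper are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup used only for illustration.
PAGE_WITH_LINK = "<html><body><a class='channel-link' href='/ch/1'>Channel 1</a></body></html>"
PAGE_WITHOUT_LINK = "<html><body><a href='/about'>About</a></body></html>"


def first_channel_link(markup):
    """Return the href of the first a.channel-link element, or None if there is none."""
    root = ET.fromstring(markup)
    matches = root.findall(".//a[@class='channel-link']")
    if not matches:
        # Nothing matched: report that instead of indexing into an empty list.
        return None
    return matches[0].get("href")


print(first_channel_link(PAGE_WITH_LINK))
print(first_channel_link(PAGE_WITHOUT_LINK))
```

The same check applies to the list returned by `extract()` in the spider: test that it is non-empty before taking element `[0]`.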

The parse function must return the request, not just print it.

def parse(self, response): 
    hxs = HtmlXPathSelector(response) 
    res1 = Request("http://www.example.com/follow", callback=self.a_1) 
    print res1 # if you want 
    return res1
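
To see why the return value matters, here is a toy model of a crawl loop. It is not Scrapy's actual engine: the `Request` class, the `crawl` function, and the callback names are simplified stand-ins written for this sketch. The only assumption carried over from the answers above is that the framework schedules a request solely when a callback hands it back.

```python
from collections import deque


class Request:
    """Toy stand-in for a crawler request: a URL plus the callback for its response."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def crawl(start_request):
    """Process requests in order; only a Request returned by a callback gets scheduled."""
    queue = deque([start_request])
    visited = []
    while queue:
        request = queue.popleft()
        visited.append(request.url)
        # A real framework would download the page here; the URL stands in for the response.
        result = request.callback(request.url)
        if isinstance(result, Request):
            queue.append(result)
    return visited


def follow_page(response):
    return None  # end of the chain, nothing further to schedule


def parse_print_only(response):
    req = Request("http://www.example.com/follow", follow_page)
    print(req.url)  # the request is built and printed, then discarded


def parse_returning(response):
    return Request("http://www.example.com/follow", follow_page)


start = "http://www.example.com/abcd"
print(crawl(Request(start, parse_print_only)))
print(crawl(Request(start, parse_returning)))
```

With `parse_print_only` the loop visits only the start URL, because the second request never reaches the queue. With `parse_returning` the follow URL is visited as well, which is the behavior the question was after.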