2016-10-12 26 views
0

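Scrapy does not call the parse function when using start_requests

I am fairly new to Python and Scrapy, but something does not seem right here. According to the documentation and examples, re-implementing the start_requests function causes Scrapy to use the requests returned by start_requests instead of the start_urls list.

Everything works fine with start_urls, but as soon as I add start_requests, the parse function is never entered. The documentation states that the parse method is

the default callback used by Scrapy to process downloaded responses, when their requests don't specify a callback

Judging by my logger output, parse is never executed.

Here is my code. It is very short because I am only experimenting with it.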

import logging

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess


class Crawler(scrapy.Spider):
    name = 'Hearthpwn'
    allowed_domains = ['hearthpwn.com']
    storage_dir = 'C:/Users/Michal/PycharmProjects/HearthpwnCrawler/'
    start_urls = ['http://www.hearthpwn.com/decks/645987-nzoth-warrior']

    def start_requests(self):
        logging.log(logging.INFO, "Loading requests")
        # No callback is given, so parse() is expected to handle the response
        yield Request(url='http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter')

    def parse(self, response):
        logging.log(logging.INFO, "parsing response")

        # Save the page under the last segment of its URL
        filename = response.url.split("/")[-1] + '.html'
        with open('html/' + filename, 'wb') as f:
            f.write(response.body)


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(Crawler)
process.start()

And this is the console output:

2016-10-12 15:33:39 [scrapy] INFO: Scrapy 1.2.0 started (bot: scrapybot) 
2016-10-12 15:33:39 [scrapy] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'} 
2016-10-12 15:33:39 [scrapy] INFO: Enabled extensions: 
['scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats', 
'scrapy.extensions.logstats.LogStats'] 
2016-10-12 15:33:39 [scrapy] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2016-10-12 15:33:39 [scrapy] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2016-10-12 15:33:39 [scrapy] INFO: Enabled item pipelines: 
[] 
2016-10-12 15:33:39 [scrapy] INFO: Spider opened 
2016-10-12 15:33:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-10-12 15:33:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-10-12 15:33:39 [root] INFO: Loading requests 
2016-10-12 15:33:41 [scrapy] DEBUG: Redirecting (302) to <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter?cookieTest=1> from <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter> 
2016-10-12 15:33:41 [scrapy] DEBUG: Redirecting (302) to <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter> from <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter?cookieTest=1> 
2016-10-12 15:33:41 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates) 
2016-10-12 15:33:41 [scrapy] INFO: Closing spider (finished) 
2016-10-12 15:33:41 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 655, 
'downloader/request_count': 2, 
'downloader/request_method_count/GET': 2, 
'downloader/response_bytes': 1248, 
'downloader/response_count': 2, 
'downloader/response_status_count/302': 2, 
'dupefilter/filtered': 1, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 10, 12, 13, 33, 41, 740724), 
'log_count/DEBUG': 4, 
'log_count/INFO': 8, 
'scheduler/dequeued': 2, 
'scheduler/dequeued/memory': 2, 
'scheduler/enqueued': 2, 
'scheduler/enqueued/memory': 2, 
'start_time': datetime.datetime(2016, 10, 12, 13, 33, 39, 441736)} 
2016-10-12 15:33:41 [scrapy] INFO: Spider closed (finished) 

Thanks for any hints.

Answers

1

Setting the dont_merge_cookies key in the request's meta dictionary will solve this problem.

def start_requests(self):
    logging.log(logging.INFO, "Loading requests")
    # dont_merge_cookies tells the cookies middleware to leave this request alone
    yield Request(url='http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter',
                  meta={'dont_merge_cookies': True})

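Thanks to both you and @Granitosaurus! This answer was enough for what I wanted, but both gave me an interesting insight. I ended up with the redirect-style links, and it was easy to parse the name back into its original form and save it.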
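
The comment above mentions ending up with the redirect-style URL (the one carrying `?cookieTest=1`) and recovering the original name from it. A minimal sketch of one way to do that with the standard library, assuming the query string is the only difference between the two URLs; the helper name `filename_from_url` is hypothetical and not part of the original spider:

```python
from urllib.parse import urlsplit


def filename_from_url(url):
    """Build an .html file name from the last path segment, ignoring the query."""
    # urlsplit separates the path from the query string, so a trailing
    # "?cookieTest=1" never ends up in the file name.
    path = urlsplit(url).path
    # rstrip guards against a trailing slash producing an empty segment.
    slug = path.rstrip("/").split("/")[-1]
    return slug + ".html"
```

Inside `parse`, a call such as `filename_from_url(response.url)` could stand in for `response.url.split("/")[-1] + '.html'`, so the saved file gets the same name whether or not the response came from the redirected URL.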

0
2016-10-12 15:33:41 [scrapy] DEBUG: Redirecting (302) to <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter?cookieTest=1> from <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter> 
2016-10-12 15:33:41 [scrapy] DEBUG: Redirecting (302) to <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter> from <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter?cookieTest=1> 
2016-10-12 15:33:41 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates) 

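What happens here is that the site redirects you a couple of times, and you end up requesting the same URL twice. By default, Scrapy's scheduler filters out duplicate requests, so you need to set the dont_filter parameter to True when creating the Request object to bypass this filter.

For example: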

def start_requests(self):
    # Request is scrapy.Request; dont_filter=True bypasses the duplicate filter
    yield Request('http://scrapy.org', dont_filter=True)
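
To make the log above easier to follow, here is a plain-Python sketch of the redirect loop and the duplicate filter. It does not use Scrapy at all: `fake_server`, `crawl`, and the `seen` and `cookies` sets are illustrative assumptions that only mimic the behavior described in this answer. How the real site decides when to redirect is a guess based on the `cookieTest` parameter in the log, and Scrapy's real filter compares request fingerprints rather than raw URL strings.

```python
BASE = "http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter"
SUFFIX = "?cookieTest=1"


def fake_server(url, cookies):
    """Hypothetical stand-in for the site: returns (status, redirect_target)."""
    if url.endswith(SUFFIX):
        # Assumed behavior: the test URL sets a cookie and sends us back.
        cookies.add("cookieTest")
        return 302, url[: -len(SUFFIX)]
    if "cookieTest" not in cookies:
        # Assumed behavior: the first visit without a cookie is bounced.
        return 302, url + SUFFIX
    # Cookie present: the page itself is served.
    return 200, None


def crawl(start_url, dont_filter=False):
    """Return the URL whose response would reach parse(), or None if filtered."""
    seen, cookies = set(), set()
    url = start_url
    while True:
        if url in seen and not dont_filter:
            # Mirrors the "Filtered duplicate request" line in the log.
            return None
        seen.add(url)
        status, target = fake_server(url, cookies)
        if status == 200:
            return url
        url = target
```

With these assumptions, `crawl(BASE)` returns `None` because the second request for the original URL is dropped before it is sent, while `crawl(BASE, dont_filter=True)` returns `BASE`, which corresponds to the response finally reaching `parse`.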