
I am running the spider below, but it never enters the parse method and I cannot figure out why - Scrapy is not entering the parse function. Can someone please help?

My code is as follows:

    from scrapy.item import Item, Field
    from scrapy.selector import Selector
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector


    class MyItem(Item):
        reviewer_ranking = Field()
        print "asdadsa"


    class MySpider(BaseSpider):
        name = 'myspider'
        allowed_domains = ["amazon.com"]
        start_urls = ["http://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1/ref=cm_cr_pr_pdp"]
        print "sadasds"

        def parse(self, response):
            print "fggfggftgtr"
            sel = Selector(response)
            hxs = HtmlXPathSelector(response)
            item = MyItem()
            item["reviewer_ranking"] = hxs.select('//span[@class="a-size-small a-color-secondary"]/text()').extract()
            return item

The output I get is as follows:

    $ scrapy runspider crawler_reviewers_data.py
    asdadsa
    sadasds
    /home/raj/Documents/IIM A/Daily sales rank/Daily reviews/Reviews_scripts/Scripts_review/Reviews/Reviewer/crawler_reviewers_data.py:18: ScrapyDeprecationWarning: crawler_reviewers_data.MySpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
      class MySpider(BaseSpider):
    2014-06-24 19:21:35+0530 [scrapy] INFO: Scrapy 0.22.2 started (bot: scrapybot)
    2014-06-24 19:21:35+0530 [scrapy] INFO: Optional features available: ssl, http11
    2014-06-24 19:21:35+0530 [scrapy] INFO: Overridden settings: {}
    2014-06-24 19:21:35+0530 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2014-06-24 19:21:35+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2014-06-24 19:21:35+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2014-06-24 19:21:35+0530 [scrapy] INFO: Enabled item pipelines:
    2014-06-24 19:21:35+0530 [myspider] INFO: Spider opened
    2014-06-24 19:21:35+0530 [myspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2014-06-24 19:21:35+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6027
    2014-06-24 19:21:35+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6084
    2014-06-24 19:21:36+0530 [myspider] DEBUG: Crawled (403) <GET http://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1/ref=cm_cr_pr_pdp> (referer: None) ['partial']
    2014-06-24 19:21:36+0530 [myspider] INFO: Closing spider (finished)
    2014-06-24 19:21:36+0530 [myspider] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 259,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 28487,
         'downloader/response_count': 1,
         'downloader/response_status_count/403': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 6, 24, 13, 51, 36, 631236),
         'log_count/DEBUG': 3,
         'log_count/INFO': 7,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2014, 6, 24, 13, 51, 35, 472849)}
    2014-06-24 19:21:36+0530 [myspider] INFO: Spider closed (finished)

Please help me, I am stuck at exactly this point.

Answer


This is an anti-crawling measure used by Amazon - you are getting 403 Forbidden because Amazon requires a User-Agent header to be sent along with the request.

One option is to add it manually to the Request yielded from start_requests():

    from scrapy.http import Request

    class MySpider(BaseSpider):
        name = 'myspider'
        allowed_domains = ["amazon.com"]

        def start_requests(self):
            yield Request("https://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1/ref=cm_cr_pr_pdp",
                          headers={'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"})

        ...

Another option is to set DEFAULT_REQUEST_HEADERS project-wide in your settings.
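For example, a minimal sketch of what that could look like in the project's settings.py (the browser string below is only an illustration; any realistic User-Agent value should do):

    # settings.py (sketch) - DefaultHeadersMiddleware applies these headers
    # to every request that does not already set them itself.
    DEFAULT_REQUEST_HEADERS = {
        'User-Agent': ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
                       "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"),
    }

Setting the USER_AGENT option in settings.py is an equivalent way to change just this one header.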

Also note that Amazon provides an API for this data, and it has a Python wrapper - consider using that instead.

Hope that helps.


Thank you very much for the quick response. The manual approach does not work - I get the same 403 error. Can you tell me how to set DEFAULT_REQUEST_HEADERS for a spider? – Raj


@user2019135 Did you remove the 'start_urls' attribute? I tested the code before posting - it works for me. – alecxe


@user2019135 This is [how the spider should look](https://gist.github.com/alecxe/46f95778072ce4b59e79). – alecxe
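For reference, here is a self-contained sketch of the fixed spider in the spirit of the answer above (written against the Scrapy 0.22-era API used in the question, and not necessarily identical to the linked gist):

    from scrapy.item import Item, Field
    from scrapy.http import Request
    from scrapy.selector import Selector
    from scrapy.spider import BaseSpider


    class MyItem(Item):
        reviewer_ranking = Field()


    class MySpider(BaseSpider):
        name = 'myspider'
        allowed_domains = ["amazon.com"]

        def start_requests(self):
            # No start_urls: build the request by hand so a browser-like
            # User-Agent header can be attached to it (avoids Amazon's 403).
            yield Request("https://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1/ref=cm_cr_pr_pdp",
                          headers={'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) "
                                                 "AppleWebKit/537.1 (KHTML, like Gecko) "
                                                 "Chrome/22.0.1207.1 Safari/537.1"})

        def parse(self, response):
            # Same XPath as in the question, via the non-deprecated Selector API.
            sel = Selector(response)
            item = MyItem()
            item["reviewer_ranking"] = sel.xpath(
                '//span[@class="a-size-small a-color-secondary"]/text()').extract()
            return item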