2017-08-02

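I am trying to scrape shoe prices from the website shown in the code below. I don't know whether my syntax is correct, and I could really use some help. Scrapy reports 0 pages crawled.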

from scrapy.spider import BaseSpider 
from scrapy import Field 
from scrapy import Item 
from scrapy.selector import HtmlXPathSelector 

def Yeezy(Item):
    price = Field()


class YeezySpider(BaseSpider): 
    name = "yeezy" 
    allowed_domains = ["https://www.grailed.com/"] 
    start_url = ['https://www.grailed.com/feed/0Qu8Gh1qHQ?page=2'] 

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        price = hxs.css('.listing-price .sub-title:nth-child(1) span').extract()
        items = []
        for price in price:
            item = Yeezy()
            item["price"] = price.select(".listing-price .sub-title:nth-child(1) span").extract()
            items.append(item)
        yield item

The code reports this in the console:

ScrapyDeprecationWarning: YeezyScrape.spiders.yeezy_spider.YeezySpider  inherits from deprecated class scrapy.spider.BaseSpider, please inherit from  scrapy.spider.Spider. (warning only on first subclass, there may be others) 
    class YeezySpider(BaseSpider): 
2017-08-02 14:45:25-0700 [scrapy] INFO: Scrapy 0.25.1 started (bot: YeezyScrape) 
2017-08-02 14:45:25-0700 [scrapy] INFO: Optional features available: ssl,  http11 
2017-08-02 14:45:25-0700 [scrapy] INFO: Overridden settings:  {'NEWSPIDER_MODULE': 'YeezyScrape.spiders', 'SPIDER_MODULES':  ['YeezyScrape.spiders'], 'BOT_NAME': 'YeezyScrape'} 
2017-08-02 14:45:25-0700 [scrapy] INFO: Enabled extensions: LogStats,  TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState 
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled item pipelines: 
2017-08-02 14:45:26-0700 [yeezy] INFO: Spider opened 
2017-08-02 14:45:26-0700 [yeezy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-08-02 14:45:26-0700 [scrapy] DEBUG: Telnet console listening on  127.0.0.1:6023 
2017-08-02 14:45:26-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080 
2017-08-02 14:45:26-0700 [yeezy] INFO: Closing spider (finished) 
2017-08-02 14:45:26-0700 [yeezy] INFO: Dumping Scrapy stats: 
{'finish_reason': 'finished', 
'finish_time': datetime.datetime(2017, 8, 2, 21, 45, 26, 127000), 
'log_count/DEBUG': 2, 
'log_count/INFO': 7, 
'start_time': datetime.datetime(2017, 8, 2, 21, 45, 26, 125000)} 
2017-08-02 14:45:26-0700 [yeezy] INFO: Spider closed (finished) 

Process finished with exit code 0 

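At first I thought the problem was in how I entered the CSS selector, but now I'm not so sure. This is my first attempt at a project like this, and I could really use some insight. Thanks in advance.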
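
One thing that may be worth checking first: the log shows no request at all, which would be consistent with the spider never scheduling its start page. As far as I know, Scrapy's default `start_requests()` iterates an attribute named `start_urls` (plural), and `allowed_domains` is meant to hold bare host names rather than full URLs. The sketch below is plain Python, not Scrapy code; `audit_spider_attrs` is a hypothetical helper written only to illustrate those two checks against the attributes from the first spider.

```python
from urllib.parse import urlparse


def audit_spider_attrs(attrs):
    """Return warnings for a dict of spider class attributes (hypothetical helper)."""
    warnings = []
    # Scrapy's default start_requests() iterates `start_urls`; any other
    # spelling is just an unused attribute, so no request is scheduled.
    if not attrs.get("start_urls"):
        warnings.append("start_urls is missing or empty")
    # allowed_domains entries are compared with request host names, so an
    # entry that carries a scheme is not a usable domain.
    for entry in attrs.get("allowed_domains", []):
        if "://" in entry:
            host = urlparse(entry).netloc
            warnings.append("allowed_domains entry %r should be %r" % (entry, host))
    return warnings


# Attributes copied from the first spider in the question.
question_attrs = {
    "name": "yeezy",
    "allowed_domains": ["https://www.grailed.com/"],
    "start_url": ["https://www.grailed.com/feed/0Qu8Gh1qHQ?page=2"],
}
for message in audit_spider_attrs(question_attrs):
    print(message)
```

If that is indeed the cause, renaming the attribute to `start_urls` and using `"www.grailed.com"` in `allowed_domains` would be the corresponding changes in the spider.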

Edit: I tried to mimic the XHR request in my code by following another example. This is what I have:

import scrapy 
from scrapy.http import FormRequest 
from scrapy.selector import HtmlXPathSelector 
#from YeezyScrape import YeezyscrapeItem 


class YeezySpider(scrapy.Spider): 
    name = "yeezy" 
    allowed_domains = ["www.grailed.com"] 
    start_url = ["https://www.grailed.com/feed/0Qu8Gh1qHQ?page=2"] 

    def parse(self, response): 
     for i in range(0,2): 
      yield FormRequest(url = 'https://mnrwefss2q-dsn.algolia.net/1/indexes/Listing_production/query?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.21.1&x-algolia-application-id=MNRWEFSS2Q&x-algolia-api-key=a3a4de2e05d9e9b463911705fb6323ad',
method="post", formdata={"params":"query:boost 
filters:(strata:'basic' OR strata:'grailed' OR strata:'hype') AND 
(category_path:'footwear.slip_ons' OR category_path:'footwear.sandals' OR 
category_path:'footwear.lowtop_sneakers' OR category_path:'footwear.leather' 
OR category_path:'footwear.hitop_sneakers' OR 
category_path:'footwear.formal_shoes' OR category_path:'footwear.boots') AND 
(marketplace:grailed) 
hitsPerPage:40 
facets ["strata","size","category","category_size", 
"category_path","category_path_size", 
"category_path_root_size","price_i","designers.id", 
"location","marketplace"] 
page:2"}, callback=self.data_parse()) 

    def data_parse(self, response):
        hxs = HtmlXPathSelector(response)
        prices = hxs.xpath("//p").extract()
        for prices in prices:
            price = prices.select("a/text()").extract()
            print price

I had to reformat a few things to accommodate the indentation differences between Python and Stack Overflow.
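
A possible issue with the request itself: to my understanding, Algolia's search endpoint expects a JSON body of the form `{"params": "<URL-encoded query string>"}`, whereas `FormRequest(formdata=...)` sends a form-encoded body. The sketch below builds such a body with the standard library only. The reduced parameter set is an assumption for illustration, and the filter is shortened from the one in the question.

```python
import json
from urllib.parse import urlencode, parse_qs

# Reduced set of search parameters (assumed subset of the ones in the question).
search_params = {
    "query": "boost",
    "filters": "(marketplace:grailed)",
    "hitsPerPage": 40,
    "page": 2,
}

# The search parameters travel as one URL-encoded string under the "params" key.
body = json.dumps({"params": urlencode(search_params)})
print(body)

# Round-trip check: decoding the string gives back the original values.
decoded = parse_qs(json.loads(body)["params"])
print(decoded["filters"][0])
```

In the spider, this string could be passed as `body=` to a plain `scrapy.Request` with `method="POST"`. The callback would be given as `callback=self.data_parse` (no parentheses), since `self.data_parse()` calls the method immediately instead of handing it to Scrapy.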

These are the logs reported in the terminal. Thanks again for your help:

C:\Python27\python.exe C:/Python27/Lib/site-packages/scrapy/cmdline.py crawl yeezy -o price.json 
2017-08-04 13:23:27-0700 [scrapy] INFO: Scrapy 0.25.1 started (bot: YeezyScrape) 
2017-08-04 13:23:27-0700 [scrapy] INFO: Optional features available: ssl, http11 
2017-08-04 13:23:27-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'YeezyScrape.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['YeezyScrape.spiders'], 'FEED_URI': 'price.json', 'BOT_NAME': 'YeezyScrape'} 
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState 
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled item pipelines: 
2017-08-04 13:23:27-0700 [yeezy] INFO: Spider opened 
2017-08-04 13:23:28-0700 [yeezy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-08-04 13:23:28-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2017-08-04 13:23:28-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080 
2017-08-04 13:23:28-0700 [yeezy] INFO: Closing spider (finished) 
2017-08-04 13:23:28-0700 [yeezy] INFO: Dumping Scrapy stats: 
    {'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2017, 8, 4, 20, 23, 28, 3000), 
    'log_count/DEBUG': 2, 
    'log_count/INFO': 7, 
    'start_time': datetime.datetime(2017, 8, 4, 20, 23, 28, 1000)} 
2017-08-04 13:23:28-0700 [yeezy] INFO: Spider closed (finished) 

Process finished with exit code 0 
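
Note that this second log again shows no request being made, which would fit the attribute still being named `start_url`, so `parse` is never called. Once the request does go out, the search endpoint answers with JSON rather than HTML, so an XPath such as `//p` would likely match nothing. A minimal sketch of reading prices from such a body follows. The sample payload is made up, and `price` is an assumed field name (`hits` is, as far as I know, the key Algolia uses for the result list).

```python
import json


def extract_prices(body_text):
    """Pull prices out of a JSON search response (field names are assumed)."""
    payload = json.loads(body_text)
    # "hits" holds the matching records; "price" is an assumed field name
    # for this particular index, so records without it are skipped.
    return [hit["price"] for hit in payload.get("hits", []) if "price" in hit]


# Made-up sample body, for illustration only.
sample_body = (
    '{"hits": [{"title": "A", "price": 220},'
    ' {"title": "B", "price": 310},'
    ' {"title": "C"}], "page": 2}'
)
print(extract_prices(sample_body))
```

Inside the callback, the same function would be applied to the text of the response body instead of the sample string.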

Answers