I am trying to learn Scrapy. I have managed to crawl some sites, but on others I have failed. The site I am trying to scrape is http://www.polyhousestore.com/, an e-commerce site where Scrapy does not get the products.
I created a test spider that should get all the products on the page http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60
When I run the spider, it does not find any products. Can someone help me understand what I am doing wrong? Is it related to the CSS ::before and ::after pseudo-elements? How can I make it work?
The spider's code (which does not get the products on the page):
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector


class PolySpider(scrapy.Spider):
    name = "poly"
    allowed_domains = ["polyhousestore.com"]
    start_urls = (
        'http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60',
    )

    def parse(self, response):
        sel = Selector(response)
        products = sel.xpath('/html/body/div[4]/div/div[5]/div/div/div/div/div[2]/div[3]/div[2]/div')
        items = []
        if not products:
            print '------------- No products from sel.xpath'
        else:
            print '------------- Found products ' + str(len(products))
I run the spider from the command line; here is the output:
D:\scrapyProj\cmdProj>scrapy crawl poly
2016-01-19 10:23:16 [scrapy] INFO: Scrapy 1.0.3 started (bot: cmdProj)
2016-01-19 10:23:16 [scrapy] INFO: Optional features available: ssl, http11
2016-01-19 10:23:16 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'cmdProj.spiders', 'SPIDER_MODULES': ['cmdProj.spiders'], 'BOT_NAME': 'cmdProj'}
2016-01-19 10:23:17 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-01-19 10:23:17 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-19 10:23:17 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-19 10:23:17 [scrapy] INFO: Enabled item pipelines:
2016-01-19 10:23:17 [scrapy] INFO: Spider opened
2016-01-19 10:23:17 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-19 10:23:17 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-19 10:23:17 [scrapy] DEBUG: Crawled (200) <GET http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60> (referer: None)
------------- No products from sel.xpath
2016-01-19 10:23:18 [scrapy] INFO: Closing spider (finished)
2016-01-19 10:23:18 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 254,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 16091,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 1, 19, 8, 23, 18, 53000),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 1, 19, 8, 23, 17, 376000)}
2016-01-19 10:23:18 [scrapy] INFO: Spider closed (finished)
Thanks for your help.
Thanks for your answer. – Ron
I tried to figure out where I went wrong. I copied the path from the Chrome inspector, so I don't understand why it doesn't work for me. And did you actually get the items? I couldn't find them. – Ron
I just opened your site in the Chrome inspector and selected one of the items: it is a `div` with the class `item-inner`. Of course there are more layers of `div` tags around the item content, so you could refine the XPath further, but I only wanted to show you where to look. – GHajba