2017-02-15 66 views
0

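Scrapy: nested URL data scraping

I have a website, https://www.grohe.com/in. On that site I want to get one type of bathroom faucet, https://www.grohe.com/in/25796/bathroom/bathroom-faucets/grandera/. That page lists multiple products / related products. I want to get each product's URL and scrape its data, so I wrote the code below.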

My items.py file looks like this:

from scrapy.item import Item, Field 

class ScrapytestprojectItem(Item): 
    producturl=Field() 
    imageurl=Field() 
    description=Field() 

The spider code is:

import scrapy
from ScrapyTestProject.items import ScrapytestprojectItem
class QuotesSpider(scrapy.Spider):
    name = "nestedurl"
    allowed_domains = ['www.grohe.com']
    start_urls = [
        'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/',
    ]

    def parse(self, response):
        for divs in response.css('div.viewport div.workspace div.float-box'):
            item = {'producturl': divs.css('a::attr(href)').extract(),
                    'imageurl': divs.css('a img::attr(src)').extract(),
                    'description': divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
            next_page = response.urljoin(item['producturl'])
            yield scrapy.Request(next_page, callback=self.parse, meta={'item': item})

When I run **scrapy crawl nestedurl -o nestedurl.csv**, an empty file is created. The console output is:

2017-02-15 18:03:11 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024 
2017-02-15 18:03:13 [scrapy] DEBUG: Crawled (200) <GET https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/> (referer: None) 
2017-02-15 18:03:13 [scrapy] ERROR: Spider error processing <GET https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/> (referer: None) 
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/pradeep/ScrapyTestProject/ScrapyTestProject/spiders/nestedurl.py", line 15, in parse
    next_page = response.urljoin(item['producturl'])
  File "/usr/lib/python2.7/dist-packages/scrapy/http/response/text.py", line 72, in urljoin
    return urljoin(get_base_url(self), url)
  File "/usr/lib/python2.7/urlparse.py", line 261, in urljoin
    urlparse(url, bscheme, allow_fragments)
  File "/usr/lib/python2.7/urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib/python2.7/urlparse.py", line 176, in urlsplit
    cached = _parse_cache.get(key, None)
TypeError: unhashable type: 'list'
2017-02-15 18:03:13 [scrapy] INFO: Closing spider (finished)
2017-02-15 18:03:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 253,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 31063,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 2, 15, 12, 33, 13, 396542),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/TypeError': 1,
 'start_time': datetime.datetime(2017, 2, 15, 12, 33, 11, 568424)}
2017-02-15 18:03:13 [scrapy] INFO: Spider closed (finished)

Answers

0

I think the problem is that divs.css('a::attr(href)').extract() returns a list. Passing that list to urljoin makes urlparse crash, because a list cannot be hashed for its cache lookup.
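
The failure can be reproduced without Scrapy, using only the standard library. This is a minimal sketch assuming Python 3, where `urljoin` lives in `urllib.parse` and rejects a list with a `TypeError` whose message differs from the Python 2.7 one in the traceback. The `hrefs` list is a stand-in for what `.extract()` returns, and `try_join` is a helper made up for this illustration.

```python
from urllib.parse import urljoin

base = "https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/"

# Stand-in for divs.css('a::attr(href)').extract(): a list of strings.
hrefs = ["/in/8257/bathroom/bathroom-faucets/essence/product-details/"
         "?product=19408-G145&color=000&material=19408000"]


def try_join(base_url, href):
    """Return the joined URL, or the exception class name if urljoin rejects href."""
    try:
        return urljoin(base_url, href)
    except TypeError as exc:
        return type(exc).__name__


list_result = try_join(base, hrefs)      # the whole list: rejected
first_result = try_join(base, hrefs[0])  # a single string: joined normally

print(list_result)
print(first_result)
```

Passing one string instead of the list is enough to get an absolute product URL on the same host.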

0

The URL is being generated incorrectly.

You should enable logging and log some messages to debug your code.

import scrapy, logging
from ScrapyTestProject.items import ScrapytestprojectItem
class QuotesSpider(scrapy.Spider):
    name = "nestedurl"
    allowed_domains = ['www.grohe.com']
    start_urls = [
        'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/',
    ]

    def parse(self, response):
        for divs in response.css('div.viewport div.workspace div.float-box'):
            # extract_first('') gives one string; urljoin() raises on the list from extract()
            item = {'producturl': divs.css('a::attr(href)').extract_first(''),
                    'imageurl': divs.css('a img::attr(src)').extract(),
                    'description': divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
            next_page = response.urljoin(item['producturl'])

            logging.info(next_page)  # see what it prints in the console

            yield scrapy.Request(next_page, callback=self.parse, meta={'item': item})
+0

The generated URL looks like '/in/8257/bathroom/bathroom-faucets/essence/product-details/?product=19408-G145&color=000&material=19408000'. It should be appended to the 'www.grohe.in' URL; then it makes sense. – mvnpgh

+0

Logger info: [https://www.grohe.com/in/8257/bathroom/bathroom-faucets/essence/product-details/?product=33623-G145&color=000&material=33623000] ... multiple URLs are formed the same way. – mvnpgh

+0

No, you can join the URL manually, like 'www.grohe.in' + item['producturl']. – Umair
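
For comparison, here is a rough sketch of manual concatenation next to `urljoin`, which is what `response.urljoin` uses according to the traceback in the question. It assumes Python 3 and the `https://www.grohe.com` host from the question's `start_urls`; `page_relative` is a hypothetical href without a leading slash, added only to show where the two approaches differ.

```python
from urllib.parse import urljoin

page_url = "https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/"
site_root = "https://www.grohe.com"  # assumed host, taken from start_urls

# Root-relative href as reported in the comments above.
root_relative = ("/in/8257/bathroom/bathroom-faucets/essence/product-details/"
                 "?product=19408-G145&color=000&material=19408000")
# Hypothetical page-relative href (no leading slash).
page_relative = "product-details/?product=19408-G145&color=000&material=19408000"

manual_root = site_root + root_relative
joined_root = urljoin(page_url, root_relative)

manual_page = site_root + page_relative          # host runs straight into the path
joined_page = urljoin(page_url, page_relative)   # resolved against the page URL

print(manual_root == joined_root)
print(manual_page)
print(joined_page)
```

For root-relative links both approaches give the same result, so manual concatenation mainly adds a place to get the scheme or host wrong.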

0
item = {'producturl': divs.css('a::attr(href)').extract(),  # <--- issue here
        'imageurl': divs.css('a img::attr(src)').extract(),
        'description': divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
next_page = response.urljoin(item['producturl'])  # <--- here item['producturl'] is a list

To fix this, use .extract_first(''):

item = {'producturl': divs.css('a::attr(href)').extract_first(''),
        'imageurl': divs.css('a img::attr(src)').extract_first(''),
        'description': divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
next_page = response.urljoin(item['producturl'])
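
To show what the `''` default does in practice, the sketch below imitates `.extract_first(default)` on plain Python lists. `first_or_default` is a hypothetical stand-in written for this illustration, not a Scrapy function, and the example assumes Python 3's `urllib.parse.urljoin`.

```python
from urllib.parse import urljoin


def first_or_default(values, default=None):
    """Illustrative stand-in for .extract_first(default) on a plain list."""
    return values[0] if values else default


base = "https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/"
hrefs = ["/in/8257/bathroom/bathroom-faucets/essence/product-details/"
         "?product=19408-G145&color=000&material=19408000"]
no_hrefs = []  # a box that contains no link at all

next_page = urljoin(base, first_or_default(hrefs, ''))
fallback = urljoin(base, first_or_default(no_hrefs, ''))  # '' joins back to base

print(next_page)
print(fallback)
```

With an empty default, a box without a link resolves to the listing page's own URL, so it may be cleaner to skip such boxes than to request that URL again.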
+0

In my spider code I used .extract_first() / .extract_first(''), but the output is still the same, with no change. I tested the same thing in the scrapy shell with .extract() itself and it seems fine. – mvnpgh

+0

producturl looks like ----> /in/8257/bathroom/bathroom-faucets/essence/product-details/?product=19408-G145&color=000&material=19408000, and after that we form the link as 'https://www.grohe.com/in/8257/bathroom/bathroom-faucets/essence/product-details/?product=19408-G145&color=000&material=19408000' – mvnpgh