2017-08-26

Scrapy + Selenium issue

I am trying to scrape the website of a well-known UK retailer using Selenium and Scrapy (see the code below). I keep getting a [scrapy.core.scraper] ERROR: Spider error processing and have no idea what else to try (I have been stuck for about three hours). Thanks for your support.

    import scrapy
    from selenium import webdriver
    from nl_scrape.items import NlScrapeItem
    from datetime import date
    import time

    class ProductSpider(scrapy.Spider):
        name = "product_spider"
        allowed_domains = ['newlook.com']
        start_urls = ['http://www.newlook.com/uk/womens/clothing/c/uk-womens-clothing?comp=NavigationBar%7Cmn%7Cwomens%7Cclothing#/?q=:relevance&page=1&sort=relevance&content=false']

        def __init__(self):
            self.driver = webdriver.Safari()
            self.driver.set_window_size(800, 600)
            time.sleep(4)

        def parse(self, response):
            self.driver.get(response.url)
            time.sleep(4)

            # Collect products
            products = driver.find_elements_by_class_name('plp-item ng-scope')

            # Iterate over products; extract data and append individual features to NlScrapeItem
            for item in products:

                # Pull features
                desc = item.find_element_by_class_name('product-item__name link--nounderline ng-binding').text
                href = item.find_element_by_class_name('plp-carousel__img-link ng-scope').get_attribute('href')

                # Price symbol removal and float conversion
                priceString = item.find_element_by_class_name('price ng-binding').text
                priceInt = priceString.split('£')[1]
                price = float(priceInt)

                # Generate a product identifier
                identifier = href.split('/p/')[1].split('?comp')[0]
                identifier = int(identifier)

                # datetime
                dt = date.today()
                dt = dt.isoformat()

                # NlScrapeItem
                item = NlScrapeItem()

                # Append product features to NlScrapeItem
                item['id'] = identifier
                item['href'] = href
                item['description'] = desc
                item['price'] = price
                item['firstSighted'] = dt
                item['lastSighted'] = dt
                yield item

            self.driver.close()

2017-08-26 15:48:38 [scrapy.core.scraper] ERROR: Spider error processing <http://www.newlook.com/uk/womens/clothing/c/uk-womens-clothing?comp=NavigationBar%7Cmn%7Cwomens%7Cclothing#/?q=:relevance&page=1&sort=relevance&content=false> (referer: None)

    Traceback (most recent call last):
      File "/Users/username/Documents/nl_scraping/nl_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/Users/username/Documents/nl_scraping/nl_scrape/nl_scrape/spiders/product_spider.py", line 18, in parse
        products = driver.find_elements_by_class_name('plp-item ng-scope')
    NameError: name 'driver' is not defined


Try using products = self.driver.find_elements_by_class_name('plp-item ng-scope') and let's see whether that works – Kapil


@Kapil: No luck, unfortunately :( _ERROR: Spider error processing_ still occurs – Philipp


Does Safari at least start? – Kapil

Answer

So, your code has two problems:

def parse(self, response): 
    self.driver.get(response.url) 
    time.sleep(4) 

    # Collect products 
    products = driver.find_elements_by_class_name('plp-item ng-scope') 

You conveniently changed self.driver to just driver. That does not work: driver is never defined inside parse, which is exactly what the NameError in your traceback says. Bind a local alias at the top of the function:

def parse(self, response): 
    driver = self.driver 
    driver.get(response.url) 
    time.sleep(4) 

    # Collect products 
    products = driver.find_elements_by_class_name('plp-item ng-scope') 

Next, you call self.driver.close() at the end of the parse function, so the browser is closed as soon as a single URL has been processed. That is wrong, so remove that line.
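One natural place to close the browser instead (a minimal sketch, not code from the question): Scrapy calls a spider's closed(reason) method once, when the spider shuts down. DummyDriver below is a hypothetical stand-in for webdriver.Safari() so the sketch runs without Selenium; in the real spider you would call self.driver.quit().

```python
class DummyDriver:
    """Hypothetical stand-in for webdriver.Safari(), so the sketch is self-contained."""
    def __init__(self):
        self.is_open = True

    def quit(self):
        self.is_open = False


class ProductSpider:  # subclasses scrapy.Spider in the real project
    def __init__(self):
        self.driver = DummyDriver()  # webdriver.Safari() in the real spider

    def closed(self, reason):
        # Scrapy invokes closed(reason) exactly once, after the last request
        # finishes, so the browser stays open for every URL the spider visits.
        self.driver.quit()


spider = ProductSpider()
# ... parse() handles many URLs with the same open driver ...
spider.closed("finished")   # Scrapy calls this automatically on shutdown
print(spider.driver.is_open)  # → False
```

This way parse() never touches the driver's lifecycle and every scheduled URL is processed before the browser goes away.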


That was it – thanks a lot – I can't upvote because my reputation is still too low! – Philipp


@Tarun Lalwani, although this is not my thread, I would be very glad to know where the line self.driver.close() should go. Thanks. – SIM


@Shahin, you need to listen for the 'spider_closed' signal and run the code there. There is a simple example of how to hook into this signal here: https://doc.scrapy.org/en/latest/topics/signals.html –
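The signal-based pattern from that docs page looks roughly like the sketch below. FakeSignals is a hypothetical stand-in for crawler.signals so the example runs without Scrapy; in a real spider you would do the connect call inside the from_crawler classmethod, as the linked documentation shows.

```python
class FakeSignals:
    """Tiny dispatcher mimicking how crawler.signals delivers spider_closed."""
    def __init__(self):
        self._receivers = []

    def connect(self, receiver, signal):
        self._receivers.append((signal, receiver))

    def send(self, signal, **kwargs):
        for sig, receiver in self._receivers:
            if sig == signal:
                receiver(**kwargs)


class ProductSpider:
    def __init__(self, signals):
        self.driver_open = True  # stands in for the live Selenium driver
        # Real code (inside from_crawler):
        #   crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        signals.connect(self.spider_closed, signal="spider_closed")

    def spider_closed(self, spider):
        # Runs exactly once, when the spider_closed signal is fired.
        self.driver_open = False  # real code: self.driver.quit()


signals = FakeSignals()
spider = ProductSpider(signals)
signals.send("spider_closed", spider=spider)  # Scrapy fires this at shutdown
print(spider.driver_open)  # → False
```

Either this or the simpler closed(reason) shortcut ensures the browser is torn down once, at the end of the crawl, rather than after the first parsed page.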