
I'm trying to scrape a retail clothing shopping site. For some reason, whenever I run the code below, I end up with only a handful of items from three of the categories (the nth-child range defined in parse()) and a few items from li:nth-child(5). Python Scrapy/Selenium is skipping most of my iterations.

Sometimes I get the following error:

2017-01-09 20:33:30 [scrapy] ERROR: Spider error processing <GET http://www.example.com/jackets> (referer: http://www.example.com/) 
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/BeardedMac/projects/thecurvyline-scraper/spiders/example.py", line 47, in parse_items
    price = node.find_element_by_css_selector('div.flex-wrapper--prod-details > div.pricing > div.price > div.standardprice').text
  File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 307, in find_element_by_css_selector
    return self.find_element(by=By.CSS_SELECTOR, value=css_selector)
  File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 511, in find_element
    {"using": by, "value": value})['value']
  File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 494, in _execute
    return self._parent.execute(command, params)
  File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 236, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 192, in check_response
    raise exception_class(message, screen, stacktrace)
StaleElementReferenceException: Message: The element reference is stale. Either the element is no longer attached to the DOM or the page has been refreshed

However, if I change the nth-child selector to, say, li:nth-child(3), I get the items for that category instead, but I can't seem to get all of them at once.

I'm quite new to Python and Scrapy, so I may just be missing something.

def __init__(self): 
    self.driver = webdriver.Chrome('/MyPath/chromedriver') 
    self.driver.set_page_load_timeout(10) 

def parse(self, response): 
    for href in response.css('#main-menu > div > li:nth-child(n+3):nth-child(-n+6) > a::attr(href)').extract(): 
        yield scrapy.Request(response.urljoin(href), callback=self.parse_items)

def get_item(self, response): 
    sizes = response.css('#pdpMain > div.productdetailcolumn.productinfo > div > div.variationattributes > div.swatches.size > ul > li > a::text').extract() 
    product_id = response.css('#riiratingsfavorites > div.riiratings > a::attr(rel)').extract_first() 
    response.meta['product']['sizes'] = sizes 
    response.meta['product']['product_id'] = product_id 
    yield response.meta['product'] 


def parse_items(self, response): 
    category = response.css('#shelf > div.category-header > h2::text').extract_first() 
    self.driver.get(response.url) 
    nodes = self.driver.find_elements_by_css_selector('#search > div.productresultarea > div.product.producttile') 
    for node in nodes: 
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(5)
        price = node.find_element_by_css_selector('div.flex-wrapper--prod-details > div.pricing > div.price > div.standardprice').text
        images = node.find_element_by_css_selector('div.image > div.thumbnail > p > a > img:nth-child(1)').get_attribute('src')
        name = node.find_element_by_css_selector('div.flex-wrapper--prod-details > div.name > a').text
        product_url = node.find_element_by_css_selector('div.flex-wrapper--prod-details > div.name > a').get_attribute('href')
        product = Product()
        product['title'] = name
        product['price'] = price
        product['product_url'] = product_url
        product['retailer'] = 'store7'
        product['categories'] = category
        product['images'] = images
        product['sizes'] = []
        product['product_id'] = []
        product['base_url'] = ''
        product_page = response.urljoin(product_url)
        yield scrapy.Request(product_page, callback=self.get_item, meta={'product': product})
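
For reference, Product is a Scrapy Item defined elsewhere in the project; a minimal sketch covering just the fields assigned above (the actual definition may differ) would be:

import scrapy

class Product(scrapy.Item):
    # Fields taken from the assignments in parse_items() and get_item()
    title = scrapy.Field()
    price = scrapy.Field()
    product_url = scrapy.Field()
    retailer = scrapy.Field()
    categories = scrapy.Field()
    images = scrapy.Field()
    sizes = scrapy.Field()
    product_id = scrapy.Field()
    base_url = scrapy.Field()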

Answer


A quick guess: what's happening here is that Scrapy is concurrent while your Selenium execution is not, so your Selenium driver gets confused. Your Scrapy crawl keeps asking the Selenium driver to load new URLs while it is still working with the old one.

To avoid this, you can disable concurrency by setting CONCURRENT_REQUESTS to 1. For example, add this to your settings.py file:

CONCURRENT_REQUESTS = 1 

Or add a custom_settings entry to your spider if you want to limit this setting to a single spider:

class MySpider(scrapy.Spider):
    custom_settings = {'CONCURRENT_REQUESTS': 1}

If you want to keep concurrency (which is a very good thing), you could try replacing Selenium with a technology that plays more nicely with Scrapy, such as Splash.
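
As a rough sketch of what that could look like with the scrapy-splash package (this assumes a running Splash instance and the scrapy-splash middlewares plus SPLASH_URL configured in settings.py as described in its README; the wait value is only an illustrative guess):

from scrapy_splash import SplashRequest

def parse(self, response):
    # Same category links as before, but rendered by Splash instead of Selenium
    for href in response.css('#main-menu > div > li:nth-child(n+3):nth-child(-n+6) > a::attr(href)').extract():
        yield SplashRequest(response.urljoin(href),
                            callback=self.parse_items,
                            args={'wait': 2})  # give the JS-built product tiles time to render

With the rendering done by Splash, parse_items can use plain response.css() selectors and Scrapy can stay concurrent, since there is no single shared browser instance to fight over.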