2016-01-04 56 views

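I want to scrape data from https://www.proptiger.com/noida/knowledge-park-v/supertech-sports-village-665980, but Scrapy does not seem to parse the entire HTML content.

When I run response.xpath('//span') in the Scrapy shell, it does not return all of the span tags, and response.xpath('//span[@itemprop="name"]') returns an empty list: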

>>> response.xpath('//span[@itemprop="name"]') 
[] 

No element in the document uses the 'itemprop' attribute. –


If the content is generated by JavaScript, you won't be able to find it in the response. – furas
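To illustrate the point made in the comment above, here is a minimal sketch using only the Python standard library. The two HTML strings are made-up stand-ins (not the real page): one for what the server sends before any script runs, one for what a browser holds after the script has filled in the listing. The helper name count_spans is hypothetical.

```python
from html.parser import HTMLParser


class ItempropCounter(HTMLParser):
    """Count <span> tags, and how many of them carry itemprop="name"."""

    def __init__(self):
        super().__init__()
        self.span_count = 0
        self.name_count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        self.span_count += 1
        if dict(attrs).get("itemprop") == "name":
            self.name_count += 1


def count_spans(html):
    """Return (total spans, spans with itemprop="name") for an HTML string."""
    parser = ItempropCounter()
    parser.feed(html)
    parser.close()
    return parser.span_count, parser.name_count


# Hypothetical markup as delivered by the server, before any script runs.
RAW_HTML = (
    '<html><body><span class="logo">Site</span>'
    '<div id="app"></div><script>/* builds the listing */</script>'
    '</body></html>'
)

# Hypothetical markup after a browser has executed the script.
RENDERED_HTML = (
    '<html><body><span class="logo">Site</span>'
    '<div id="app"><span itemprop="name">Example Project</span></div>'
    '</body></html>'
)

print("raw:", count_spans(RAW_HTML))
print("rendered:", count_spans(RENDERED_HTML))
```

Scrapy only ever sees the first kind of markup, which would explain both the missing spans and the empty result for the itemprop query.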

Answer


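The itemprop XPath you are searching for finds nothing in the Scrapy shell because, as @furas said, part of the content is generated by JavaScript. You can get at this content by adding Selenium to Scrapy. Selenium takes the URL and renders it in a real web browser, after which Scrapy can work with the generated HTML as usual. The code below is a skeleton to get you started with Firefox, but the same approach works with other browsers. I also suggest getting Firebug for Firefox, which is useful for practicing XPaths.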

import scrapy
from scrapy.http import TextResponse

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class SearchSpider(scrapy.Spider):
    name = "search"

    # Placeholders: replace both with the site you are scraping
    allowed_domains = ['www.somedomain.com']
    start_urls = ['https://www.somewebsite.com']

    def __init__(self, filename=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # wire us up to selenium
        self.driver = webdriver.Firefox()

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes (shortcut for the
        # spider_closed signal); quit() shuts down the whole browser session
        self.driver.quit()

    def parse(self, response):
        item = {}  # placeholder; substitute your own scrapy.Item subclass

        # Load the current page into Selenium
        self.driver.get(response.url)

        try:
            # Wait up to 30 seconds for the JavaScript-generated element to appear
            WebDriverWait(self.driver, 30).until(
                EC.presence_of_element_located((By.XPATH, '//span[@itemprop="name"]'))
            )
        except TimeoutException:
            item['status'] = 'timed out'

        # Sync scrapy and selenium so they agree on the page we're looking at then let scrapy take over
        resp = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
        # scrape as normal, e.g.:
        item['name'] = resp.xpath('//span[@itemprop="name"]/text()').get()
        yield item