Using scrapy shell
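The itemprop XPath you are searching for isn't there because, as @furas said, part of the content is generated by JavaScript. You can get at this content by adding Selenium to Scrapy. Selenium takes the URL and renders it in a web browser, and Scrapy can then access the generated HTML as usual. The code below is a skeleton to get you started with Firefox, but it works with other browsers too. I'd also suggest getting Firebug for Firefox, which is useful for practicing XPaths.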
import scrapy
from scrapy import signals
from scrapy.http import TextResponse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException


class SearchSpider(scrapy.Spider):
    name = "search"
    allowed_domains = ['www.somedomain.com']
    start_urls = ['https://www.somewebsite.com']

    def __init__(self, filename=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # wire us up to selenium
        self.driver = webdriver.Firefox()

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # shut the browser down when the spider closes; this replaces
        # scrapy.xlib.pydispatch, which newer Scrapy releases no longer provide
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # quit() ends the whole browser session, not just the current window
        self.driver.quit()

    def parse(self, response):
        item = someItem()  # placeholder: use your own Item class here
        # Load the current page into Selenium
        self.driver.get(response.url)
        try:
            # wait up to 30 seconds for the JavaScript-generated element to appear
            WebDriverWait(self.driver, 30).until(
                EC.presence_of_element_located((By.XPATH, '//span[@itemprop="name"]')))
        except TimeoutException:
            item['status'] = 'timed out'
        # Sync scrapy and selenium so they agree on the page we're looking at then let scrapy take over
        resp = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
        # scrape as normal
No element in the document uses the 'itemprop' attribute.
If the content is generated by JavaScript, you won't be able to find it. – furas
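
One way to confirm what these comments describe is to check the downloaded body (for example, response.text in scrapy shell) for the attribute before reaching for a browser. The sketch below uses only the standard library's html.parser. The helper names and both HTML strings are hypothetical, chosen to mimic a page whose span is created by a script.

```python
from html.parser import HTMLParser


class AttributeFinder(HTMLParser):
    """Collect the names of tags that carry a given attribute."""

    def __init__(self, attribute):
        super().__init__()
        self.attribute = attribute
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for this start tag
        if any(name == self.attribute for name, _ in attrs):
            self.tags.append(tag)


def tags_with_attribute(html, attribute):
    """Return the tags in html that have the given attribute, in document order."""
    finder = AttributeFinder(attribute)
    finder.feed(html)
    finder.close()
    return finder.tags


# Hypothetical raw body as downloaded: the span does not exist yet,
# it is only created when a browser runs the script.
RAW_HTML = """
<html><body>
  <div id="app"></div>
  <script>
    var s = document.createElement("span");
    s.setAttribute("itemprop", "name");
    document.getElementById("app").appendChild(s);
  </script>
</body></html>
"""

# Hypothetical body after a browser has executed the script.
RENDERED_HTML = """
<html><body>
  <div id="app"><span itemprop="name">Example</span></div>
</body></html>
"""

if __name__ == "__main__":
    print(tags_with_attribute(RAW_HTML, "itemprop"))
    print(tags_with_attribute(RENDERED_HTML, "itemprop"))
```

If the first call finds nothing while the page in a browser clearly shows the element, the content is most likely being generated client-side, which is the case the Selenium approach is meant for.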