Why can't I scrape this website with Scrapy?
I tried some very simple Scrapy code to see whether I could get anything from the website, but whatever I try, I get nothing back.
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.log import *
from vacatures.settings import *
from vacatures.items import *
from scrapy.http import Request


class VacaturesSpider(CrawlSpider):
    name = 'vacatures_spider'
    allowed_domains = ['www.itbanen.nl']
    start_urls = ['http://www.itbanen.nl/vacature/zoeken/overzicht/wijzigingsdatum/query//distance/30/output/html/items_per_page/15/page/1/ignore_ids']

    def parse(self, response):
        self.log('Nieuwe pagina! %s' % response.url)
        #hxs = HtmlXPathSelector(response)
        sel = Selector(response)
        # HXS to find url that goes to detail page
        test = sel.xpath('//div[@id="resultlist"]/div[@class="resultlist"]/h2/text()').extract()
        print test
        links = sel.xpath('//div[@class="container"]/h2/text()')
        print links
        for link in links:
            link_item = link.extract()
            print link_item
            #yield Request(complete_url(link_item), callback=self.parse_category)
Maybe you could inspect `response` first to find out what you actually got back?
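A minimal sketch of what such a check could look like. The helper name `describe_response` is hypothetical (it is not part of Scrapy); it only relies on the `url`, `status` and `body` attributes that Scrapy response objects expose.

```python
def describe_response(response, preview_chars=200):
    """Summarise a downloaded response so an empty result can be diagnosed.

    Hypothetical helper: `response` only needs `url`, `status` and `body`
    (raw bytes), which Scrapy's response objects provide.
    """
    body = response.body
    return {
        "url": response.url,        # final URL of the response
        "status": response.status,  # HTTP status code
        "length": len(body),        # size of the raw body in bytes
        # Start of the body, decoded leniently so bad bytes cannot raise.
        "preview": body[:preview_chars].decode("utf-8", errors="replace"),
    }

# Inside the spider's parse() method one could then write:
#     self.log(str(describe_response(response)))
```

If the status is not 200, or the preview does not look like the page seen in the browser, the problem lies with the download rather than with the XPath. Running `scrapy shell` with the start URL and inspecting `response` interactively is another way to do the same check.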
Are you sure your XPath expressions are correct? I don't see any elements in the page that match them. You could use CSS selectors instead, for example `sel.css('div#resultlist div.resultlist h2::text')` and `sel.css('div.container h2::text')`.
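One way an XPath can miss where a CSS selector matches: `@class="resultlist"` compares the whole attribute string, while `div.resultlist` matches any element that has `resultlist` among its classes. The sketch below shows the difference with the standard library only. The markup is made up for illustration, so whether the real page looks like this is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the real itbanen.nl page may be structured differently.
SAMPLE = """
<html><body>
  <div id="resultlist">
    <div class="resultlist first"><h2>Python developer</h2></div>
    <div class="resultlist"><h2>Java developer</h2></div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# Exact string comparison on the class attribute: misses "resultlist first".
exact = [
    h2.text
    for h2 in root.findall(".//div[@id='resultlist']/div[@class='resultlist']/h2")
]


def has_class(element, name):
    """Return True if `name` is one of the element's space-separated classes."""
    return name in element.get("class", "").split()


# Token-based comparison, which is how a CSS class selector behaves.
by_token = [
    h2.text
    for div in root.findall(".//div[@id='resultlist']/div")
    if has_class(div, "resultlist")
    for h2 in div.findall("h2")
]

print(exact)     # ['Java developer']
print(by_token)  # ['Python developer', 'Java developer']
```

Note also that both versions select `h2` only as a direct child of the `div`. If the heading is nested deeper, the `/h2` step matches nothing, whereas the descendant combinator in `div.resultlist h2` still does.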
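My check is simply trying to get content from the page, but even with this simple script I get nothing back. Yet if I run the same script against other websites (as a test), it does work? – Beer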