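Hello, I am using Scrapy to scrape news from a website, but I get an error when I run the spider. The site has many news pages, and the news URLs look like www.example.com/34223. I am trying to find a way to solve this problem. Here is my code. My Scrapy version is 1.4.0 and I am on macOS.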
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class Example(scrapy.Spider):
    name = "example"
    allowed_domains = ["http://www.example.com"]
    start_urls = ["http://www.example.com"]

    rules = (
        #self.log('testing rules' + response.url)
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('/*',), deny=(' ',))),
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['title'] = response.xpath('/html/body/div[3]/div/div/div[1]/div[1]/div/div[2]/text()').extract()
        item['img_url'] = response.xpath('/html/body/div[3]/div/div/div[1]/div[1]/div/div[3]/img').extract()
        item['description'] = response.xpath('/html/body/div[3]/div/div/div[1]/div[1]/div/div[5]/text()').extract()
        return item
When I run the code I get this error: Spider error processing (referer: None) – Raed
Change allowed_domains = ["http://www.example.com"] to allowed_domains = ["www.example.com"] and see if it works.
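To illustrate the point about allowed_domains: the setting is meant to hold bare domain names, not full URLs. Here is a small standard-library sketch that strips a URL down to its host name. The helper name `to_allowed_domain` is made up for this example and is not part of Scrapy.

```python
from urllib.parse import urlparse


def to_allowed_domain(url_or_domain):
    """Return the bare host name from a URL or a domain-like string."""
    candidate = url_or_domain.strip()
    # urlparse only fills in the host when a scheme or a leading "//" is present.
    if "://" not in candidate:
        candidate = "//" + candidate
    return urlparse(candidate).hostname


print(to_allowed_domain("http://www.example.com"))   # www.example.com
print(to_allowed_domain("www.example.com/34223"))    # www.example.com
```

The result can then be used as `allowed_domains = [to_allowed_domain(start_url)]`, where `start_url` is whatever URL the spider starts from.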