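Download and save images from a website using Scrapy

I'm new to Scrapy and Python, so my question may be a simple one. Following an existing guide, I wrote a scraper that crawls the pages of a website and writes the image URL, name, and ... to an output file. I want to download the images into a directory, but the output directory is empty!
Here is my code: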
myspider.py
import scrapy


class BrickSetSpider(scrapy.Spider):
    name = 'brick_spider'
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 a ::text'
            PIECES_SELECTOR = './/dl[dt/text() = "Pieces"]/dd/a/text()'
            MINIFIGS_SELECTOR = './/dl[dt/text() = "Minifigs"]/dd[2]/a/text()'
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'pieces': brickset.xpath(PIECES_SELECTOR).extract_first(),
                'minifigs': brickset.xpath(MINIFIGS_SELECTOR).extract_first(),
                'image': brickset.css(IMAGE_SELECTOR).extract_first(),
            }

        NEXT_PAGE_SELECTOR = '.next a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
settings.py
ITEM_PIPELINES = {'brickset.pipelines.BricksetPipeline': 1}
IMAGES_STORE = '/home/nmd/brickset/brickset/spiders/output'
items.py
import scrapy


class BrickSetSpider(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    pass
You didn't show us what is probably the most important part: the code of the 'brickset.pipelines.BricksetPipeline' class.
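Since the custom pipeline class is not shown, one possible alternative is Scrapy's built-in `ImagesPipeline`, which downloads whatever URLs an item lists under `image_urls` into `IMAGES_STORE` (it needs Pillow installed). The following is only a sketch under that assumption: `build_item` is a hypothetical helper, not part of the question's code, and it illustrates the item shape the pipeline expects rather than the `'image'` key the spider currently yields.

```python
# settings.py (sketch): switch to Scrapy's built-in image pipeline
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/home/nmd/brickset/brickset/spiders/output'  # path from the question


def build_item(name, image_src):
    """Hypothetical helper: shape one scraped set for ImagesPipeline."""
    return {
        'name': name,
        # ImagesPipeline reads a *list* of absolute URLs from 'image_urls'
        'image_urls': [image_src] if image_src else [],
        # the pipeline adds an 'images' key with the download results
    }
```

In the spider, the dict yielded from `parse` would then carry `image_urls` (a list) instead of the single `image` string; relative `src` values should be passed through `response.urljoin` first.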
The code you wrote only scrapes data from the site, such as the image src URL. So take the yielded data and download the images with wget.
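The same idea can be sketched in plain Python instead of wget, using only the standard library. This assumes the spider's items were exported as a JSON list and that each item keeps the `'image'` key from the question; the file name `items.json`, the helper names, and the `fetch` parameter are illustrative assumptions.

```python
import json
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def filename_for(url):
    """Use the last path segment of the URL as the local file name."""
    return os.path.basename(urlparse(url).path) or 'unnamed'


def download_images(items, out_dir, fetch=urlretrieve):
    """Download each item's 'image' URL into out_dir; return the saved paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for item in items:
        url = item.get('image')
        if not url:
            continue  # skip items that have no image URL
        path = os.path.join(out_dir, filename_for(url))
        fetch(url, path)  # urlretrieve(url, filename) by default
        saved.append(path)
    return saved


# Example usage (assumes an exported 'items.json' in the current directory):
#     with open('items.json') as f:
#         download_images(json.load(f), 'output')
```

Note that two images sharing the same file name would overwrite each other in this sketch; a real script would need a naming scheme that avoids collisions.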