Scrapy crawler: setting rules

This is my first attempt at subclassing scrapy's CrawlSpider. I based the spider below closely on the documentation example at https://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider-example:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Test_Spider(CrawlSpider):
    name = "test"
    allowed_domains = ['http://www.dragonflieswellness.com']
    start_urls = ['http://www.dragonflieswellness.com/wp-content/uploads/2015/09/']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # Rule(LinkExtractor(allow=('category\.php',), deny=('subsection\.php',))),

        # Extract links matching '.jpg' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow='.jpg'), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        print(response.url)
I'm trying to get the spider to start in the prescribed directory and extract all the '.jpg' links within it, but instead I see:
2016-09-29 13:07:35 [scrapy] INFO: Spider opened
2016-09-29 13:07:35 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-09-29 13:07:35 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-09-29 13:07:36 [scrapy] DEBUG: Crawled (200) <GET http://www.dragonflieswellness.com/wp-content/uploads/2015/09/> (referer: None)
2016-09-29 13:07:36 [scrapy] INFO: Closing spider (finished)
How do I get this working?
Thanks, that does help, but I'm still trying to understand how it works. I'd like to download the jpg files in this case, so could I ask for an example that includes the pipeline functionality? – user61629
Take a look at my edited answer. – mihal277
Thanks for taking the time. I'm interested in seeing how people take different approaches. – user61629