Scrapy web crawler does not follow links

I am new to Scrapy. Here is my spider, which crawls twistedweb.

class TwistedWebSpider(BaseSpider):

    name = "twistedweb3"
    allowed_domains = ["twistedmatrix.com"]
    start_urls = [
        "http://twistedmatrix.com/documents/current/web/howto/",
    ]
    rules = (
        Rule(SgmlLinkExtractor(),
             'parse',
             follow=True,
        ),
    )

    def parse(self, response):
        print response.url
        filename = response.url.split("/")[-1]
        filename = filename or "index.html"
        open(filename, 'wb').write(response.body)

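When I run scrapy-ctl.py crawl twistedweb3, it only fetches the start page, so all I get is the index.html content. I tried using SgmlLinkExtractor; it extracts the links as I expected, but the spider does not follow them.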

Can you tell me where I am going wrong?

Suppose I also want to get the CSS and JavaScript files. How do I achieve that? I mean, how do I download the complete website?

2010-08-19 Iapilgrim

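You have not shown enough code here to even guess at what your problem is. I suggest you work through the Scrapy tutorial; after that, either your question will answer itself or you will be able to explain what the problem is. http://doc.scrapy.org/intro/tutorial.html – msw 2010-08-19 02:49:15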

I did follow the tutorial. What I posted above is part of my spider. – Iapilgrim 2010-08-20 06:08:02

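The rules attribute belongs to CrawlSpider. Use class MySpider(CrawlSpider). Also, when you use CrawlSpider you must not override the parse method; use parse_response or another similar name for your callback instead.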

2010-08-19 04:58:53 Rolando

Thanks, Rho. You saved my day. It worked after I made the changes you suggested. – Iapilgrim 2010-08-20 06:11:19
