LinkExtractor - extracting links conditionally


I can take a list of URLs, and the crawler then follows the next-page link from each start URL. This works with the following LinkExtractor rule:

rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="pagnNext"]',)), callback="parse_start_url", follow=True),)

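But as you can imagine, I have started getting captchas on some URLs. I have heard there may be honeypot links that are invisible to humans but present in the HTML, designed so that following them identifies you as a bot.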

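I want the extractor to extract links conditionally, for example to NOT extract a link if its CSS style is display: none or something similar.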

Is this doable?

Source

2017-03-03 Can Gokalp

Not sure what you are asking – Umair

I would do something like this:

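def parse_page1(self, response):
    # Placeholder selector: replace it with the condition you want to test
    if response.css("thing i want to check exists"):
        # scrapy.Request needs a URL string, so read the href attribute
        next_href = response.xpath('//a[@class="pagnNext"]/@href').get()
        if next_href:
            # urljoin turns a relative href into an absolute URL
            return scrapy.Request(response.urljoin(next_href),
                                  callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)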

Official documentation: https://doc.scrapy.org/en/latest/topics/request-response.html
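For the specific condition in the question (skipping links hidden with display: none), the check itself can be sketched in plain Python. This is only a rough sketch under assumptions: the names is_hidden_style and visible_hrefs and the sample paths are hypothetical, and it only looks at the inline style attribute, so links hidden through a stylesheet class or by JavaScript would not be caught.

```python
import re

# Matches inline declarations such as "display:none" or "visibility: hidden".
_HIDDEN_RE = re.compile(
    r"(?:display\s*:\s*none|visibility\s*:\s*hidden)", re.IGNORECASE
)


def is_hidden_style(style):
    """Return True if an inline style attribute value hides the element."""
    return bool(style) and _HIDDEN_RE.search(style) is not None


def visible_hrefs(anchors):
    """Keep the href of every (href, style) pair whose style is not hidden."""
    return [href for href, style in anchors if href and not is_hidden_style(style)]


if __name__ == "__main__":
    # Hypothetical sample data: (href, inline style) pairs taken from <a> tags.
    sample = [
        ("/page/2", ""),
        ("/trap", "display: none"),
        ("/page/3", None),
        ("/trap2", "color:red; VISIBILITY:hidden"),
    ]
    print(visible_hrefs(sample))  # ['/page/2', '/page/3']
```

In a Scrapy callback the pairs could probably be built from the selected anchors, for example with a.attrib.get("href") and a.attrib.get("style") for each a in response.css("a.pagnNext"), assuming a Scrapy version whose selectors expose attrib. If you would rather keep the LinkExtractor rule, a similar condition might be expressed directly in restrict_xpaths, e.g. //a[@class="pagnNext"][not(contains(translate(@style, " ", ""), "display:none"))], with the same inline-style limitation.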

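Note: as for your captcha problem, try tweaking your settings. At the very least, make sure DOWNLOAD_DELAY is set to something other than 0. Have a look at the other options: https://doc.scrapy.org/en/latest/topics/settings.html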
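As a starting point, the relevant part of a settings.py might look like the sketch below. The setting names are standard Scrapy settings, but the values are illustrative assumptions, not recommendations from the answer; they need tuning per site.

```python
# settings.py (sketch): the values are assumptions, tune them for the target site.

# Seconds to wait between requests to the same site; the default of 0 means no delay.
DOWNLOAD_DELAY = 2

# Wait a random time between 0.5x and 1.5x DOWNLOAD_DELAY instead of a fixed one.
RANDOMIZE_DOWNLOAD_DELAY = True

# Let the AutoThrottle extension adapt the delay to the observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5

# Send at most one request at a time to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

Slower, less regular request timing makes the crawler look less like a bot, though it does not guarantee that captchas stop appearing.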

Source

2017-03-06 16:54:54
