Re-requesting the current URL from parse() in Python Scrapy

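I have a simple script that scrapes data from Amazon. As you all know, Amazon sometimes serves a CAPTCHA, and when it does the page title is 'Robot Check'. I wrote logic for this case: if the page title is 'Robot Check', print a message saying the page was not scraped because it contains a CAPTCHA, and do not extract data from that page. Otherwise, continue with the script.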

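In the if branch I tried yield scrapy.Request(response.url, callback=self.parse) to re-request the current URL, but it did not work. All I need is to re-request response.url and continue the script. I think I have to remove response.url from the log file so that Scrapy does not remember the URL as already crawled; in short, I have to trick Scrapy into requesting the same URL again. Alternatively, if there is a way to mark response.url as a failed URL, Scrapy could re-request it automatically.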

Below is the simple script. start_urls is defined in a separate file named urls in the same folder, so I import it from that file.

import scrapy
import re
from urls import start_urls

class AmazondataSpider(scrapy.Spider):
    name = 'amazondata'
    allowed_domains = ['https://www.amazon.co.uk']

    def start_requests(self):
        for x in start_urls:
            yield scrapy.Request(x, self.parse)

    def parse(self, response):
        try:
            if 'Robot Check' == str(response.xpath('//title/text()').extract_first().encode('utf-8')):
                print '\n\n\n The ROBOT CHeCK Page This link is reopening......\n\n\n'
                print 'URL : ',response.url,'\n\n'
                yield scrapy.Request(response.url, callback=self.parse)
            else:
                print '\n\nThere is a data in this page no robot check or captcha\n\n'
                pgtitle = response.xpath('//title/text()').extract_first().encode('utf-8')
                print '\n\n\nhello', pgtitle,'\n\n\n'
                if pgtitle == 'Robot check':
                    # LOGIC FOR GET DATA BY XPATH on RESPONSE
                    pass
        except Exception as e:
            print '\n\n\n\n',e,'\n\n\n\n\n'

Source

2017-06-18 Manthankuamr

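Tell Scrapy not to filter out the duplicate request. By default, Scrapy's duplicate filter drops a request for a URL it has already requested, so the second request for response.url is never sent.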

Pass dont_filter=True to the request.
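To see why the plain re-request disappears, here is a toy model of a duplicate filter. It is a simplified illustration, not Scrapy's actual implementation; ToyScheduler, enqueue and the example URL are names made up for this sketch.

```python
# Simplified model of duplicate filtering; NOT Scrapy's real code.
class ToyScheduler:
    def __init__(self):
        self.seen = set()   # URLs that have already been accepted
        self.queue = []     # requests waiting to be downloaded

    def enqueue(self, url, dont_filter=False):
        """Return True if the request was queued, False if dropped as a duplicate."""
        if not dont_filter and url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True


scheduler = ToyScheduler()
url = "https://www.amazon.co.uk/dp/EXAMPLE"  # hypothetical URL

first = scheduler.enqueue(url)                             # first visit is accepted
retry_filtered = scheduler.enqueue(url)                    # same URL again is dropped
retry_forced = scheduler.enqueue(url, dont_filter=True)    # flag bypasses the check

print(first, retry_filtered, retry_forced)  # True False True
```

The second call is silently dropped, which matches what happens to the request yielded from parse() in the question; only the call with the flag set gets through.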

In your case:

print '\n\n\n The ROBOT CHeCK Page This link is reopening......\n\n\n' 
print 'URL : ',response.url,'\n\n' 
yield scrapy.Request(response.url, callback=self.parse, dont_filter=True)

Source

2017-06-18 07:04:15 Umair
