
Following the Scrapy documentation, I crawl and extract data from several sites. My code works fine with regular websites, but for a site behind Sucuri I don't get any data; the Sucuri firewall seems to block me from reaching the actual site markup. How can I scrape a Sucuri-protected website?

The target site is http://www.dwarozh.net/, and here is a snippet of my spider:

from scrapy import Spider
from scrapy.selector import Selector
import scrapy

from Stack.items import StackItem
from bs4 import BeautifulSoup
from scrapy import log
from scrapy.utils.response import open_in_browser


class StackSpider(Spider):
    name = "stack"
    start_urls = [
        "http://www.dwarozh.net/sport/",
    ]

    def parse(self, response):
        mItems = Selector(response).xpath('//div[@class="news-more-img"]/ul/li')
        for mItem in mItems:
            item = StackItem()
            item['title'] = mItem.xpath('a/h2/text()').extract_first()
            item['url'] = mItem.xpath('viewa/@href').extract_first()
            yield item

This is the response I get:

<html><title>You are being redirected...</title> 
<noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript> 
<script>var s={},u,c,U,r,i,l=0,a,e=eval,w=String.fromCharCode,sucuri_cloudproxy_js='',S='cz0iMHNlYyIuc3Vic3RyKDAsMSkgKyAnNXlCMicuc3Vic3RyKDMsIDEpICsgJycgKycnKyIxIi5zbGljZSgwLDEpICsgJ2pQYycuY2hhckF0KDIpKyJmIiArICIiICsnbz1jJy5jaGFyQXQoMikrICcnICsgCiI0Ii5zbGljZSgwLDEpICsgJ0FvPzcnLnN1YnN0cigzLCAxKSArIjUiICsgU3RyaW5nLmZyb21DaGFyQ29kZSgxMDIpICsgIiIgKycxJyArICAgJycgKyAKIjFzZWMiLnN1YnN0cigwLDEpICsgICcnICsnJysnMycgKyAgImUiLnNsaWNlKDAsMSkgKyAiIiArImZzdSIuc2xpY2UoMCwxKSArICIiICsiMnN1Y3VyIi5jaGFyQXQoMCkrICcnICtTdHJpbmcuZnJvbUNoYXJDb2RlKDEwMCkgKyAgJycgKyI5c3UiLnNsaWNlKDAsMSkgKyAgJycgKycnKyI2IiArICdDYycuc2xpY2UoMSwyKSsiNnN1Ii5zbGljZSgwLDEpICsgJ2YnICsgICAnJyArIAonYScgKyAgIjAiICsgJ2YnICsgICI0IiArICI2c2VjIi5zdWJzdHIoMCwxKSArICAnJyArIAonWnBFMScuc3Vic3RyKDMsIDEpICsiMSIgKyBTdHJpbmcuZnJvbUNoYXJDb2RlKDB4MzgpICsgIiIgKyI1c3VjdXIiLmNoYXJBdCgwKSsiZnN1Ii5zbGljZSgwLDEpICsgJyc7ZG9jdW1lbnQuY29va2llPSdzc3VjJy5jaGFyQXQoMCkrICd1JysnJysnYycuY2hhckF0KDApKyd1c3VjdXInLmNoYXJBdCgwKSsgJ3JzdWMnLmNoYXJBdCgwKSsgJ3N1Y3VyaScuY2hhckF0KDUpICsgJ19zdScuY2hhckF0KDApICsnY3N1Y3VyJy5jaGFyQXQoMCkrICdsJysnbycrJ3UnLmNoYXJBdCgwKSsnZCcrJ3AnKycnKydyc3VjdScuY2hhckF0KDApICArJ3NvJy5jaGFyQXQoMSkrJ3gnKyd5JysnX3N1Y3VyaScuY2hhckF0KDApICsgJ3UnKyd1JysnaXN1Y3VyaScuY2hhckF0KDApICsgJ3N1Y3VkJy5jaGFyQXQoNCkrICdzXycuY2hhckF0KDEpKycxJysnOCcrJzEnKydzdWN1cmQnLmNoYXJBdCg1KSArICdlJy5jaGFyQXQoMCkrJzEnKydzdWN1cjEnLmNoYXJBdCg1KSArICcxc3VjdXJpJy5jaGFyQXQoMCkgKyAnMicrIj0iICsgcyArICc7cGF0aD0vO21heC1hZ2U9ODY0MDAnOyBsb2NhdGlvbi5yZWxvYWQoKTs=';L=S.length;U=0;r='';var A='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz/';for(u=0;u<64;u++){s[A.charAt(u)]=u;}for(i=0;i<L;i++){c=s[S.charAt(i)];U=(U<<6)+c;l+=6;while(l>=8){((a=(U>>>(l-=8))&0xff)||(i<(L-2)))&&(r+=w(a));}}e(r);</script></html> 

How can I bypass Sucuri with Scrapy?

You could rewrite the JavaScript in Python and use it to decode the encrypted data. – furas

@furas How? Please give more details in your comment if you can. – zhilevan

Simple: take the snippet of code inside the script tag and evaluate it. It basically sets a cookie like 'sucuri_cloudproxy_uuid_181de1112=021zf6475f112ez2d96c6fa0f411183f; path=/; max-age=86400' and reloads the window/tab/page. –
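
In Python, that idea looks roughly like the sketch below. This is only a sketch: it assumes the challenge keeps exactly the shape shown above (a base64 blob assigned to S, an assignment to s, then document.cookie and a reload). The regex, the string splitting, and the use of the third-party js2py interpreter are my own choices, and Sucuri may change the obfuscation at any time.

import base64
import re

import js2py  # third-party pure-Python JS interpreter: pip install js2py


def sucuri_cookie(html):
    """Return (name, value) of the cookie the challenge script would set."""
    # The challenge embeds its real code base64-encoded in the variable S.
    payload = re.search(r"S='([^']+)'", html).group(1)
    js_src = base64.b64decode(payload).decode('utf-8')
    # Decoded shape (as described above):
    #   s="..." + ...;document.cookie='s'+'u'+...+"=" + s + ';path=/;max-age=86400'; location.reload();
    value_expr, rest = js_src.split(';document.cookie=', 1)
    name_expr = rest.split('"="')[0].rstrip('+ ')
    # Both halves are plain string-concatenation expressions, so a JS
    # interpreter can evaluate them without a browser or a DOM.
    value = js2py.eval_js(value_expr)  # the assignment s=... evaluates to the value
    name = js2py.eval_js(name_expr)
    return name, value

The returned pair can then be sent back in the cookies of the next request, together with the HttpOnly cookie set by the first response.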

Answers

The site uses cookie- and User-Agent-based protection. You can check this yourself: open DevTools in Chrome, navigate to the target page http://www.dwarozh.net/sport/, right-click the page request in the Network tab and choose "Copy as cURL", then open a terminal and run the copied command:

$ curl 'http://www.dwarozh.net/sport/all-hawal.aspx?cor=3&Nawnishan=%D9%88%DB%95%D8%B1%D8%B2%D8%B4%DB%95%DA%A9%D8%A7%D9%86%DB%8C%20%D8%AF%DB%8C%DA%A9%DB%95' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4,es;q=0.2' -H 'Upgrade-Insecure-Requests: 1' -H 'X-Compress: null' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' -H 'Referer: http://www.dwarozh.net/sport/details.aspx?jimare=10505' -H 'Cookie: __cfduid=dc9867; sucuri_cloudproxy_uuid_ce28bca9c=d36ad9; ASP.NET_SessionId=wqdo0v; __atuvc=1%7C49; sucuri_cloudproxy_uuid_0d5c=6ab0; _gat=1; __asc=7c0b5; __auc=35; _ga=GA1.2.19688' -H 'Connection: keep-alive' --compressed 

You will see the normal HTML. If you remove the cookies or the User-Agent from the request, you get the challenge page instead.
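
The same check can be scripted with the requests library; the cookie name and value below are placeholders, paste your own from DevTools:

# Reproduce the curl check from Python (placeholder cookie value --
# substitute the ones copied from your own browser session).
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/54.0.2840.98 Safari/537.36'}
cookies = {'sucuri_cloudproxy_uuid_0d5c': '6ab0'}  # placeholder

r = requests.get('http://www.dwarozh.net/sport/',
                 headers=headers, cookies=cookies)
print('challenge page' if 'sucuri_cloudproxy' in r.text else 'normal page')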

Let's check it in the Scrapy shell:

$ scrapy shell
>>> from scrapy import Request
>>> cookie_str = '''here; your; cookies; from; browser; go;'''
>>> cookies = dict(pair.split('=', 1) for pair in cookie_str.split('; '))
>>> cookies  # check them
{'__auc': '999', '__cfduid': '796', '_gat': '1', '__atuvc': '1%7C49', 'sucuri_cloudproxy_uuid_0d5c97a96': '6ab007eb19', 'ASP.NET_SessionId': 'u9', '_ga': 'GA1.2.1968.148', '__asc': 'sfsdf', 'sucuri_cloudproxy_uuid_ce2sfsdfs': 'sdfsdf'}
>>> r = Request(url='http://www.dwarozh.net/sport/', cookies=cookies, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/56 (KHTML, like Gecko) Chrome/54. Safari/5'})
>>> fetch(r)
>>> response.xpath('//div[@class="news-more-img"]/ul/li')
[<Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10507">'>, <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10505">'>, <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10504">'>, <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10503">'>, <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10323">'>]

Great! Let's make a spider.

I've modified yours a bit, because I don't have the source for some of its parts.

from scrapy import Spider, Request
from scrapy.selector import Selector
import scrapy

# from Stack.items import StackItem
# from bs4 import BeautifulSoup
from scrapy import log
from scrapy.utils.response import open_in_browser


class StackSpider(Spider):
    name = "dwarozh"
    start_urls = [
        "http://www.dwarozh.net/sport/",
    ]
    _cookie_str = '''__cfduid=dc986; sucuri_cloudproxy_uuid_ce=d36a; ASP.NET_SessionId=wq; __atuvc=1%7C49; sucuri_cloudproxy_uuid_0d5c97a96=6a; _gat=1; __asc=7c0b; __auc=3; _ga=GA1.2.196.14'''
    _user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/5 (KHTML, like Gecko) Chrome/54 Safari/5'

    def start_requests(self):
        cookies = dict(pair.split('=', 1) for pair in self._cookie_str.split('; '))
        return [Request(url=url, cookies=cookies, headers={'User-Agent': self._user_agent})
                for url in self.start_urls]

    def parse(self, response):
        mItems = Selector(response).xpath('//div[@class="news-more-img"]/ul/li')
        for mItem in mItems:
            item = {}  # StackItem()
            item['title'] = mItem.xpath('a/h2/text()').extract_first()
            item['url'] = mItem.xpath('viewa/@href').extract_first()
            yield {'url': item['url'], 'title': item['title']}

Let's run it:

$ scrapy crawl dwarozh -o - -t csv --loglevel=DEBUG 
/Users/el/Projects/scrap_woman/.env/lib/python3.4/importlib/_bootstrap.py:321: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more. 
    return f(*args, **kwds) 
2016-12-10 00:18:55 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrap1) 
2016-12-10 00:18:55 [scrapy] INFO: Overridden settings: {'SPIDER_MODULES': ['scrap1.spiders'], 'FEED_FORMAT': 'csv', 'BOT_NAME': 'scrap1', 'FEED_URI': 'stdout:', 'NEWSPIDER_MODULE': 'scrap1.spiders', 'ROBOTSTXT_OBEY': True} 
2016-12-10 00:18:55 [scrapy] INFO: Enabled extensions: 
['scrapy.extensions.corestats.CoreStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.feedexport.FeedExporter', 
'scrapy.extensions.logstats.LogStats'] 
2016-12-10 00:18:55 [scrapy] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2016-12-10 00:18:55 [scrapy] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2016-12-10 00:18:55 [scrapy] INFO: Enabled item pipelines: 
[] 
2016-12-10 00:18:55 [scrapy] INFO: Spider opened 
2016-12-10 00:18:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-12-10 00:18:55 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024 
2016-12-10 00:18:55 [scrapy] DEBUG: Crawled (200) <GET http://www.dwarozh.net/robots.txt> (referer: None) 
2016-12-10 00:18:56 [scrapy] DEBUG: Crawled (200) <GET http://www.dwarozh.net/sport/> (referer: None) 
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/> 
{'url': None, 'title': '\nلیستی یاریزانانی ریاڵ مەدرید بۆ یاری سبەی ڕاگەیەنراو پێنج یاریزان دورخرانەوە'} 
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/> 
{'url': None, 'title': '\nهەواڵێکی ناخۆش بۆ هاندەرانی ریاڵ مەدرید'} 
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/> 
{'url': None, 'title': '\nگرنگترین مانشێتی ئەمرۆ هەینی رۆژنامەکانی ئیسپانیا'} 
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/> 
{'url': None, 'title': '\nبەفەرمی یۆفا پێكهاتەی نموونەی جەولەی شەشەم و کۆتایی چامپیۆنس لیگی بڵاو کردەوە'} 
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/> 
{'url': None, 'title': '\nكچە یاریزانێك دەبێتە هۆیی دروست بوونی تیپێكی تۆكمە'} 
2016-12-10 00:18:56 [scrapy] INFO: Closing spider (finished) 
2016-12-10 00:18:56 [scrapy] INFO: Stored csv feed (5 items) in: stdout: 
2016-12-10 00:18:56 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 950, 
'downloader/request_count': 2, 
'downloader/request_method_count/GET': 2, 
'downloader/response_bytes': 15121, 
'downloader/response_count': 2, 
'downloader/response_status_count/200': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 12, 9, 21, 18, 56, 271371), 
'item_scraped_count': 5, 
'log_count/DEBUG': 8, 
'log_count/INFO': 8, 
'response_received_count': 2, 
'scheduler/dequeued': 1, 
'scheduler/dequeued/memory': 1, 
'scheduler/enqueued': 1, 
'scheduler/enqueued/memory': 1, 
'start_time': datetime.datetime(2016, 12, 9, 21, 18, 55, 869851)} 
2016-12-10 00:18:56 [scrapy] INFO: Spider closed (finished) 
url,title 
," 
لیستی یاریزانانی ریاڵ مەدرید بۆ یاری سبەی ڕاگەیەنراو پێنج یاریزان دورخرانەوە" 
," 
هەواڵێکی ناخۆش بۆ هاندەرانی ریاڵ مەدرید" 
," 
گرنگترین مانشێتی ئەمرۆ هەینی رۆژنامەکانی ئیسپانیا" 
," 
بەفەرمی یۆفا پێكهاتەی نموونەی جەولەی شەشەم و کۆتایی چامپیۆنس لیگی بڵاو کردەوە" 
," 
كچە یاریزانێك دەبێتە هۆیی دروست بوونی تیپێكی تۆكمە" 

You will probably have to refresh the cookies from time to time. You can use PhantomJS for that (see the update below; a sketch that automates the refresh follows the steps).

UPDATE

How to get the cookies with PhantomJS:

  1. Install PhantomJS.

  2. Create a script dwarosh.js like this:

    var page = require('webpage').create();
    page.settings.userAgent = 'SpecialAgent';
    page.open('http://www.dwarozh.net/sport/', function(status) {
        console.log("Status: " + status);
        if (status === "success") {
            page.render('example.png');
            page.evaluate(function() {
                return document.title;
            });
        }
        for (var i = 0; i < page.cookies.length; i++) {
            var c = page.cookies[i];
            console.log(c.name, c.value);
        }
        phantom.exit();
    });
    
  3. Run the script:

    $ phantomjs --cookies-file=cookie.txt dwarosh.js 
        TypeError: undefined is not an object (evaluating 'activeElement.position().left') 
    
        http://www.dwarozh.net/sport/js/script.js:5 
        https://code.jquery.com/jquery-1.10.2.min.js:4 in c 
        https://code.jquery.com/jquery-1.10.2.min.js:4 in fireWith 
        https://code.jquery.com/jquery-1.10.2.min.js:4 in ready 
        https://code.jquery.com/jquery-1.10.2.min.js:4 in q 
    Status: success 
    __auc 250ab0a9158ee9e73eeeac78bba 
    __asc 250ab0a9158ee9e73eeeac78bba 
    _gat 1 
    _ga GA1.2.260482211.1481472111 
    ASP.NET_SessionId vs1utb1nyblqkxprxgazh0g2 
    sucuri_cloudproxy_uuid_3e07984e4 26e4ab3... 
    __cfduid d9059962a4c12e0f....1 
    
  4. Grab the sucuri_cloudproxy_uuid_3e07984e4 cookie and try to fetch the page with curl, using the same User-Agent:

    $ curl -v http://www.dwarozh.net/sport/ -b sucuri_cloudproxy_uuid_3e07984e4=26e4ab377efbf766d4be7eff20328465 -A SpecialAgent 
    * Trying 104.25.209.23... 
    * Connected to www.dwarozh.net (104.25.209.23) port 80 (#0) 
    > GET /sport/ HTTP/1.1 
    > Host: www.dwarozh.net 
    > User-Agent: SpecialAgent 
    > Accept: */* 
    > Cookie:  sucuri_cloudproxy_uuid_3e07984e4=26e4ab377efbf766d4be7eff20328465 
    > 
    < HTTP/1.1 200 OK 
    < Date: Sun, 11 Dec 2016 16:17:04 GMT 
    < Content-Type: text/html; charset=utf-8 
    < Transfer-Encoding: chunked 
    < Connection: keep-alive 
    < Set-Cookie: __cfduid=d1646515f5ba28212d4e4ca562e2966311481473024; expires=Mon, 11-Dec-17 16:17:04 GMT; path=/; domain=.dwarozh.net; HttpOnly 
    < Cache-Control: private 
    < Vary: Accept-Encoding 
    < Set-Cookie: ASP.NET_SessionId=srxyurlfpzxaxn1ufr0dvxc2; path=/; HttpOnly 
    < X-AspNet-Version: 4.0.30319 
    < X-XSS-Protection: 1; mode=block 
    < X-Frame-Options: SAMEORIGIN 
    < X-Content-Type-Options: nosniff 
    < X-Sucuri-ID: 15008 
    < Server: cloudflare-nginx 
    < CF-RAY: 30fa3ea1335237b0-ARN 
    < 
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
    <html xmlns="http://www.w3.org/1999/xhtml"> 
    <head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><title> 
    Dwarozh : Sport 
    </title><meta content="دواڕۆژ سپۆرت هەواڵی ناوخۆ،هەواڵی جیهانی، وەرزشەکانی دیکە" name="description"/><meta property="fb:app_id" content="1713056075578566"/><meta content="initial-scale=1.0, width=device-width, maximum-scale=1.0, user-scalable=no" name="viewport"/><link href="wene/favicon.ico" rel="shortcut icon" type="image/x-icon"/><link href="wene/style.css" rel="stylesheet" type="text/css"/> 
    <script src="js/jquery-2.1.1.js" type="text/javascript"></script> 
    <script src="https://code.jquery.com/jquery-1.10.2.min.js" type="text/javascript"></script> 
    <script src="js/script.js" type="text/javascript"></script> 
    <link href="css/styles.css" rel="stylesheet"/> 
    <script src="js/classie.js" type="text/javascript"></script> 
    <script type="text/javascript"> 
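
To automate the refresh, the PhantomJS run can be wrapped in Python and its stdout parsed for the cookie pairs. A minimal sketch, assuming the dwarosh.js script from step 2, which prints "name value" pairs after the "Status: ..." line exactly as in the transcript above:

# Harvest fresh Sucuri cookies from PhantomJS for the spider.
# Assumes dwarosh.js from step 2 and phantomjs on the PATH.
import subprocess


def fetch_sucuri_cookies(phantomjs='phantomjs', script='dwarosh.js'):
    out = subprocess.check_output(
        [phantomjs, '--cookies-file=cookie.txt', script],
        universal_newlines=True)
    lines = out.splitlines()
    # Cookie pairs are printed after the "Status: ..." line.
    start = next((i + 1 for i, line in enumerate(lines)
                  if line.startswith('Status:')), 0)
    return dict(line.split(' ', 1) for line in lines[start:] if ' ' in line)

The spider's start_requests() above could then build its cookies dict from fetch_sucuri_cookies() instead of parsing the hard-coded _cookie_str.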
    
I tried this: I first opened the site in Chrome (my OS is macOS), grabbed the cookies, and swapped them into your string: `_cookie_str = '''sucuri_cloudproxy_uuid_328445b41=5996855a2e6fb90c76a6f6c8666626cc; ASP.NET_SessionId=0ihhn0geh01tupszduatzhvq; __cfduid=d44ad0c2103e825786f4a99bbc2d099c51481196678; __asc=31cac9b4158e861268f99516a8b; __auc=d4f610c3158e038467b56753acb; _ga=GA1.2.714700409.1481230534; _gat=1'''` and `_user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/5 (KHTML, like Gecko) Chrome/54 Safari/5'`, but no result :( – zhilevan

Does cURL return the normal HTML page? –

The site requires two cookies. The first is set on the first request as HttpOnly; the browser also receives an obfuscated block of JavaScript that, after deciphering and eval-ing the code, sets the second cookie and reloads the page. With both cookies you can then open the page: 'document.cookie = "sucuri_cloudproxy_uuid_3be923a3e=da...; sucuri_cloudproxy_uuid_0d5c97a96=f43..."' –

A general solution for parsing dynamic content is to first obtain the rendered DOM/HTML with something that can run JavaScript (e.g. http://phantomjs.org/), then save that HTML and feed it to your parser.

That will also help you bypass some JS-based protections.

phantomjs is an executable that loads a URI like a real browser, evaluating all the JavaScript along the way. You can run it from Python via subprocess.call([phantomJsPath, jsProgramPath, url, htmlFileToSave]).

For the jsProgram, see for example https://github.com/ariya/phantomjs/blob/master/examples/rasterize.js

To save the HTML from the JS program, use fs.write(htmlFileToSave, page.content, "w");

I've tested this approach against dwarozh.net and it works, but you'll have to figure out how to plug it into your Scrapy pipeline; a minimal sketch of the Python side follows.
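
A rough sketch of that render-then-parse pipeline. Here save_html.js is a hypothetical PhantomJS script (adapted from rasterize.js with the fs.write call above) that loads its first argument and writes page.content to the file named by its second argument:

# Render a page with PhantomJS, then parse the saved HTML with Scrapy's
# Selector. save_html.js is a hypothetical script as described above.
import subprocess

from scrapy.selector import Selector


def render(url, html_file='page.html',
           phantomjs='phantomjs', script='save_html.js'):
    subprocess.call([phantomjs, script, url, html_file])
    with open(html_file, encoding='utf-8') as f:
        return f.read()


html = render('http://www.dwarozh.net/sport/')
for li in Selector(text=html).xpath('//div[@class="news-more-img"]/ul/li'):
    print(li.xpath('a/h2/text()').extract_first())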

Specifically for your example, you could try to parse the served JavaScript "manually" to obtain the cookie details needed to load the actual page (the sketch in the comment thread above does this). Be aware, though, that Sucuri may change its algorithm at any moment, and any solution based on cookie replay or JS decoding will then break.

Thanks for your attention. I also tried, as you suggested, https://github.com/scrapinghub/splash and https://github.com/scrapy-plugins/scrapy-splash, but they didn't work for me :(. Could you please provide a snippet or a Scrapy-based solution? **Note: I want to crawl and scrape several sites every hour**, so my approach must be fast. – zhilevan