2017-08-31
Scraping a JS-rendered page

I want to scrape this page with Scrapy and Splash. According to Chrome, it includes the following HTML:

<p class="title"> 

      Orange Paired 

     </p> 

This is my spider:

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "splash"
    allowed_domains = ["phillips.com"]
    start_urls = ["https://www.phillips.com/detail/BRIDGET-RILEY/UK010417/19"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint='render.json',
                args={'har': 1, 'html': 1}
            )

    def parse(self, response):
        print("1. PARSED", response.real_url, response.url)
        print("2. ", response.css("title").extract())
        print("3. ", response.data["har"]["log"]["pages"])
        print("4. ", response.headers.get('Content-Type'))
        print("5. ", response.xpath('//p[@class="title"]/text()').extract())

This is the output of scrapy runspider spiders/splash_spider.py:

2017-08-31 09:48:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
1. PARSED http://localhost:8050/render.json https://www.phillips.com/detail/BRIDGET-RILEY/UK010417/19 
2. ['<title>PHILLIPS : Bridget Riley, Orange Paired</title>', '<title>Page 1</title>'] 
3. [{'title': 'PHILLIPS : Bridget Riley, Orange Paired', 'pageTimings': {'onContentLoad': 3832, '_onStarted': 1, '_onIframesRendered': 4667, 'onLoad': 4664, '_onPrepareStart': 4664}, 'id': '1', 'startedDateTime': '2017-08-31T07:48:18.986240Z'}] 
4. b'text/html; charset=utf-8' 
5. [] 
2017-08-31 09:48:23 [scrapy.core.engine] INFO: Closing spider (finished) 

Why do I get an empty output for 5?

What I also don't understand is that Splash does not seem to render the page linked above; instead, it renders the top-level page.

Answers


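A good starting point is the FAQ section of the Splash documentation. It turns out that in your case you need to disable Private mode for Splash, either with the --disable-private-mode startup option when launching it through Docker, or by setting splash.private_mode_enabled = false in your Lua script.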

Once you disable private mode, the page renders correctly.
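
For the Lua-script route, a minimal sketch might look like the following. It talks to Splash's execute endpoint directly using only the standard library, so it can be tried without Scrapy. The Splash address (localhost:8050, taken from the log output in the question), the 2-second wait, and the helper names build_execute_request and fetch_rendered_html are assumptions made for illustration, not part of the original answer.

```python
import json
import urllib.request

# Assumption: Splash listens on localhost:8050, as in the question's log output.
SPLASH_EXECUTE_URL = "http://localhost:8050/execute"

# Lua script that turns off private mode before loading the page.
LUA_SCRIPT = """
function main(splash, args)
  splash.private_mode_enabled = false
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return {html = splash:html()}
end
"""


def build_execute_request(page_url, wait=2.0):
    """Build (but do not send) a POST request for Splash's execute endpoint.

    Hypothetical helper; the wait value is an arbitrary illustrative default.
    """
    payload = {"lua_source": LUA_SCRIPT, "url": page_url, "wait": wait}
    return urllib.request.Request(
        SPLASH_EXECUTE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_rendered_html(page_url, wait=2.0):
    """Send the request to a running Splash instance and return the HTML."""
    with urllib.request.urlopen(build_execute_request(page_url, wait)) as resp:
        return json.loads(resp.read().decode("utf-8"))["html"]
```

In the spider from the question, the same script could presumably be passed through scrapy-splash instead, by yielding SplashRequest(url, self.parse, endpoint='execute', args={'lua_source': LUA_SCRIPT, 'wait': 2.0}).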


I started Docker with '--disable-private-mode' and it works fine. Thank you very much – zinyosrim