2016-01-22

I want to scrape the ads on a website... How can I scrape a widget's output on a website using Python/Scrapy?

This website, for example:

http://www.bestyling.com/15-of-the-most-expensive-shoes-ever-and-you-wont-believe-whats-1/?utm_source=Ourbrain&utm_medium=cpc&utm_campaign=15%20Shoes%20-%20Desktop%20USA

I am trying to get the ads from this:

/html/body[@class='single single-post postid-171 single-format-standard custom-background hasGoogleVoiceExt']/div[@id='site']/div[@id='site-out']/div[@id='site-fixed']/div[@id='content-out']/div[@id='content-in']/div[@id='main-content-wrap']/div[@id='main-content-contain']/div[@id='content-wrap']/div[@class='sec-marg-out4 relative']/div[@class='sec-marg-in4']/article[@class='post-171 post-type-post status-publish format-standard hentry category-uncategorized']/div[@id='post-area']/div[@class='post-body-out']/div[@class='post-body-in']/div[@id='content-area']/div[@class='content-area-cont left relative']/div[@class='sec-marg-out relative']/div[@class='sec-marg-in']/div[@class='content-area-out']/div[@class='content-area-in']/div[@class='content-main left relative']/div[@id='article-ad']/div[1]/div[@id='ac_110238']/div[@class='ac_adbox']/div[@class='ac_adbox_inner']

'ac_container' or 'ac_adbox'

When I open the page in a browser I see the ads, but when I use Scrapy to fetch the HTML, all I get in their place is this script:

<div id="contentad110238"></div>
<script type="text/javascript">
    (function(d) {
        var params =
        {
            id: "d12cd6f3-b896-443b-9140-07e35e66e222",
            d: "YmVzdHlsaW5nLmNvbQ==",
            wid: "110238",
            cb: (new Date()).getTime()
        };

        var qs=[];
        for(var key in params) qs.push(key+'='+encodeURIComponent(params[key]));
        var s = d.createElement('script');s.type='text/javascript';s.async=true;
        var p = 'https:' == document.location.protocol ? 'https' : 'http';
        s.src = p + "://api.content.ad/Scripts/widget2.aspx?" + qs.join('&');
        d.getElementById("contentad110238").appendChild(s);
    })(document);
</script>
</div>

How can I scrape this? Any help would be greatly appreciated... I'm guessing I have to use a JS renderer with Python or Scrapy. Any suggestions?

Answer


These ads are fetched via JavaScript, so when you download the raw HTML (as Scrapy does), you won't see them.
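To make that concrete: the inline script in the question does nothing but build a URL and append a second script tag pointing at it, and that second script is what fills in the ads. The sketch below rebuilds the same URL in Python using only the values shown in the question. The helper name `build_widget_url` is made up for illustration, and whether the endpoint answers a plain request made outside a browser is an assumption that has not been checked.

```python
import base64
import time
from urllib.parse import urlencode

# Values copied from the inline script in the question.
WIDGET_ENDPOINT = "http://api.content.ad/Scripts/widget2.aspx"
PARAMS = {
    "id": "d12cd6f3-b896-443b-9140-07e35e66e222",
    "d": "YmVzdHlsaW5nLmNvbQ==",
    "wid": "110238",
}


def build_widget_url(params=PARAMS, endpoint=WIDGET_ENDPOINT):
    """Rebuild the script URL that the inline JavaScript appends to the page.

    Hypothetical helper: it only mirrors what the script in the question does.
    """
    query = dict(params)
    # The script adds a cache-busting timestamp in milliseconds.
    query["cb"] = int(time.time() * 1000)
    return endpoint + "?" + urlencode(query)


if __name__ == "__main__":
    # The "d" parameter is the site's domain, base64-encoded.
    print(base64.b64decode(PARAMS["d"]).decode("ascii"))  # bestyling.com
    print(build_widget_url())
```

The response to that URL would be JavaScript rather than HTML, so it still has to be executed or parsed before the ad markup appears, which is why a renderer is the more general route.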

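However, you can take a look at Splash, a JavaScript rendering service, and its Scrapy integration (scrapy-splash, formerly ScrapyJS), which plugs a JavaScript-capable browser into your crawl seamlessly. It comes straight from the Scrapy developers.

Everything is in Python, except for Qt, which does the browser rendering.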
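As a rough sketch of what rendering through Splash looks like, the snippet below calls Splash's `render.html` HTTP endpoint using only the standard library. It assumes a Splash instance is already running locally on its default port 8050. The function names and the 2-second `wait` are illustrative choices, not values from the question, and the widget may need a longer wait to finish loading.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Splash instance is listening locally on its default port.
SPLASH_BASE = "http://localhost:8050"

# Page from the question, without the tracking query string.
PAGE_URL = (
    "http://www.bestyling.com/"
    "15-of-the-most-expensive-shoes-ever-and-you-wont-believe-whats-1/"
)


def splash_render_url(page_url, wait=2.0, splash_base=SPLASH_BASE):
    """Build a request URL for Splash's render.html endpoint."""
    # "wait" is how many seconds Splash lets the page's scripts run
    # after loading, before it returns the HTML.
    args = {"url": page_url, "wait": wait}
    return splash_base + "/render.html?" + urlencode(args)


def fetch_rendered_html(page_url, wait=2.0):
    """Return the page's HTML after Splash has executed its JavaScript."""
    with urlopen(splash_render_url(page_url, wait), timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    html = fetch_rendered_html(PAGE_URL)
    # If rendering worked, the ad container class from the question's
    # XPath should now be present in the markup.
    print("ac_adbox" in html)
```

Inside a Scrapy spider, the scrapy-splash package passes the same arguments through `SplashRequest(url, callback, args={"wait": 2.0})`, and the callback can then query the rendered response with a short selector such as `//div[@class="ac_adbox_inner"]` instead of the full absolute path.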


Am I correct in my assumption, then? I will have to render the page, i.e., that is why the ads are not showing up? – user3707960