I'm trying to scrape ads from a website... How do I scrape the output of a widget on a website using Python/Scrapy?
The website is, for example
I'm trying to get the ads from this XPath:
/html/body[@class='single single-post postid-171 single-format-standard custom-background hasGoogleVoiceExt']/div[@id='site']/div[@id='site-out']/div[@id='site-fixed']/div[@id='content-out']/div[@id='content-in']/div[@id='main-content-wrap']/div[@id='main-content-contain']/div[@id='content-wrap']/div[@class='sec-marg-out4 relative']/div[@class='sec-marg-in4']/article[@class='post-171 post-type-post status-publish format-standard hentry category-uncategorized']/div[@id='post-area']/div[@class='post-body-out']/div[@class='post-body-in']/div[@id='content-area']/div[@class='content-area-cont left relative']/div[@class='sec-marg-out relative']/div[@class='sec-marg-in']/div[@class='content-area-out']/div[@class='content-area-in']/div[@class='content-main left relative']/div[@id='article-ad']/div[1]/div[@id='ac_110238']/div[@class='ac_adbox']/div[@class='ac_adbox_inner']
or from 'ac_container' or 'AC-adbox'.
When I open the page in a browser I see the ads, but when I use Scrapy to fetch the HTML, I only get the widget's script:
<div id="contentad110238"></div>
<script type="text/javascript">
(function(d) {
var params =
{
id: "d12cd6f3-b896-443b-9140-07e35e66e222",
d: "YmVzdHlsaW5nLmNvbQ==",
wid: "110238",
cb: (new Date()).getTime()
};
var qs=[];
for(var key in params) qs.push(key+'='+encodeURIComponent(params[key]));
var s = d.createElement('script');s.type='text/javascript';s.async=true;
var p = 'https:' == document.location.protocol ? 'https' : 'http';
s.src = p + "://api.content.ad/Scripts/widget2.aspx?" + qs.join('&');
d.getElementById("contentad110238").appendChild(s);
})(document);
</script> </div>
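For reference, the script above does nothing more than build a URL from `params` and load it as a second script, so the same URL can be assembled in Python without a browser. This is only a minimal sketch: `WIDGET_PARAMS` and `build_widget_url` are hypothetical names introduced here, and it is an untested assumption that the endpoint answers the same way outside a browser or that its response contains the ad markup directly. The `d` value is just the Base64 encoding of the site's domain, which the sketch decodes for inspection.

```python
import base64
import time
from urllib.parse import urlencode

# Values copied from the inline script in the page source.
WIDGET_PARAMS = {
    "id": "d12cd6f3-b896-443b-9140-07e35e66e222",
    "d": "YmVzdHlsaW5nLmNvbQ==",
    "wid": "110238",
}


def build_widget_url(params, scheme="https"):
    """Rebuild the URL the inline script assigns to s.src (hypothetical helper)."""
    query = dict(params)
    # Same cache-buster as (new Date()).getTime(): milliseconds since the epoch.
    query["cb"] = int(time.time() * 1000)
    # urlencode percent-escapes each value, much like encodeURIComponent does
    # for the characters that appear in these parameters.
    return f"{scheme}://api.content.ad/Scripts/widget2.aspx?" + urlencode(query)


if __name__ == "__main__":
    # Decode the Base64 "d" parameter to see which domain it names.
    print(base64.b64decode(WIDGET_PARAMS["d"]).decode("ascii"))
    print(build_widget_url(WIDGET_PARAMS))
```

The resulting URL could be requested from a Scrapy callback like any other page. If the response turns out to be more JavaScript that writes the ad HTML, a JS renderer would still be needed.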
How do I scrape this? Any help would be greatly appreciated... I'm guessing I have to use a JS renderer with Python or Scrapy.... Suggestions?
Am I correct in my assumption? I will have to render the JavaScript, i.e. that is why the ad isn't showing up? – user3707960