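How can I optimize PhantomJS so that search engines can index a single-page application?
I have been looking for a headless web browser that can run on a server, so that web crawlers can index a single-page application. First I tried HtmlUnit and Selenium (HtmlUnitDriver), but both seem to have problems with XHR requests.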
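I found that PhantomJS performs better and seems more mature. PhantomJS has an internal web server, so I decided to use it behind my reverse proxy. However, when I ran a benchmark, PhantomJS kept one CPU core at 100% and the average page load time was about 4 seconds. The reason is that I have to wait for the browser to load all resources before I can get a correct result. Here is my PhantomJS script: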
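    var page = require('webpage');
    var system = require('system');
    var server = require('webserver').create();

    // Assumption: the listening port is passed as the first command-line argument
    // (the original snippet used `port` without defining it).
    var port = parseInt(system.args[1], 10);

    // credit: http://backbonetutorials.com/seo-for-single-page-apps/
    var service = server.listen(port, { 'keepAlive': true }, function (z, response) {
        var request = page.create();
        var lastReceived = new Date().getTime();
        var requestCount = 0;
        var responseCount = 0;
        var requestIds = [];
        var startTime = new Date().getTime();

        // Count a resource as finished the first time a response for its id arrives.
        request.onResourceReceived = function (response) {
            var index = requestIds.indexOf(response.id);
            if (index !== -1) {
                lastReceived = new Date().getTime();
                responseCount++;
                requestIds[index] = null;
            }
        };

        // Track each outgoing resource request once.
        request.onResourceRequested = function (request) {
            if (requestIds.indexOf(request.id) === -1) {
                requestIds.push(request.id);
                requestCount++;
            }
        };

        // Set options one by one so the other default settings are preserved.
        request.settings.loadImages = false;
        request.settings.javascriptEnabled = true;
        request.settings.loadPlugins = false;

        // The open() callback only receives the status, so log the requested URL.
        request.open(z.url, function (status) {
            if (status !== 'success') {
                console.log('FAIL to load the address ' + z.url);
            }
        });

        var checkComplete = function () {
            var now = new Date().getTime();
            // Done when the network has been quiet for 300 ms with no pending
            // requests, or when the overall 5 s timeout has expired.
            if ((now - lastReceived > 300 && requestCount === responseCount) || now - startTime > 5000) {
                clearInterval(checkCompleteInterval);
                var content = request.content;
                response.statusCode = 200;
                response.headers = {
                    'Cache-Control': 'no-cache',
                    'Content-Type': 'text/html; charset=UTF-8',
                    'Connection': 'Keep-Alive',
                    'Keep-Alive': 'timeout=5, max=100',
                    // Content-Length is in bytes, not characters, so measure the UTF-8 length.
                    'Content-Length': unescape(encodeURIComponent(content)).length
                };
                response.write(content);
                response.close();
                // Log before releasing the page, while request.url is still valid.
                console.log(request.url + ' -> ' + (now - startTime));
                request.release();
            }
        };
        // Poll the completion condition every 3 ms.
        var checkCompleteInterval = setInterval(checkComplete, 3);
    });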
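Is there anything I can do to speed up this script? Should I just run PhantomJS through its shell command to get better performance, or are there any alternatives to these browsers?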
Hmm, I hadn't noticed that you already set loadImages to false, so my first suggestion isn't needed.
Thanks for the NetworkRequest#abort tip. The Facebook API and the analytics API aren't needed in this case, and the script now seems faster.
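For anyone reading later, the NetworkRequest#abort approach mentioned in this comment might look roughly like the sketch below. It is not the code actually used in the thread. The host list and the names `BLOCKED_HOSTS`, `shouldBlock`, and `makeResourceFilter` are made up for illustration, and it assumes a PhantomJS version in which `onResourceRequested` receives a second `networkRequest` argument (1.9 or later, as far as I know).

```javascript
// Hypothetical block list; replace it with the hosts your pages can live without.
var BLOCKED_HOSTS = ['connect.facebook.net', 'google-analytics.com'];

// Return true when the URL's host equals a blocked host or is a subdomain of one.
function shouldBlock(url) {
    var match = /^https?:\/\/([^\/:?#]+)/i.exec(url);
    if (!match) {
        return false; // not an http(s) URL, e.g. data: or about:blank
    }
    var host = match[1].toLowerCase();
    return BLOCKED_HOSTS.some(function (blocked) {
        return host === blocked ||
            host.slice(-(blocked.length + 1)) === '.' + blocked;
    });
}

// Build an onResourceRequested handler that aborts blocked requests and
// passes every other request on to the optional onAllowed callback.
function makeResourceFilter(onAllowed) {
    return function (requestData, networkRequest) {
        if (shouldBlock(requestData.url)) {
            networkRequest.abort();
            return; // aborted requests never reach the counting logic
        }
        if (onAllowed) {
            onAllowed(requestData, networkRequest);
        }
    };
}
```

In the script from the question, the existing counting function would be passed in as `onAllowed`, i.e. `request.onResourceRequested = makeResourceFilter(function (req) { ... })`. That way aborted requests are never added to `requestIds`, so they should not hold up the `requestCount === responseCount` check.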