Retrieving JavaScript-rendered HTML with Puppeteer

I am trying to scrape the HTML from this NCBI.gov page. I need to include the #see-all URL fragment so that I am guaranteed to get the search page, rather than the HTML of the wrong gene page, https://www.ncbi.nlm.nih.gov/gene/119016.

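URL fragments are not passed to the server. Instead, the page's client-side JavaScript uses them to (in this case) build entirely different HTML. That is the HTML you see when you open the page in a browser and choose "View page source", and it is the HTML I want to retrieve. See also: R readLines() ignores url tags followed by #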
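To make the point about fragments concrete, here is a minimal sketch using the `URL` class built into Node.js. The helper name `splitFragment` is made up for illustration; it only separates the part of the URL that goes into the HTTP request from the part that stays in the browser.

```javascript
// Split a URL into the part sent in the HTTP request and the fragment
// that never leaves the browser. Uses the WHATWG URL class built into Node.js.
function splitFragment(href) {
  const url = new URL(href);
  return {
    requestTarget: url.pathname + url.search, // what the server sees
    fragment: url.hash,                       // only visible to client-side code
  };
}

const parts = splitFragment('https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');
console.log(parts.requestTarget); // /gene/?term=AGAP8
console.log(parts.fragment);      // #see-all
```

This is why a plain HTTP fetch cannot distinguish the two pages: the request is identical with or without `#see-all`, and only a browser that runs the page's scripts can act on the fragment.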

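I first tried phantomJS, but it only returned the error described here, ReferenceError: Can't find variable: Map. This appears to be because phantomJS does not support some feature that NCBI uses, which rules out that route.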

I have had more success with Puppeteer, running the following JavaScript with Node.js:

const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');
    const HTML = await page.content();
    // Write the retrieved HTML to disk.
    const ws = fs.createWriteStream('TempInterfaceWithChrome.js');
    ws.write(HTML);
    ws.end();
    // Create an empty marker file to signal that the run has finished.
    const ws2 = fs.createWriteStream('finishedFlag');
    ws2.end();
    await browser.close();
})();

However, this returns what appears to be the pre-rendered HTML. How do I (programmatically) get the final HTML that I see in the browser?

Source

2017-08-24 Sir_Zorg

Maybe try waiting:

await page.waitForNavigation(5);

and then:

let html = await page.content();

Source

2017-08-26 04:39:46

You can try changing this:

await page.goto(
    'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');

to this:

await page.goto(
    'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all', {waitUntil: 'networkidle'});
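The `waitUntil` suggestion can be wrapped in a small helper, sketched below. Two assumptions are worth flagging: the name `fetchRenderedHtml` is made up for illustration, and the option value is written as `networkidle0` because later Puppeteer releases spell it `networkidle0` or `networkidle2` rather than `networkidle`. The helper takes an already-created `page` so it does not depend on how the browser was launched.

```javascript
// Hypothetical helper: load `url` and return the serialized DOM once the
// network has gone quiet. `page` is expected to be a Puppeteer Page.
async function fetchRenderedHtml(page, url) {
  // 'networkidle0' treats navigation as finished once there have been
  // no network connections for at least 500 ms.
  await page.goto(url, { waitUntil: 'networkidle0' });
  // page.content() serializes the current DOM, including script-made changes.
  return page.content();
}
```

With a real browser this would be called as `const html = await fetchRenderedHtml(page, 'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');`. A quiet network does not strictly guarantee that client-side rendering has finished, so waiting for a specific element may still be needed.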

Alternatively, you can create a function listenFor() that listens for a custom event on page load:

function listenFor(type) {
    // Runs inside the page before its own scripts and forwards the event
    // to window.onCustomEvent.
    return page.evaluateOnNewDocument(type => {
        document.addEventListener(type, e => {
            window.onCustomEvent({type, detail: e.detail});
        });
    }, type);
}

await listenFor('custom-event-ready'); // Listen for "custom-event-ready" custom event on page load.
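The snippet above calls `window.onCustomEvent`, which does not exist in the page unless something defines it. One way to do that is `page.exposeFunction`. The sketch below is an assumption about how the pieces could fit together, not the answerer's code: `listenForOnce` is a hypothetical name, and whether the NCBI page dispatches any such custom event is unknown.

```javascript
// Hypothetical helper: registers a listener for the DOM event `type` and
// returns { fired }, a promise that resolves with the event payload.
// Call it before page.goto() so the listener exists when the page loads.
async function listenForOnce(page, type) {
  let resolveEvent;
  const fired = new Promise(resolve => { resolveEvent = resolve; });

  // Define window.onCustomEvent in the page; calls are forwarded to Node.js.
  await page.exposeFunction('onCustomEvent', payload => resolveEvent(payload));

  // Attach the DOM listener before any of the page's own scripts run.
  await page.evaluateOnNewDocument(eventType => {
    document.addEventListener(eventType, e => {
      window.onCustomEvent({ type: eventType, detail: e.detail });
    });
  }, type);

  // Wrapped in an object so that awaiting this function does not also
  // wait for the event itself.
  return { fired };
}
```

A possible call sequence is `const { fired } = await listenForOnce(page, 'custom-event-ready');`, then `await page.goto(url);`, then `await fired;` before reading `page.content()`.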

Later edit:

This might also come in handy:

await page.waitForSelector('h3'); // replace h3 with your selector
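Put together with navigation, the selector approach might look like the sketch below. The helper name `htmlOnceSelectorAppears` and the default timeout are assumptions for illustration, and the right selector depends on an element that only exists in the fully rendered search page.

```javascript
// Hypothetical helper: load `url`, wait until an element matching `selector`
// is present in the DOM, then return the serialized HTML.
// `page` is expected to be a Puppeteer Page.
async function htmlOnceSelectorAppears(page, url, selector, timeoutMs = 30000) {
  await page.goto(url);
  // Rejects with a timeout error if the element never shows up.
  await page.waitForSelector(selector, { timeout: timeoutMs });
  return page.content();
}
```

Compared with waiting for the network to go quiet, this ties the wait to the content that is actually needed, at the cost of having to know a reliable selector in advance.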

Source

2017-08-29 14:37:46
