Problem scraping a website with BS4

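Normally I can write a scraping script without trouble, but I'm stuck trying to scrape a table from this website for a research project I'm working on. I plan to verify that the script works for one state before entering the URLs of my target states.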

import requests
import bs4 as bs

url = "http://programs.dsireusa.org/system/program/detail/284"
dsire_get = requests.get(url)
soup = bs.BeautifulSoup(dsire_get.text, 'lxml')
table = soup.find_all('div', {'data-ng-controller': 'DetailsPageCtrl'})
print(table)
# I'm printing "table" just to check that the information I'm looking for is within this section

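I don't know whether the site is trying to block people from scraping, but all the information I'm trying to grab sits inside "&quot;" if you look at the printed table output.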
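For context, "&quot;" is the HTML entity for a double quote, so output full of "&quot;" often means the data is sitting in an escaped string, for example JSON stored in an attribute. The sketch below assumes that is the case; the payload and the helper name decode_escaped_json are invented for illustration and are not taken from the site.

```python
import html
import json

# Hypothetical escaped payload: the keys and values are made up for illustration.
raw_attribute = "{&quot;name&quot;: &quot;Net Metering&quot;, &quot;state&quot;: &quot;Example&quot;}"


def decode_escaped_json(escaped: str) -> dict:
    """Turn HTML-escaped JSON text into a Python dict."""
    unescaped = html.unescape(escaped)  # &quot; becomes "
    return json.loads(unescaped)


program = decode_escaped_json(raw_attribute)
print(program["name"])
```

If the escaped text turns out not to be JSON, json.loads raises a ValueError, which is a quick way to tell that the page has to be rendered instead.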

Source

2017-07-06 vlepore

Have you tried 'html.parser' instead of 'lxml'? – martinB0103

Which part of the page do you want? The section headed "Program Overview"? The one headed "Authorities"? Or something else? –

@BillBell I'm looking for "Program Overview". – vlepore

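I finally managed to solve this. The code below worked for me to get data from the JavaScript-rendered page, in case anyone else runs into the same problem scraping a JavaScript page with Python on Windows (where dryscrape is not compatible).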

import bs4 as bs
from selenium import webdriver

# Let a real browser load the page so its JavaScript runs
browser = webdriver.Chrome()
url = "http://programs.dsireusa.org/system/program/detail/284"
browser.get(url)
html_source = browser.page_source
browser.quit()

# Parse the rendered HTML and collect the text of each bound element
soup = bs.BeautifulSoup(html_source, "html.parser")
table = soup.find('div', {'class': 'programOverview'})
data = []
for n in table.find_all("div", {"class": "ng-binding"}):
    data.append(str(n.text))
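The loop above keeps each element's text exactly as it appears in the page, including surrounding whitespace and empty entries. A minimal sketch of tidying that up, run against a small stand-in for the rendered source: the class names come from the code above, while the contents of the snippet and the function name overview_values are assumptions for illustration.

```python
import bs4 as bs

# Hypothetical, simplified stand-in for browser.page_source.
html_source = """
<div class="programOverview">
  <div class="ng-binding">  Implementing Sector: </div>
  <div class="ng-binding">State</div>
  <div class="ng-binding">   </div>
</div>
"""


def overview_values(html_source):
    """Return the non-empty, whitespace-trimmed texts from the overview block."""
    soup = bs.BeautifulSoup(html_source, "html.parser")
    table = soup.find("div", {"class": "programOverview"})
    if table is None:
        # The block is missing, e.g. because the page was not rendered
        return []
    values = []
    for node in table.find_all("div", {"class": "ng-binding"}):
        text = node.get_text(strip=True)
        if text:  # skip elements that contain only whitespace
            values.append(text)
    return values


print(overview_values(html_source))
```

Returning an empty list when the block is absent avoids the AttributeError that calling find_all on None would raise.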

Source

2017-07-07 17:16:29 vlepore

The text is rendered with JavaScript. First render the page with dryscrape. (If you don't want to use dryscrape, see "Web-scraping JavaScript page with Python".)

Once the page has been rendered, the text can be extracted from a different location, namely the place on the page it was rendered into.

As an example, this code extracts the HTML of the summary.

import bs4 as bs
import dryscrape

url = "http://programs.dsireusa.org/system/program/detail/284"
# Render the page, including its JavaScript, in a dryscrape session
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()
# Parse the rendered HTML and pick out the summary block
soup = bs.BeautifulSoup(dsire_get, 'html.parser')
table = soup.find_all('div', {'class': 'programSummary ng-binding'})
print(table[0])

Output:

<div class="programSummary ng-binding" data-ng-bind-html="program.summary"><p> 
<strong>Eligibility and Availability</strong></p> 
<p> 
Net metering is available to all "qualifying facilities" (QFs), as defined by the federal <i>Public Utility Regulatory Policies Act of 1978</i> (PURPA), which pertains to renewable energy systems and combined heat and power systems up to 80 megawatts (MW) in capacity. There is no statewide cap on the aggregate capacity of net-metered systems.</p> 
<p> 
All utilities subject to Public ...

Source

2017-07-06 17:22:57

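Although this looks like it would work, dryscrape does not officially support Windows, so I can't use it. I'll follow the approach in the post you referenced for when you aren't using dryscrape. – vlepore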

That's why I included the link. Whether you use dryscrape, Selenium, PyQt or something else, the approach is the same. –
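One way to picture "the approach is the same" is to separate rendering from parsing, so the parsing code does not care which tool produced the HTML. This is only a sketch: extract_summary and fake_render are hypothetical names, and the fake renderer returns a shortened stand-in for the rendered page so the example runs without a browser.

```python
from typing import Callable

import bs4 as bs


def extract_summary(render: Callable[[str], str], url: str) -> str:
    """Render the page with any backend, then pull out the summary text."""
    html_source = render(url)
    soup = bs.BeautifulSoup(html_source, "html.parser")
    node = soup.find("div", {"class": "programSummary"})
    if node is None:
        return ""
    return node.get_text(" ", strip=True)


def fake_render(url: str) -> str:
    # Stand-in for a dryscrape, Selenium or PyQt renderer (assumption for the demo)
    return (
        '<div class="programSummary ng-binding">'
        "<p><strong>Eligibility and Availability</strong></p>"
        "<p>Net metering is available.</p>"
        "</div>"
    )


url = "http://programs.dsireusa.org/system/program/detail/284"
print(extract_summary(fake_render, url))
```

In real use, render would be a small function wrapping whichever tool is available, for example one that opens the URL in Selenium and returns browser.page_source.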
