R: change the value of an HTML form and scrape web data

I want to scrape historical weather data from this page: http://www.weather.gov.sg/climate-historical-daily.

I am using the code given in this question: "Using r to navigate and scrape a webpage with drop down html forms".

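However, I cannot get the data, probably because the page structure has changed. In the code from the link above, pgform <- html_form(pgsession)[[3]] was used to change the values of the form. In my case, I cannot find a similar form.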

url <- "http://www.weather.gov.sg/climate-historical-daily" 
pgsession <- html_session(url) 
pgsource <- read_html(url) 
pgform <- html_form(pgsession)

The result in my case:

> pgform 
[[1]] 
<form> 'searchform' (GET http://www.weather.gov.sg/) 
<button submit> '<unnamed> 
<input text> 's':

Source

2017-04-26 cutepanda

That only gets the search box, not the actual controls, which are not in a ...

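Thanks, I agree that the page has download links. But I need the last 3 years of data for all the stations listed in the drop-down. I think that if I can figure out this part, I can write a loop to fetch the data. – cutepanda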

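Since the page has a CSV download button and the links it provides follow a pattern, you can generate the set of URLs and download them. You will need a set of station IDs, which you can scrape from the dropdown itself: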

library(rvest) 

page <- 'http://www.weather.gov.sg/climate-historical-daily' %>% read_html() 

station_id <- page %>% html_nodes('button#cityname + ul a') %>% 
    html_attr('onclick') %>% # If you need names, grab the `href` attribute, too. 
    sub(".*'(.*)'.*", '\\1', .)
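To see what the sub() call is doing, here is a small offline sketch. The onclick strings below are hypothetical stand-ins (the function name and IDs are assumptions); the real attribute values on the page may look different.

```r
# Hypothetical onclick values; the real attribute text on the page may differ.
onclick <- c("setCityName('S24')", "setCityName('S104')")

# The greedy pattern keeps only the text between the last pair of single quotes
# and discards everything around it.
station_id <- sub(".*'(.*)'.*", "\\1", onclick)

print(station_id)
```

If the scraped station_id vector comes back containing whole onclick strings instead of short IDs, the attribute format has probably changed and the pattern needs adjusting.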

which can then be put into expand.grid along with months and years to generate all the necessary combinations:

# Name the first argument so the column is called `station_id` (not `Var1`) 
df <- expand.grid(station_id = station_id, 
        month = sprintf('%02d', 1:12), 
        year = 2014:2016)

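(Note that if you want 2017 data, you will need to build those rows separately and rbind them, so that you do not generate months that have not happened yet.)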
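A minimal sketch of that separate build, assuming two made-up station IDs and that only January to March 2017 are complete (both are placeholders to adjust):

```r
# Hypothetical station IDs standing in for the scraped vector.
station_id <- c('S24', 'S104')

# Full years: every month of 2014 to 2016.
past <- expand.grid(station_id = station_id,
                    month = sprintf('%02d', 1:12),
                    year = 2014:2016,
                    stringsAsFactors = FALSE)

# Partial year: only the months assumed to be complete so far.
current <- expand.grid(station_id = station_id,
                       month = sprintf('%02d', 1:3),
                       year = 2017L,
                       stringsAsFactors = FALSE)

# Stack the two sets of combinations into one data frame.
df <- rbind(past, current)
```

Setting stringsAsFactors = FALSE keeps the ID and month columns as plain character vectors, which avoids mismatched factor levels when the two pieces are combined.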

The combinations can then be pasted into URLs with paste0:

urls <- paste0('http://www.weather.gov.sg/files/dailydata/DAILYDATA_', 
       df$station_id, '_', df$year, df$month, '.csv')

which you can then lapply over to download all the files:

# Warning! This will download a lot of files! Make sure you're in a clean directory.  
lapply(urls, function(url){download.file(url, basename(url), method = 'curl')})
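Some station and month combinations may have no file, in which case download.file stops with an error and the lapply call above aborts partway. One possible variation, sketched here with a hypothetical helper name (download_all) and an assumed one-second pause between requests, skips failures and reports which URLs worked:

```r
# Download each URL into `dest_dir`, skipping any that fail.
# Returns a logical vector: TRUE where the download succeeded.
# `download_all` and its arguments are illustrative, not part of any package.
download_all <- function(urls, dest_dir = '.', pause = 1) {
  vapply(urls, function(url) {
    dest <- file.path(dest_dir, basename(url))
    # download.file returns 0 on success; an error becomes FALSE.
    ok <- tryCatch(download.file(url, dest, quiet = TRUE) == 0,
                   error = function(e) FALSE)
    Sys.sleep(pause)  # wait between requests to go easy on the server
    ok
  }, logical(1), USE.NAMES = FALSE)
}
```

Usage would look like ok <- download_all(urls), after which urls[!ok] lists the combinations that could not be fetched.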

Source

2017-04-26 07:32:52 alistaire
