2015-08-24 76 views

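rvest: get hyperlinks in a table

I want to scrape the geocodes out of the hyperlinks and combine all of the tables, together with their geocodes, into a single table.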

Here is what I have done so far, using the code below:

library(rvest) 

url<-"http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html" 

citidata <- read_html(url)  # read_html() replaces the deprecated html()
ta <- citidata %>%
    html_nodes("table") %>%
    .[1:29] %>%
    html_table()

dat<-do.call(rbind, lapply(ta, data.frame, stringsAsFactors=FALSE)) 

citystate <- citidata %>% 
html_node("h1 span") %>% 
html_text() 

citystate <- gsub("Fatal car crashes and road traffic accidents in ", 
        "", citystate) 

loc<-data.frame(matrix(unlist(strsplit(citystate, ",", fixed = TRUE)), ncol=2, byrow=TRUE)) 
dat$City<-loc$X1 
dat$State<-loc$X2 

This gives me a table like this:

Date,Location,Vehicles,Drunken.persons,Fatalites,Persons,Pedestrians,City,State 
1 Jun 26, 2013 87:99 PM, Temple Street, 1, -, 1, 1, -, Nashua, New Hampshire 

I then tried to add the geocodes to the data frame, but I don't know how to do it.

Below is the code that scrapes the geocodes from the hyperlinks.

pg <- read_html("http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html")
# Take the href of links 31 to 60 and strip the JavaScript function name
geo <- data.frame(gsub("javascript:showGoogleSView", "",
    pg %>% html_nodes("a") %>% html_attr("href") %>% .[31:60]))
One problem (initially) is that `dat` has 98 rows and `geo` has 30 – hrbrmstr

Yes, not all of the data comes with a geolocation. – Jen

Answer

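Not all incidents have an associated lat/lon pair. The code below relies on the fact that the incident date/time is (apparently) unique, and uses it as the key to merge the coordinates into the main `dat` data frame you built earlier: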

library(rvest) 
library(stringr) 
library(dplyr) 

url <- "http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html" 

# Get all incident tables ------------------------------------------------- 

citidata <- read_html(url)  # read_html() replaces the deprecated html()

ta <- citidata %>% 
    html_nodes("table") %>% 
    .[1:29] %>% 
    html_table() 

# rbind them together ----------------------------------------------------- 

dat <- do.call(rbind, lapply(ta, data.frame, stringsAsFactors=FALSE)) 

citystate <- citidata %>% 
    html_node("h1 span") %>% 
    html_text() 

# Get city/state and add it to the data.frame ------------------------------- 

citystate <- gsub("Fatal car crashes and road traffic accidents in ", 
        "", citystate) 

loc <- data.frame(matrix(unlist(strsplit(citystate, ",", fixed=TRUE)), 
         ncol=2, byrow=TRUE)) 

dat$City <- loc$X1 
dat$State <- loc$X2 

# Get GPS coords where available ------------------------------------------ 

coords <- citidata %>% 
    html_nodes(xpath="//a[@class='showStreetViewLink']") %>% 
    html_attr("href") %>% 
    str_extract("([[:digit:]-,\\.]+)") %>% 
    str_split(",") %>% 
    unlist() %>% 
    matrix(ncol=2, byrow=TRUE) %>% 
    data.frame(stringsAsFactors=FALSE) %>% 
    rename(lat=X1, lon=X2) %>% 
    mutate(lat=as.numeric(lat), lon=as.numeric(lon)) 

# Get the incident time that goes with each coordinate pair, for the merge --

# Use citidata here: pg was only defined in the question's code
coord_time <- citidata %>% 
    html_nodes(xpath="//a[@class='showStreetViewLink']/../preceding-sibling::td[1]") %>% 
    html_text() %>% 
    data.frame(Date=., stringsAsFactors=FALSE) 

# Merge the coordinates with the data.frame we built earlier -------------- 

left_join(dat, bind_cols(coords, coord_time), by="Date") 
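
To make the coordinate-extraction step concrete, here is a minimal base R sketch on a single made-up link. The `href` value is hypothetical, and the `(lat,lon)` argument layout is an assumption inferred from the regex and the lat/lon column order above, not something confirmed against the page.

```r
# Hypothetical href; the "(lat,lon)" argument layout is assumed, not taken from the page
href <- "javascript:showGoogleSView(42.7654,-71.4676)"

# Pull out the first "number,number" run, allowing an optional leading minus sign
pair <- regmatches(href, regexpr("-?[0-9.]+,-?[0-9.]+", href))

# Split on the comma and convert both parts to numeric
parts <- as.numeric(strsplit(pair, ",", fixed = TRUE)[[1]])
coord <- data.frame(lat = parts[1], lon = parts[2])
coord
```

The pipeline above does the same thing for every `showStreetViewLink` anchor at once, which is why it reshapes the result with `matrix(ncol=2, ...)`.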

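Yes, some of the rows have no link available, and I think I could separate those out before merging everything together. But another problem is: if they are not in order (some are missing in the middle), how do I match them up with the time? – Jen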
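
On the ordering question: a join matches rows by the value of the key column, not by position, so gaps or a different row order should not matter as long as `Date` is unique. Below is a small sketch with made-up data (the dates, street names, and coordinates are all invented for illustration), using base R `merge()` with `all.x=TRUE` as a stand-in for `left_join()`.

```r
# Made-up incident table: three incidents, in page order
dat <- data.frame(
    Date     = c("Jan 1, 2013 01:00 PM", "Feb 2, 2013 02:00 PM", "Mar 3, 2013 03:00 PM"),
    Location = c("A Street", "B Street", "C Street"),
    stringsAsFactors = FALSE
)

# Made-up coordinates: only two incidents have them, listed in a different order
coords <- data.frame(
    Date = c("Mar 3, 2013 03:00 PM", "Jan 1, 2013 01:00 PM"),
    lat  = c(42.75, 42.76),
    lon  = c(-71.46, -71.47),
    stringsAsFactors = FALSE
)

# Keep every row of dat; rows without a matching Date get NA for lat/lon
merged <- merge(dat, coords, by = "Date", all.x = TRUE)
merged
```

The incident with no coordinates keeps its row and simply gets `NA` in `lat` and `lon`, so there is no need to split those rows out beforehand.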