2017-08-16 77 views
2

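Extracting city and state information from a Google street address

I have a dataset with the latitude and longitude of a number of point locations, and I want to know which city and state each point is associated with.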

Following this example, I used the revgeocode function from ggmap to get a street address for each location, which produced the following data frame:

df <- structure(list(PointID = c(1787L, 2805L, 3025L, 3027L, 3028L, 
3029L, 3030L, 3031L, 3033L), Latitude = c(38.36648102, 36.19548585, 
43.419774, 43.437222, 43.454722, 43.452643, 43.411949, 43.255479, 
43.261464), Longitude = c(-76.4802046, -94.21554661, -87.960399, 
-88.018333, -87.974722, -87.978542, -87.94149, -87.986433, -87.968612 
), Address = structure(c(2L, 8L, 5L, 3L, 9L, 7L, 4L, 1L, 6L), .Label = c("13004 N Thomas Dr, Mequon, WI 53097, USA", 
"2160 Turner Rd, Lusby, MD 20657, USA", "2805 County Rd Y, Saukville, WI 53080, USA", 
"3701-3739 County Hwy W, Saukville, WI 53080, USA", "3907 Echo Ln, Saukville, WI 53080, USA", 
"4823 W Bonniwell Rd, Mequon, WI 53097, USA", "5100-5260 County Rd I, Saukville, WI 53080, USA", 
"7948 W Gibbs Rd, Springdale, AR 72762, USA", "River Park Rd, Saukville, WI 53080, USA" 
), class = "factor")), row.names = c(NA, -9L), class = "data.frame", .Names = c("PointID", 
"Latitude", "Longitude", "Address")) 

I would like to use R to extract the city and state from the full street address and create two columns to store them ("City" and "State").

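I assume the stringr package is the way to go, but I am not sure how to use it. The example mentioned above uses the code below to extract the zip code (there, the address column is named "result"). Their dataset: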

#  ID Longitude Latitude           result 
# 1 311175 41.29844 -72.92918 16 Church Street South, New Haven, CT 06519, USA 
# 2 292058 41.93694 -87.66984 1632 West Nelson Street, Chicago, IL 60657, USA 
# 3 12979 37.58096 -77.47144 2077-2199 Seddon Way, Richmond, VA 23230, USA 

And the code that extracts the zip code:

library(stringr) 
data$zipcode <- substr(str_extract(data$result," [0-9]{5}, .+"),2,6) 
data[,-4] 

Can the code above be easily modified to get the city and state as well?

+0

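You have received several good answers below. Consider accepting the one that helped you most (the check mark on the left). This lets the community know it worked for you and acknowledges the help you received. – CPak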

Answers

4

The city and state can be obtained from revgeocode() itself:

df <- cbind(df,do.call(rbind, 
       lapply(1:nrow(df), 
       function(i) 
       revgeocode(as.numeric(
       df[i,3:2]), output = "more")[c("administrative_area_level_1","locality")]))) 

df 

# PointID Latitude Longitude           Address 
# 1 1787 38.36648 -76.48020    2160 Turner Rd, Lusby, MD 20657, USA 
# 2 2805 36.19549 -94.21555  7948 W Gibbs Rd, Springdale, AR 72762, USA 
# 3 3025 43.41977 -87.96040   3907 Echo Ln, Saukville, WI 53080, USA 
# 4 3027 43.43722 -88.01833  2805 County Rd Y, Saukville, WI 53080, USA 
# 5 3028 43.45472 -87.97472   River Park Rd, Saukville, WI 53080, USA 
# 6 3029 43.45264 -87.97854 5100-5260 County Rd I, Saukville, WI 53080, USA 
# 7 3030 43.41195 -87.94149 3701-3739 County Hwy W, Saukville, WI 53080, USA 
# 8 3031 43.25548 -87.98643   13004 N Thomas Dr, Mequon, WI 53097, USA 
# 9 3033 43.26146 -87.96861  4823 W Bonniwell Rd, Mequon, WI 53097, USA 
# administrative_area_level_1 locality 
# 1     Maryland  Lusby 
# 2     Arkansas Springdale 
# 3     Wisconsin Saukville 
# 4     Wisconsin Saukville 
# 5     Wisconsin Saukville 
# 6     Wisconsin Saukville 
# 7     Wisconsin Saukville 
# 8     Wisconsin  Mequon 
# 9     Wisconsin  Mequon 

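P.S. You can do everything in one step (including getting the address and/or the zip code). Just add "address" and/or "postal_code" to c("administrative_area_level_1","locality"), which is the list of variables you want to extract.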
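Note that the administrative_area_level_1 column in the output above holds full state names ("Maryland"), while the street addresses use two-letter codes ("MD"). If the two-letter form is wanted, a minimal sketch using base R's built-in state.name and state.abb vectors could look like this. The vector state_full is a hypothetical stand-in for that column, and the lookup only covers the 50 states, so anything else (for example the District of Columbia) comes back as NA.

```r
# Hypothetical stand-in for the administrative_area_level_1 column
state_full <- c("Maryland", "Arkansas", "Wisconsin")

# state.name and state.abb are parallel built-in vectors for the 50 US states;
# match() finds each name's position, which is then used to index state.abb
state_abbr <- state.abb[match(state_full, state.name)]

state_abbr
```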

2

1) sub: use sub as shown below. No packages are needed.

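The regular expression matches the start of the string (^), followed by the shortest string up to a comma and a space, followed by the shortest string (the city) up to another comma and space, followed by two characters (the state), a space, five characters (the zip code), a comma, a space, USA, and the end of the string. The parenthesized parts of the match can be referred to as \1, \2 and \3, but inside double quotes each backslash must be doubled.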

If your zip codes are not all exactly 5 characters long, try pat <- "^.*?, (.*?), (..) (.*), USA$" instead.

pat <- "^.*?, (.*?), (..) (.....), USA$" 
transform(df, City = sub(pat, "\\1", Address), 
       State = sub(pat, "\\2", Address), 
       Zip = sub(pat, "\\3", Address)) 

giving:

PointID Latitude Longitude           Address  City State Zip 
1 1787 38.36648 -76.48020    2160 Turner Rd, Lusby, MD 20657, USA  Lusby MD 20657 
2 2805 36.19549 -94.21555  7948 W Gibbs Rd, Springdale, AR 72762, USA Springdale AR 72762 
3 3025 43.41977 -87.96040   3907 Echo Ln, Saukville, WI 53080, USA Saukville WI 53080 
4 3027 43.43722 -88.01833  2805 County Rd Y, Saukville, WI 53080, USA Saukville WI 53080 
5 3028 43.45472 -87.97472   River Park Rd, Saukville, WI 53080, USA Saukville WI 53080 
6 3029 43.45264 -87.97854 5100-5260 County Rd I, Saukville, WI 53080, USA Saukville WI 53080 
7 3030 43.41195 -87.94149 3701-3739 County Hwy W, Saukville, WI 53080, USA Saukville WI 53080 
8 3031 43.25548 -87.98643   13004 N Thomas Dr, Mequon, WI 53097, USA  Mequon WI 53097 
9 3033 43.26146 -87.96861  4823 W Bonniwell Rd, Mequon, WI 53097, USA  Mequon WI 53097 
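One thing to keep in mind with this approach: when the pattern does not match, sub returns the input string unchanged, so a malformed address would end up copied into City, State and Zip. A minimal sketch of guarding against that with grepl follows. The vector addr and its second element are made-up illustration data, not part of the original dataset.

```r
pat <- "^.*?, (.*?), (..) (.....), USA$"

# Illustration data: the second address does not end in ", USA"
addr <- c("2160 Turner Rd, Lusby, MD 20657, USA",
          "Unnamed Road, Canada")

# TRUE where the whole pattern matches
ok <- grepl(pat, addr)

# Only trust the substitution where the pattern matched; use NA otherwise
city  <- ifelse(ok, sub(pat, "\\1", addr), NA_character_)
state <- ifelse(ok, sub(pat, "\\2", addr), NA_character_)

data.frame(addr, city, state, stringsAsFactors = FALSE)
```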

2) read.pattern: another possibility is read.pattern from the gsubfn package, using the same pat as above:

library(gsubfn) 

cn <- c("City", "State", "Zip") 
Address <- as.character(df$Address) 
cbind(df, read.pattern(text = Address, pattern = pat, as.is = TRUE, col.names = cn)) 
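If installing gsubfn is not an option, base R (version 3.4.0 or later) offers strcapture in the utils package, which also turns capture groups into data frame columns. The following is a sketch of that alternative, not part of the original answer. The Address vector holds two sample addresses from the question, and perl = TRUE is my own choice to make the lazy quantifiers follow PCRE rules.

```r
pat <- "^.*?, (.*?), (..) (.....), USA$"

# Two sample addresses taken from the question's data
Address <- c("2160 Turner Rd, Lusby, MD 20657, USA",
             "7948 W Gibbs Rd, Springdale, AR 72762, USA")

# proto gives the names and types of the output columns, one per capture group
proto <- data.frame(City = character(), State = character(), Zip = character(),
                    stringsAsFactors = FALSE)

parts <- strcapture(pat, Address, proto, perl = TRUE)
parts
```

Declaring Zip as character in proto keeps any leading zeros, which matters for zip codes such as 06519.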
2

If you prefer to use stringr, you could do this:

library(stringr) 
library(data.table) 

parse_address <- function(address){ 

    address <- address %>% 
    str_split(",") %>% 
    .[[1]] 
    state <- address %>% 
    .[3] %>% 
    str_replace_all("[^A-Z]","") 

    zip <- address %>% 
    .[3] %>% 
    str_replace_all("[^0-9]","") 

    city <- address %>% 
    .[2] %>% 
    str_trim() 

    street <- address %>% 
    .[1] %>% 
    str_trim() 

    data.table(street, city, state, zip) 
} 

lapply(df$Address, parse_address) %>% 
    rbindlist 