2017-07-24 43 views

Split address string into city, state and address in R

I have addresses in the form of strings, like this:

dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA", 
           "1626 Aviation Way, Augusta, GA 30906, USA", 
           "325 Main St, Stratford, CT 06615, USA", 
           "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE) 

I want to split it into 5 columns: street, city, state, ZIP code and country. How can I do this in R?


Have a look at 'strsplit' or 'regexpr'. – ekstroem
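
To make that hint concrete, here is a minimal base R sketch of both functions on two of the sample addresses. The variable names and the `[A-Z]{2} [0-9]{5}` pattern are my own assumptions (a two-letter state followed by a five-digit ZIP), not something taken from the question.

```r
addresses <- c("1626 Aviation Way, Albuquerque, NM 30906, USA",
               "325 Main St, Stratford, CT 06615, USA")

# strsplit() returns a list with one character vector per input string
parts <- strsplit(addresses, ", ", fixed = TRUE)

# regexpr() locates the first match in each string and regmatches() extracts it.
# Assumed pattern: two capital letters, a space, five digits.
# Note that strings without a match are silently dropped from the result.
state_zip <- regmatches(addresses, regexpr("[A-Z]{2} [0-9]{5}", addresses))
```

`strsplit` gives you every piece at once, while `regexpr` is handier when you only need one field.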


Or, if you are working with a data frame, you can use the 'separate()' function from 'tidyr'. –


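I tried this: <-strsplit($ Adress, ","). I didn't get the right answer. Here is the error I get when I try to put the result into a data frame: Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 4, 5 – Kaushik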

Answers

1

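This ended up taking quite a few steps. You could do it in fewer, but this is how I did it. I'm also assuming your data starts out in a data frame with one address per row.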

dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA", 
       "1626 Aviation Way, Augusta, GA 30906, USA", 
       "325 Main St, Stratford, CT 06615, USA", 
       "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE) 

> dat 
                                       Addresses
1  1626 Aviation Way, Albuquerque, NM 30906, USA
2      1626 Aviation Way, Augusta, GA 30906, USA
3          325 Main St, Stratford, CT 06615, USA
4 4205 Bessie Coleman Blvd, Tampa, FL 33607, USA

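Now we need to split on the commas to start with, and then separate the state and ZIP. I'll also remove the extra whitespace left over from splitting on the commas.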

dat2 = sapply(dat$Addresses, strsplit, ",") 
dat2 = lapply(dat2, trimws) 

> dat2 
$`1626 Aviation Way, Albuquerque, NM 30906, USA` 
[1] "1626 Aviation Way" "Albuquerque"  "NM 30906"   "USA"    

$`1626 Aviation Way, Augusta, GA 30906, USA` 
[1] "1626 Aviation Way" "Augusta"   "GA 30906"   "USA"    

$`325 Main St, Stratford, CT 06615, USA` 
[1] "325 Main St" "Stratford" "CT 06615" "USA"   

$`4205 Bessie Coleman Blvd, Tampa, FL 33607, USA` 
[1] "4205 Bessie Coleman Blvd" "Tampa"     "FL 33607"     "USA"  
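
As a side note, the split and the trim can probably be folded into one call by splitting on a regular expression instead of a literal comma. This is only a sketch: the pattern `",\\s*"` (a comma plus any following whitespace) and the variable names are my own choices.

```r
addresses <- c("1626 Aviation Way, Albuquerque, NM 30906, USA",
               "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA")

# Split on a comma followed by optional whitespace, so no trimws() pass is needed.
# strsplit() is vectorised over its first argument, so sapply() is not required.
pieces <- strsplit(addresses, ",\\s*")
```

Calling `strsplit` directly on the vector also gives an unnamed list, which is a little easier to read than the list named by full address shown above.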

Now we need to turn this back into a data frame.

dat2 = data.frame(matrix(unlist(dat2), ncol = 4, byrow = TRUE), stringsAsFactors = FALSE) 

> dat2 
                        X1          X2       X3  X4
1        1626 Aviation Way Albuquerque NM 30906 USA
2        1626 Aviation Way     Augusta GA 30906 USA
3              325 Main St   Stratford CT 06615 USA
4 4205 Bessie Coleman Blvd       Tampa FL 33607 USA
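
An alternative to `matrix(unlist(...))` is to bind the pieces row by row with `do.call(rbind, ...)`. The sketch below assumes every address splits into the same number of pieces; the input list is typed in by hand here just to keep the example self-contained.

```r
# Already-split pieces, as produced by the strsplit() step
pieces <- list(c("1626 Aviation Way", "Albuquerque", "NM 30906", "USA"),
               c("325 Main St", "Stratford", "CT 06615", "USA"))

# rbind() stacks each character vector as one row of a character matrix
wide <- as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)
```

The columns come out named V1 to V4 rather than X1 to X4, so any later code that refers to `X3` would need adjusting.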

Next, we can split X3 into state and ZIP, and then drop that column.

dat2$State = sapply(dat2$X3, function(x) strsplit(x, " ")[[1]][1]) 
dat2$Zip = sapply(dat2$X3, function(x) strsplit(x, " ")[[1]][2]) 

dat2 = dat2[, -3] 

> dat2 
                        X1          X2  X4 State   Zip
1        1626 Aviation Way Albuquerque USA    NM 30906
2        1626 Aviation Way     Augusta USA    GA 30906
3              325 Main St   Stratford USA    CT 06615
4 4205 Bessie Coleman Blvd       Tampa USA    FL 33607
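
The two `sapply` calls could also be written with the vectorised `sub()`, which avoids splitting each string twice. This is a sketch under the assumption that each value has the form "state, space, ZIP"; the variable names are mine.

```r
state_zip <- c("NM 30906", "GA 30906", "CT 06615", "FL 33607")

# Drop everything from the first space onward to keep the state
state <- sub(" .*$", "", state_zip)

# Drop everything up to the last space to keep the ZIP.
# Keeping it as character preserves leading zeros such as "06615".
zip <- sub("^.* ", "", state_zip)
```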

Finally, we can set the column names and we're done.

colnames(dat2) = c("Street", "City", "Country", "State", "Zip") 
> dat2 
                    Street        City Country State   Zip
1        1626 Aviation Way Albuquerque     USA    NM 30906
2        1626 Aviation Way     Augusta     USA    GA 30906
3              325 Main St   Stratford     USA    CT 06615
4 4205 Bessie Coleman Blvd       Tampa     USA    FL 33607

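@kristoferesen, I get the following error when I run the data frame command: "Warning message: In matrix(unlist(dat2), ncol = 4, byrow = TRUE) : data length [413] is not a sub-multiple or multiple of the number of rows [0124]" – Kaushik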


@Kaushik Does your data look exactly like mine right before it gets turned back into a data frame? – Kristofersen


@Kaushik Make sure you include 'stringsAsFactors = FALSE' in your original data frame. Otherwise the addresses will be factors and strsplit won't work. – Kristofersen
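
One possible cause of that warning, and this is an assumption since the real data isn't shown, is that some addresses don't split into exactly four pieces. A quick way to check is to count the pieces per address. The second address below is hypothetical and has an extra comma on purpose.

```r
# Hypothetical input: the second address contains an extra comma (a suite number)
addresses <- c("325 Main St, Stratford, CT 06615, USA",
               "1 Example Rd, Suite 5, Tampa, FL 33607, USA")

# as.character() guards against the column having been read in as a factor
pieces <- strsplit(as.character(addresses), ",\\s*")

# Number of pieces per address, and the positions that are not exactly 4
n_pieces <- lengths(pieces)
bad <- which(n_pieces != 4)

# The addresses that would break the ncol = 4 reshaping
addresses[bad]
```

Rows flagged this way would need to be cleaned up or handled separately before reshaping into four columns.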

1

I solved it with one line of code. It may look a bit naive to regex experts, but it should work for the sample data.

library(stringr) 

dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA", 
           "1626 Aviation Way, Augusta, GA 30906, USA", 
           "325 Main St, Stratford, CT 06615, USA", 
           "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE) 

str_match(dat$Addresses,"(.+), (.+), (.+) (.+), (.+)")[ ,-1] 
     [,1]                       [,2]          [,3] [,4]    [,5]
[1,] "1626 Aviation Way"        "Albuquerque" "NM" "30906" "USA"
[2,] "1626 Aviation Way"        "Augusta"     "GA" "30906" "USA"
[3,] "325 Main St"              "Stratford"   "CT" "06615" "USA"
[4,] "4205 Bessie Coleman Blvd" "Tampa"       "FL" "33607" "USA"
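
If you would rather not depend on stringr, roughly the same thing can be sketched in base R with `regexec()` and `regmatches()`. The tighter pattern (anchors, a two-letter state, a five-digit ZIP) and the column names are my own assumptions, not part of the one-liner above.

```r
addresses <- c("1626 Aviation Way, Albuquerque, NM 30906, USA",
               "1626 Aviation Way, Augusta, GA 30906, USA",
               "325 Main St, Stratford, CT 06615, USA",
               "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA")

# Assumed layout: street, city, two-letter state, five-digit ZIP, country
pattern <- "^(.+), (.+), ([A-Z]{2}) ([0-9]{5}), (.+)$"

# regexec() records the full match plus each capture group;
# regmatches() turns that into one character vector per address
m <- regmatches(addresses, regexec(pattern, addresses))

# Stack the vectors into a matrix and drop column 1 (the full match).
# This assumes every address matched; a non-match yields character(0).
parsed <- as.data.frame(do.call(rbind, m)[, -1, drop = FALSE],
                        stringsAsFactors = FALSE)
names(parsed) <- c("Street", "City", "State", "Zip", "Country")
```

This returns a data frame with named columns instead of a bare matrix, which may be closer to what the question asks for.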