2014-10-27 80 views
2

Extracting elements from a string

Suppose I have the following data sets, whose columns are structured as shown below.

df1 = data.frame(Date=c(rnorm(5)), 
       "United States) New York (NY" = c(rnorm(5)), 
       "United States) Chicago (Illinois" = c(rnorm(5)), 
       "United States) Denver (Colorado" = c(rnorm(5)), 
       "United States) Seattle (Washington" = c(rnorm(5)), 
       "United States) Minneapolis (Minnesota" = c(rnorm(5)), check.names=FALSE) 
df1 

df2 = data.frame(Date=c(rnorm(5)), 
       "New York (New York, United States)" = c(rnorm(5)), 
       "Phoenix (Arizona, United States)" = c(rnorm(5)), 
       "Chicago (Illinois, United States)" = c(rnorm(5)), 
       "Los Angeles (California, United States)" = c(rnorm(5)), check.names=FALSE) 
df2 

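As you can see, each column simply represents a city, but the column names are unwieldy. Can anyone help me figure out how to extract the city name from the column-name strings?

I could prepare a dictionary of cities and do string matching, but I don't know how to go about that. I also think there is a way to do this with str_split, but I haven't figured it out yet.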

sapply(str_split(names(df1),")"), 2) 
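For reference, the attempt above cannot work as written, because `sapply()` needs a function (not the number 2) as its second argument. A minimal sketch of a split-based approach for names shaped like those in df1, using base R's `strsplit()` rather than `str_split`, might look like this. The vector `nms` is a hypothetical stand-in for `names(df1)[-1]`, and this assumes every name has the form `Country) City (State`:

```r
# Hypothetical stand-in for names(df1)[-1]
nms <- c("United States) New York (NY",
         "United States) Chicago (Illinois",
         "United States) Denver (Colorado")

# Split each name at ") " or " (", with the parentheses escaped
# by placing them inside bracket classes
parts <- strsplit(nms, "[)] | [(]")

# The city is the second piece of each split name
cities <- vapply(parts, `[`, character(1), 2)
cities
```

This does not carry over directly to names shaped like those in df2, where the city is the first piece rather than the second.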

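Of course, I'm sure there is also a gsub solution, but I'm fairly helpless when it comes to regular expressions.

Ultimately, I just want the actual city names as the column names.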

New York, Chicago, Denver, Seattle, Minneapolis 
+2

You might want to add 'check.names = FALSE' to those example data frame calls. – 2014-10-27 23:36:43

+0

Yes, good call. – ATMA 2014-10-27 23:42:04

Answers

3

You can use gsub. For the first data frame, this gives:

gsub(".*[)] (.*) [(].*", "\\1", names(df1)[-1]) 
# [1] "New York"    "Chicago"     "Denver"      "Seattle"     "Minneapolis"

For the second data frame, a slight tweak to the first regular expression will work:

gsub("(.*) [(].*", "\\1", names(df2)[-1]) 
# [1] "New York"    "Phoenix"     "Chicago"     "Los Angeles"

The two can be combined into a single pattern that handles both sets of names:

nms <- c(names(df1)[-1], names(df2)[-1]) 
gsub("(.*[)] |)(.*) [(].*", "\\2", nms) 
# [1] "New York"    "Chicago"     "Denver"      "Seattle"     "Minneapolis"
# [6] "New York"    "Phoenix"     "Chicago"     "Los Angeles"
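Since the goal in the question is to end up with the city names as the column names, here is a rough sketch of applying the combined pattern above to the data frames themselves. The helper name `extract_city` is made up for illustration, and the data frames are trimmed-down versions of the ones in the question. It relies on the fact that a name with no " (" in it, such as "Date", does not match the pattern and is therefore returned unchanged:

```r
# Hypothetical helper (not part of the original answer): apply the
# combined pattern to a vector of column names
extract_city <- function(x) {
  gsub("(.*[)] |)(.*) [(].*", "\\2", x)
}

# Trimmed-down versions of the question's data frames
df1 <- data.frame(Date = rnorm(5),
                  "United States) New York (NY" = rnorm(5),
                  "United States) Chicago (Illinois" = rnorm(5),
                  check.names = FALSE)
df2 <- data.frame(Date = rnorm(5),
                  "New York (New York, United States)" = rnorm(5),
                  "Phoenix (Arizona, United States)" = rnorm(5),
                  check.names = FALSE)

# Rename in place; "Date" has no " (" and so is left as it is
names(df1) <- extract_city(names(df1))
names(df2) <- extract_city(names(df2))

names(df1)
names(df2)
```

Renaming all columns at once avoids having to drop and re-attach the first column with `[-1]`, but it only works because "Date" happens not to match the pattern.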
+2

+1 regex black magic to the rescue – ialm 2014-10-27 23:46:32

+1

If regex is black magic, that would make @hwnd Sauron – 2014-10-28 00:54:53