2014-10-20 72 views
4

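Replace specific characters in a variable of a data frame in R

In the example data frame, I want to replace every ",", "-", ")", "(" and space in the variable DMA.NAME with ".". I looked at three posts and tried their methods, but all of them failed: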

Replacing column values in data frame, not included in list

R replace all particular values in a data frame

Replace characters from a column of a data frame R

Approach 1

> shouldbecomeperiod <- c$DMA.NAME %in% c("-", ",", " ", "(", ")") 
c$DMA.NAME[shouldbecomeperiod] <- "." 

Approach 2

> removetext <- c("-", ",", " ", "(", ")") 
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME) 
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME, fixed = TRUE) 

Warning message: 
In gsub(removetext, ".", c$DMA.NAME) : 
    argument 'pattern' has length > 1 and only the first element will be used 

Approach 3

> c[c == c(" ", ",", "(", ")", "-")] <- "." 

Sample data frame

> df 
      DMA.CODE                  DATE                   DMA.NAME count
111         22 8/14/2014 12:00:00 AM               Columbus, OH     1
112         23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn     1
79          18 7/30/2014 12:00:00 AM        Boston (Manchester)     1
99          22 8/20/2014 12:00:00 AM               Columbus, OH     1
112.1       23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn     1
208         27 7/31/2014 12:00:00 AM       Minneapolis-St. Paul     1

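I understand the problem: gsub takes a single pattern, so only the first element of the vector is used. The other two approaches compare the whole value of the variable against each character exactly, instead of searching for those characters inside the value.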
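To make that diagnosis concrete, here is a minimal sketch of one possible fix: collapse the character vector into a single bracket expression, so that gsub receives one pattern. The names dma, chars, pattern, whole_match and replaced are made up for illustration, and the values are copied from the DMA.NAME column above.

```r
# Sample values taken from the DMA.NAME column in the question
dma <- c("Columbus, OH", "Orlando-Daytona Bch-Melbrn",
         "Boston (Manchester)", "Minneapolis-St. Paul")

# %in% compares whole values, so no element equals a single character
whole_match <- dma %in% c("-", ",", " ", "(", ")")

# Collapse the characters into one bracket expression; "-" comes first
# so that it is read literally rather than as a range
chars <- c("-", ",", " ", "(", ")")
pattern <- paste0("[", paste(chars, collapse = ""), "]")  # "[-, ()]"

# gsub now gets a single pattern and replaces every matching character
replaced <- gsub(pattern, ".", dma)

print(whole_match)
# [1] FALSE FALSE FALSE FALSE
print(replaced)
# [1] "Columbus..OH"               "Orlando.Daytona.Bch.Melbrn"
# [3] "Boston..Manchester."        "Minneapolis.St..Paul"
```

Each matching character becomes its own period here, which is why "Columbus, OH" ends up with two. Appending + to the bracket expression would collapse a run of matches into a single period.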

+7

'gsub' finds *all* matches. 'sub' only replaces the first match. – 2014-10-20 16:24:06

+0

Thanks! I was using it incorrectly. – vagabond 2014-10-20 21:56:52

Answers

4

You can use the special character classes [:punct:] and [:space:] inside a bracket expression ([...]), like this:

df <- data.frame(
    DMA.NAME = c(
    "Columbus, OH", 
    "Orlando-Daytona Bch-Melbrn", 
    "Boston (Manchester)", 
    "Columbus, OH", 
    "Orlando-Daytona Bch-Melbrn", 
    "Minneapolis-St. Paul"), 
    stringsAsFactors=F) 
## 
> gsub("[[:punct:][:space:]]+","\\.",df$DMA.NAME) 
[1] "Columbus.OH"    "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester."   "Columbus.OH"    
[5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul" 
+0

Thanks! The [:punct:] class is documented here: http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html – vagabond 2014-10-20 16:53:35

3

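If the data frame is large, you may want to look at this fast function from the stringi package. It replaces every character of a given class with another string. Here the character class is L, letters (inside the {}), but the capital P (before the {}) means we want the complement of that set, that is, every non-letter character. merge means that consecutive matches should be merged into a single match.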

require(stringi) 
stri_replace_all_charclass(df$DMA.NAME, "\\P{L}",".", merge=T) 
## [1] "Columbus.OH"    "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester."   "Columbus.OH"    
## [5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul" 
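Note that "every non-letter character" is a wider set than punctuation plus whitespace, because it also covers digits. The sketch below shows the difference in base R only, using [^[:alpha:]] as an assumed stand-in for \\P{L} (the two should agree on ASCII input; non-ASCII letters may depend on the locale). "Route 66, AZ" is a made-up value, not part of the question's data.

```r
# "Route 66, AZ" is a hypothetical value added to show the digit case
x <- c("Minneapolis-St. Paul", "Route 66, AZ")

# Punctuation and whitespace only: digits are kept
punct_space <- gsub("[[:punct:][:space:]]+", ".", x)

# Base R stand-in for "every non-letter": digits are replaced as well
non_letter <- gsub("[^[:alpha:]]+", ".", x)

print(punct_space)
# [1] "Minneapolis.St.Paul" "Route.66.AZ"
print(non_letter)
# [1] "Minneapolis.St.Paul" "Route.AZ"
```

For the sample data in the question the two patterns give the same result, since none of the names contain digits.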

And some benchmarks:

x <- sample(df$DMA.NAME, 1000, T) 
gsubFun <- function(x){ 
    gsub("[[:punct:][:space:]]+","\\.",x) 
} 

striFun <- function(x){ 
    stri_replace_all_charclass(x, "\\P{L}",".", T) 
} 


require(microbenchmark) 
microbenchmark(gsubFun(x), striFun(x)) 
Unit: microseconds
       expr      min        lq   median        uq       max neval
 gsubFun(x) 3472.276 3511.0015 3538.097 3573.5835 11039.984   100
 striFun(x)  877.259  893.3945  907.769  929.8065  3189.017   100