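Replace specific characters within a variable in a data frame in R

I want to replace all ,, -, ), ( and (space) with . in the variable DMA.NAME of the example data frame. I referred to three posts and tried their approaches, but all failed: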

Approach 1

> shouldbecomeperiod <- c$DMA.NAME %in% c("-", ",", " ", "(", ")") 
c$DMA.NAME[shouldbecomeperiod] <- "."

Approach 2

> removetext <- c("-", ",", " ", "(", ")") 
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME) 
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME, fixed = TRUE) 

Warning message: 
In gsub(removetext, ".", c$DMA.NAME) : 
    argument 'pattern' has length > 1 and only the first element will be used

Approach 3

> c[c == c(" ", ",", "(", ")", "-")] <- "."

Sample data frame

> df
      DMA.CODE                  DATE                   DMA.NAME count
111         22 8/14/2014 12:00:00 AM               Columbus, OH     1
112         23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn     1
79          18 7/30/2014 12:00:00 AM        Boston (Manchester)     1
99          22 8/20/2014 12:00:00 AM               Columbus, OH     1
112.1       23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn     1
208         27 7/31/2014 12:00:00 AM       Minneapolis-St. Paul     1

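I understand the problem: gsub takes a single pattern and uses only the first element of the vector I pass. The other two approaches test whether the entire value of the variable equals one of the characters, instead of searching for those characters inside each value.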
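To make that diagnosis concrete, here is a minimal sketch on a toy vector (the vector x and the hand-written pattern are illustrative assumptions, not part of the original data). It contrasts whole-value matching with %in% against a single bracket expression that gsub can search for inside each string.

```r
# Toy values modelled on DMA.NAME (illustrative only)
x <- c("Columbus, OH", "Boston (Manchester)", "-")
targets <- c("-", ",", " ", "(", ")")

# %in% compares whole elements, so only the bare "-" matches
whole_match <- x %in% targets

# A single bracket expression covers all five characters;
# "-" is placed first so it is read literally, not as a range
pattern <- "[-, ()]"
replaced <- gsub(pattern, ".", x)

print(whole_match)
print(replaced)
```

Because each character is replaced separately here, a comma followed by a space turns into two periods.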

Source

2014-10-20 vagabond

'gsub' finds *all* matches. 'sub' only replaces the first match. – 2014-10-20 16:24:06

Thanks! I was using it incorrectly. – vagabond 2014-10-20 21:56:52
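A small sketch of that difference, using one of the sample names (fixed = TRUE is used here only so the hyphen is treated as a literal string):

```r
x <- "Orlando-Daytona Bch-Melbrn"

# sub() stops after the first match
first_only <- sub("-", ".", x, fixed = TRUE)

# gsub() replaces every match
all_matches <- gsub("-", ".", x, fixed = TRUE)

print(first_only)
print(all_matches)
```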

You can use the special classes [:punct:] and [:space:] inside a bracket expression ([...]) like this:

df <- data.frame(
    DMA.NAME = c(
        "Columbus, OH",
        "Orlando-Daytona Bch-Melbrn",
        "Boston (Manchester)",
        "Columbus, OH",
        "Orlando-Daytona Bch-Melbrn",
        "Minneapolis-St. Paul"),
    stringsAsFactors = FALSE)
##
> gsub("[[:punct:][:space:]]+", "\\.", df$DMA.NAME)
[1] "Columbus.OH"                "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester."         "Columbus.OH"
[5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"
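For what it's worth, here is a short sketch of what the trailing + does and how the result could be written back to the column. The two-row data frame is a cut-down stand-in for the sample data, and the final step that strips a trailing period is my own optional addition, not something the answer above proposes.

```r
# Cut-down stand-in for the sample data (illustrative only)
df <- data.frame(
    DMA.NAME = c("Columbus, OH", "Boston (Manchester)"),
    stringsAsFactors = FALSE)

# Without "+", every matched character becomes its own period
no_plus <- gsub("[[:punct:][:space:]]", ".", df$DMA.NAME)

# With "+", a run of matched characters collapses into one period
with_plus <- gsub("[[:punct:][:space:]]+", ".", df$DMA.NAME)

# Write the cleaned values back to the column
df$DMA.NAME <- with_plus

# Optional extra step (an assumption about what may be wanted):
# drop a single trailing period such as the one left by ")"
trimmed <- sub("\\.$", "", df$DMA.NAME)

print(no_plus)
print(with_plus)
print(trimmed)
```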

Source

2014-10-20 16:43:43 nrussell

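Thanks! And the [:punct:] class is mentioned here: http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html – vagabond 2014-10-20 16:53:35

If your data frame is large, you might want to look at this fast function from the stringi package. It replaces every character of a specific class with another string. In this case the character class is L, meaning letters (inside the {}), but the capital P (before the {}) indicates that we are looking for the complement of this set, that is, every non-letter character. merge indicates that consecutive matches should be merged into a single match.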

library(stringi)
stri_replace_all_charclass(df$DMA.NAME, "\\P{L}", ".", merge = TRUE)
## [1] "Columbus.OH"    "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester."   "Columbus.OH"    
## [5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"
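One thing to keep in mind: "every non-letter character" also covers digits, so this pattern is not strictly equivalent to the [[:punct:][:space:]] one. The sketch below shows the difference using base R only, with [^[:alpha:]] as a rough stand-in for \P{L} (an assumption that holds for plain ASCII input like this; the first input string is made up for illustration).

```r
# "Route 66, AZ" is a made-up value that contains digits
x <- c("Route 66, AZ", "Minneapolis-St. Paul")

# Punctuation and whitespace only: digits are kept
punct_space <- gsub("[[:punct:][:space:]]+", ".", x)

# Everything that is not a letter: digits are replaced too
non_letters <- gsub("[^[:alpha:]]+", ".", x)

print(punct_space)
print(non_letters)
```

For the sample data in the question the two give the same result, since none of the names contain digits.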

And some benchmarks:

x <- sample(df$DMA.NAME, 1000, TRUE)

gsubFun <- function(x) {
    gsub("[[:punct:][:space:]]+", "\\.", x)
}

striFun <- function(x) {
    stri_replace_all_charclass(x, "\\P{L}", ".", TRUE)
}

library(microbenchmark)
microbenchmark(gsubFun(x), striFun(x))
Unit: microseconds
       expr      min        lq   median        uq       max neval
 gsubFun(x) 3472.276 3511.0015 3538.097 3573.5835 11039.984   100
 striFun(x)  877.259  893.3945  907.769  929.8065  3189.017   100

Source

2014-10-21 06:11:46 bartektartanus
