2014-03-30
2

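R data.table: speeding up SI / metric prefix conversion

So here is the situation. I have a table with 85 million rows and 18 columns. Three of the columns contain values in metric prefix / SI notation (see Metric Prefix on Wikipedia).

This means I have numbers like:

  • .1M instead of 100000 or 1E+5, or
  • 1K instead of 1000 or 1E+3

A sample of the data.table: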

           V1       V2    V3   V4   V5   V6  V7 V8 V9  V10 V11 V12 V13 V14 V15 V16 V17 V18
1: 2014-03-25 12:15:12 58300 3010 44.0  4.5 0.0  0  0  0.8  50 0.8 10K 303 21K   0   a  56
2: 2014-03-25 12:15:12 56328 3010 28.0 12.0 0.0  0  0  0.3  60 0.0  59  62 .1M   0   a  66
3: 2014-03-25 12:15:12 21082 3010 10.0  1.7 0.0  0  0 14.0  72 0.3  4K 208  8K   1   a  80
4: 2014-03-25 12:15:12 59423 3010 12.0  0.0 0.2  0  0 88.0   0 0.0  20  16  71   0   a  26
5: 2014-03-25 12:15:12 59423 3010  9.6  1.4 0.0  0  0 60.0  29 0.2  2K 251  6K   0   a  56
6: 2014-03-25 12:15:12 24193 3010  8.3  1.9 0.0  0  0  9.9  80 0.3  3K 264  8K   1   a  71
7: 2014-03-25 12:15:12 21082 3010  7.1  1.7 0.4  0  0  6.3  83 0.3  3K 197  7K   0   a  71
8: 2014-03-25 12:15:12 59423 3010  4.6  1.2 0.0  0  0 57.0  37 0.1 998  81  7K   0   a 118

I adapted a function written by Hans-Jörg Bibiko, which he used to modify ggplot2 scales. If you are interested, see the site here. The function I ended up using is:

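sitor <- function(x)
{
    # exponent strings ("E-24" ... "E24") keyed by SI prefix; "" means no prefix
    conv <- paste("E", c(seq(-24, -3, by=3), -2, -1, 0, seq(3, 24, by=3)), sep="")
    names(conv) <- c("y","z","a","f","p","n","µ","m","c","d","","K","M","G","T","P","E","Z","Y")
    x <- as.character(x)
    # convert one value: the numeric part, followed by the exponent for its prefix
    num <- function(x) as.numeric(
        paste(
            # text before the first letter, i.e. the numeric part
            strsplit(x, "[A-z|µ]")[[1]][1],
            # first non-digit character is the prefix; empty means no prefix
            ifelse(substr(paste(strsplit(x, "[0-9|\\.]")[[1]], sep="", collapse=""), 1, 1) == "",
                "",
                conv[substr(paste(strsplit(x, "[0-9|\\.]")[[1]], sep="", collapse=""), 1, 1)]
            ),
            sep=""
        )
    )
    return(lapply(x, num))
}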

I apply it to the data.table to update the three columns like this:

temp[ ,`:=`(V13=sitor(V13),V14=sitor(V14),V15=sitor(V15)) ] 

I then set a data.table key vector on the temp table:

setkeyv(temp,c("V1","V2","V3","V18")) 

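After 61 minutes I am still waiting for the result... Any tips on how to speed up this conversion would be very handy, since my data size will grow 4 to 5 times.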

+0

More output: 'PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND' '4878 neurozen 20 0 18.7g 18g 11m R 100.1 62.8 63:38.95 rsession' – neurozen

+0

What has been running for 61 minutes? 'setkeyv(temp, c("V1","V2","V3","V18"))' or 'temp[, `:=`(V13 = sitor(V13), V14 = sitor(V14), V15 = sitor(V15))]'? And why do you want to sort temp? –

+0

'sitor' returns a list... are the columns in your dt list-type columns? – Michele
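
The point about the return type can be illustrated with a small base R sketch. Here `to_num` is a hypothetical stand-in converter that only handles a trailing "K"; it is not the asker's `sitor`. `lapply` gives back a list of length-one numerics, while `vapply` gives back a plain numeric vector:

```r
# Hypothetical stand-in for a per-element converter (assumption: only "K" is handled)
to_num <- function(s) as.numeric(sub("K$", "E3", s))

vals <- c("10K", "59", "4K")

# lapply always returns a list, one element per input value
as_list <- lapply(vals, to_num)

# vapply returns an atomic numeric vector of the same length
as_vector <- vapply(vals, to_num, numeric(1), USE.NAMES = FALSE)

is.list(as_list)       # the lapply result is a list
is.numeric(as_vector)  # the vapply result is a numeric vector
```

A fully vectorized function avoids the per-element call altogether, which is the direction the answers take.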

Answers

2

Why not try the sitools library?

library(data.table) 
dt<-data.table(var = sample(x=1:1e5, size=1e6, replace=T)) 
library(sitools) 
> system.time(dt[, var2 := f2si(var)]) 
    user system elapsed 
    10.08 0.09 10.89 

Edit: here is a data.table-based function that reverses f2si from the sitools package

si2f<-function(x){ 
    if(is.numeric(x)) return(x) 
    require(data.table) 
    dt<-data.table(lab=c("y","z","a","f","p","n","µ","m","c","d","", "da", "h", "k","M","G","T","P","E","Z","Y"), 
       mul=c(1e-24, 1e-21, 1e-18, 1e-15, 1e-12, 1e-9, 1e-6, 1e-3, 1e-2, 1e-1, 1L, 10L, 1e2, 1e3, 1e6, 1e9, 1e12, 1e15, 1e18, 1e21, 1e24), 
       key="lab") 
    res<-as.numeric(gsub("[^0-9|\\.]","", x)) 
    x<-gsub("[0-9]|\\s+|\\.","", x) 
    .subset2(dt[.(x)], "mul")*res 
} 

> system.time(dt[, var3 := si2f(var2)]) 
    user system elapsed 
    13.18 0.03 13.31 

> dt[, all.equal(var,var3)] 
[1] TRUE 
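
One thing to watch: the lookup table above lists a lowercase "k", while the sample data in the question uses an uppercase "K". Below is a minimal base R sketch of the same split-and-multiply idea. The helper name `si_to_num` is hypothetical, only a few prefixes are covered, and treating "K" as an alias for "k" is an assumption about the asker's data:

```r
si_to_num <- function(x) {
  x <- as.character(x)
  # multipliers keyed by prefix; "K" is an assumed alias for "k"
  mul <- c(k = 1e3, K = 1e3, M = 1e6, G = 1e9)
  # numeric part: keep digits and the decimal point only
  num <- as.numeric(gsub("[^0-9.]", "", x))
  # prefix part: what is left after removing digits, dots and whitespace
  pre <- gsub("[0-9.]|\\s", "", x)
  # look up the multiplier; unknown prefixes become NA
  fac <- unname(mul[pre])
  # values without a prefix are multiplied by 1
  fac[pre == ""] <- 1
  num * fac
}

si_to_num(c("10K", ".1M", "998", "2k"))
```

Because every step works on the whole vector at once, there is no per-element function call.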
+0

Wow, awesome. I'm going to try it out! – neurozen

+0

I can't install the scitools package for my version of R: ''scitools' is not available (for R version 3.0.3)' – neurozen

+0

@neurozen it's 'sitools' – Michele

1

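Here is an approach that took about 10 seconds on my computer to convert a vector with 10M values. It can be extended to cover more than "K", "M" & "G" by adding entries to the key at the top.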

> f_conv <- function(val){ 
+  # named vector mapping each prefix to its exponent string 
+  key <- c(Zero = "" 
+   , K = "E3" 
+   , M = "E6" 
+   , G = "E9" 
+   ) 
+  # position of the prefix letter in each value (-1 if there is none) 
+  indx <- regexpr("[KMG]", val) 
+  # extract the prefix 
+  exp <- substring(val, indx) 
+  # if there was none, then use "Zero" 
+  exp[indx == -1L] <- "Zero" 
+  # use a fake position so that substring() below keeps the whole value 
+  indx[indx == -1L] <- 20L 
+  # do the conversion: numeric part followed by the exponent string 
+  as.numeric(paste0(substring(val, 1L, indx - 1L) 
+     , key[exp] 
+     ) 
+    ) 
+ } 
> 
> # test data 
> n <- 10000000 
> result <- paste0(sample(1:999, n, TRUE) 
+    , sample(c("K", "M", "G", ""), n, TRUE) 
+   ) 
> 
> system.time(x <- f_conv(result)) 
    user system elapsed 
    8.48 0.13 8.63 
> cbind(result[1:50], x[1:50]) 
     [,1] [,2]   
[1,] "562K" "562000"  
[2,] "946" "946"   
[3,] "313G" "313000000000" 
[4,] "538M" "538000000" 
[5,] "697K" "697000"  
[6,] "486G" "486000000000" 
[7,] "814G" "814000000000" 
[8,] "842" "842"   
[9,] "993M" "993000000" 
[10,] "440K" "440000"  
[11,] "435G" "435000000000" 
[12,] "407M" "407000000" 
[13,] "919K" "919000"  
[14,] "840" "840"   
[15,] "766G" "766000000000" 
[16,] "977" "977"   
[17,] "139" "139"   
[18,] "195G" "195000000000" 
[19,] "609M" "609000000" 
[20,] "69" "69"   
[21,] "147M" "147000000" 
[22,] "104M" "104000000" 
[23,] "509K" "509000"  
[24,] "951M" "951000000" 
[25,] "278" "278"   
[26,] "797G" "797000000000" 
[27,] "106K" "106000"  
[28,] "667K" "667000"  
[29,] "521K" "521000"  
[30,] "9" "9"   
[31,] "17K" "17000"  
[32,] "673M" "673000000" 
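
Extending the function beyond "K", "M" and "G" comes down to adding entries to the key and widening the character class to match. A minimal sketch, assuming "T" and "P" are the extra prefixes of interest; the name `f_conv_ext` is hypothetical, and the pattern is built from the key names so the two cannot drift apart:

```r
f_conv_ext <- function(val) {
  # prefix -> exponent string; "Zero" stands for "no prefix"
  key <- c(Zero = "", K = "E3", M = "E6", G = "E9", T = "E12", P = "E15")
  # character class built from the key names, here "[KMGTP]"
  pattern <- paste0("[", paste(setdiff(names(key), "Zero"), collapse = ""), "]")
  # position of the prefix letter, -1 where there is none
  indx <- regexpr(pattern, val)
  exp <- substring(val, indx)
  exp[indx == -1L] <- "Zero"
  # fake position so that substring() keeps the whole value
  indx[indx == -1L] <- 20L
  as.numeric(paste0(substring(val, 1L, indx - 1L), key[exp]))
}

f_conv_ext(c("10K", ".1M", "998", "3T"))
```

With data.table loaded, one way to apply this to the three columns from the question would be `temp[, (cols) := lapply(.SD, f_conv_ext), .SDcols = cols]` with `cols <- c("V13", "V14", "V15")`.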
+0

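I'll give it a go. @Michelle @Data Munger FYI, I also thought about using 'grep' to select only the rows that contain a metric unit: 'temp[grep("[KMG]$", V1), V1 := sitor(V1)]'. The catch is that you have to return characters and then run as.numeric on the whole column again. – neurozen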

+0

I gave it a go and it works very well. The run time was 237.6 seconds for 3 columns of a data table with 83.7 million rows. Thanks! – neurozen