2017-05-05 46 views
3

I am looking for an efficient way to add summary rows to a data.table. I have the following table:

require(data.table) 
dt1 <- data.table(ind = 1:8, cat = c("A", "A", "A", "B", "B", "C", "C", "D"), counts = (10:3)) 

   ind cat counts
1:   1   A     10
2:   2   A      9
3:   3   A      8
4:   4   B      7
5:   5   B      6
6:   6   C      5
7:   7   C      4
8:   8   D      3

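What I want to achieve is to add one row per cat whose counts holds the difference between sum(counts) for cat A and sum(counts) for that cat. For these rows ind should be 0. Basically, I want to rbind the following information: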

added_info <- cbind(ind =0, dt1[, .(counts = dt1[cat == "A", sum(counts)] - sum(counts)), by = cat]) 

> added_info 
   ind cat counts
1:   0   A      0
2:   0   B     14
3:   0   C     18
4:   0   D     24

And the end result would be:

dt1 <- rbind(dt1, added_info)[order(cat)] 

> dt1 
    ind cat counts
 1:   1   A     10
 2:   2   A      9
 3:   3   A      8
 4:   0   A      0
 5:   4   B      7
 6:   5   B      6
 7:   0   B     14
 8:   6   C      5
 9:   7   C      4
10:   0   C     18
11:   8   D      3
12:   0   D     24

My question is whether there is a better (shorter) way to achieve this using data.table (for example with .I or .N)?

+0

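If the sum of counts for cat A is stored in 'x', you can do it in one step with 'rbind(dt1, dt1[, .(ind = 0, counts = x - sum(counts)), by = cat], use.names = TRUE)', but I don't think it will make much of a difference –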

+0

Maybe 'dt1[, c := sum(counts[cat == "A"])][, .(ind = c(ind, 0), counts = c(counts, c[.N] - counts[.N])), cat][]'? – lukeA

+0

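@docendodiscimus Yes, you are right, there is no significant difference. @lukeA That is not quite what I wanted, but changing it to 'dt1[, c := sum(counts[cat == "A"])][, .(ind = c(ind, 0), counts = c(counts, c[.N] - sum(counts))), cat][]' gives me the result I was expecting – User2321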
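
The one-step rbind suggested in the first comment can be written out as a runnable sketch. It assumes 'x' holds the total counts for cat A, as the comment describes; the names 'added' and 'res' are introduced here only for illustration.

```r
library(data.table)

dt1 <- data.table(ind = 1:8,
                  cat = c("A", "A", "A", "B", "B", "C", "C", "D"),
                  counts = 10:3)

# x is the name used in the comment: total counts for cat "A"
x <- dt1[cat == "A", sum(counts)]

# One summary row per cat, with ind fixed at 0
added <- dt1[, .(ind = 0L, counts = x - sum(counts)), by = cat]

# `added` has cat as its first column, so bind by name rather than by position
res <- rbind(dt1, added, use.names = TRUE)[order(cat)]
res
```

Ordering by cat alone keeps the original rows ahead of the added ind = 0 row within each cat, which matches the layout shown in the question.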

Answers

4

You can do

require(data.table) 
dt1 <- data.table(ind = 1:8, cat = c("A", "A", "A", "B", "B", "C", "C", "D"), counts = (10:3)) 
dt1[,c:=sum(counts[cat=="A"])][,.(ind=c(ind,0), counts=c(counts,c[.N]-sum(counts))),cat][] 
#     cat ind counts
#  1:   A   1     10
#  2:   A   2      9
#  3:   A   3      8
#  4:   A   0      0
#  5:   B   4      7
#  6:   B   5      6
#  7:   B   0     14
#  8:   C   6      5
#  9:   C   7      4
# 10:   C   0     18
# 11:   D   8      3
# 12:   D   0     24
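
One thing to be aware of with this answer: ':=' updates dt1 by reference, so the chained call leaves a helper column 'c' behind in dt1. A minimal variant, assuming you would rather keep dt1 unchanged, stores the A total in an ordinary variable instead (the name 'a_sum' is made up for this sketch):

```r
library(data.table)

dt1 <- data.table(ind = 1:8,
                  cat = c("A", "A", "A", "B", "B", "C", "C", "D"),
                  counts = 10:3)

# Total counts for cat "A", kept outside the table (hypothetical name)
a_sum <- dt1[cat == "A", sum(counts)]

# Per cat: append ind = 0 and the difference between the A total and the group total
res <- dt1[, .(ind = c(ind, 0L),
               counts = c(counts, a_sum - sum(counts))),
           by = cat]

res
names(dt1)  # dt1 still has only its original columns
```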
1

This could be a solution within a single data.table call:

dt1[, rbind(.SD, 
      data.table(ind = 0, 
         counts = dt1[cat == 'A', sum(counts)] - sum(.SD$counts))), 
    by = cat] 

Output:

    cat ind counts
 1:   A   1     10
 2:   A   2      9
 3:   A   3      8
 4:   A   0      0
 5:   B   4      7
 6:   B   5      6
 7:   B   0     14
 8:   C   6      5
 9:   C   7      4
10:   C   0     18
11:   D   8      3
12:   D   0     24
+1

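Don't you think that calling 'data.table()' by group is less efficient than computing the summary of dt1 and building the structure only once? Also, you have to compute the A-sum in each group, right? (I haven't tested it, this is just a guess) –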

+0

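True @docendodiscimus, it is 2.5 times less efficient on my laptop (and that is without counting the A-sum computed in each group). But I am just providing an alternative, to expand the ways you can do this. With a small data set this will not matter much. – LyzandeR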
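
The efficiency point raised in these comments can be checked with a rough timing sketch. The table size, the seed and the names 'big', 'ref', 'per_group' and 'once' are all assumptions made for illustration; timings depend on the machine and the data, so none are shown here.

```r
library(data.table)

set.seed(1)
n_groups <- 2000L
big <- data.table(ind = seq_len(n_groups * 5L),
                  cat = rep(sprintf("g%04d", seq_len(n_groups)), each = 5L),
                  counts = sample(1:10, n_groups * 5L, replace = TRUE))
ref <- "g0001"  # plays the role of cat "A"

# Variant 1: data.table() and rbind() are called once per group,
# and the reference sum is recomputed inside every group
t_per_group <- system.time(
  per_group <- big[, rbind(.SD,
                           data.table(ind = 0L,
                                      counts = big[cat == ref, sum(counts)] -
                                        sum(.SD$counts))),
                   by = cat]
)

# Variant 2: the reference sum and the summary rows are computed once,
# followed by a single rbind
t_once <- system.time({
  ref_sum <- big[cat == ref, sum(counts)]
  once <- rbind(big,
                big[, .(ind = 0L, counts = ref_sum - sum(counts)), by = cat],
                use.names = TRUE)[order(cat)]
})

t_per_group[["elapsed"]]
t_once[["elapsed"]]
```

Both variants should produce the same rows in the same order (only the column order differs), so any timing gap reflects the per-group overhead rather than a difference in the result.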

0

You said efficient, so... this one has two 'by's; the unique is probably vectorized, and data.table's by should compile down to a C for loop.

> dt1[, .SD 
     ][, ca := sum(.SD[cat == 'A', counts]) 
     ][, cc := sum(counts), cat 
     ][, cd := ca - cc 
     ][, rbind(.SD, unique(.SD, by=c('cat'))[, `:=`(ind=0)]) 
     ][ind == 0, counts := cd 
     ][, .(cat, ind, counts) 
     ][order(cat, ind) 
     ] 

    cat ind counts
 1:   A   0      0
 2:   A   1     10
 3:   A   2      9
 4:   A   3      8
 5:   B   0     14
 6:   B   4      7
 7:   B   5      6
 8:   C   0     18
 9:   C   6      5
10:   C   7      4
11:   D   0     24
12:   D   8      3