2014-12-31

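New behavior in data.table? .N with 'by' (computing proportions)

I updated from a fairly recent earlier release (1.8.X, I think) to the latest version of data.table, 1.9.4, and now I'm seeing some unexpected behavior.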

library(data.table)

set.seed(12312014)

# a vector of letters a:e, each repeated between 1 and 10 times
type <- unlist(mapply(rep, letters[1:5], round(runif(5, 1, 10), 0)))

# a random vector of 3 categories
category <- sample(c('small', 'med', 'large'), length(type), replace=TRUE)
my_dt <- data.table(type, category)

Say I want the proportion of each category within each type. I used to do it like this:

my_dt[, type_n:=.N, by=type] 
my_dt[, .N/type_n, by=.(type, category)][order(type, category)] 

What I get with data.table 1.9.4:

#     type category        V1
#  1:    a    large 0.2500000
#  2:    a    large 0.2500000
#  3:    a      med 0.2500000
#  4:    a      med 0.2500000
#  5:    a    small 0.5000000
#  6:    a    small 0.5000000
#  7:    a    small 0.5000000
#  8:    a    small 0.5000000
#  9:    b    large 0.4285714
# 10:    b    large 0.4285714
# 11:    b    large 0.4285714
# 12:    b      med 0.4285714
# (...and so on, 42 rows long)

But what I used to get, I'm almost certain, was this (the simple proportion of each category within each type):

#     type category        V1
#  1:    a    large 0.2500000
#  2:    a      med 0.2500000
#  3:    a    small 0.5000000
#  4:    b    large 0.4285714
#  5:    b      med 0.4285714
#  6:    b    small 0.1428571
#  7:    c    large 0.3000000
#  8:    c      med 0.1000000
#  9:    c    small 0.6000000
# 10:    d    large 0.2222222
# 11:    d      med 0.6666667
# 12:    d    small 0.1111111
# 13:    e    large 0.3750000
# 14:    e      med 0.3750000
# 15:    e    small 0.2500000

I can get this desired result with:

unique(my_dt[, .N/type_n, by=.(type, category)][order(type, category)]) 

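...but I'd like to know whether there is a preferred way to do this in the new data.table syntax. I know I could also use prop.table, but I want the result in long format.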

prop.table(table(my_dt), margin=1) 
#     category
# type     large       med     small
#    a 0.2500000 0.2500000 0.5000000
#    b 0.4285714 0.4285714 0.1428571
#    c 0.3000000 0.1000000 0.6000000
#    d 0.2222222 0.6666667 0.1111111
#    e 0.3750000 0.3750000 0.2500000

FYI, my call to sessionInfo gives:

R version 3.1.1 (2014-07-10) 
Platform: x86_64-apple-darwin13.1.0 (64-bit) 

locale: 
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 

attached base packages: 
[1] stats  graphics grDevices utils  datasets methods base  

other attached packages: 
[1] ggplot2_1.0.0 data.table_1.9.4 

loaded via a namespace (and not attached): 
[1] chron_2.3-45  colorspace_1.2-4 digest_0.6.4  grid_3.1.1  gtable_0.1.2  labeling_0.2  
[7] MASS_7.3-33  munsell_0.4.2 plyr_1.8.1  proto_0.3-10  Rcpp_0.11.2  reshape2_1.4  
[13] scales_0.2.4  stringr_0.6.2 tools_3.1.1  

So which of these results do you actually want?

Not an answer to your question, but if you're happy with 'prop.table' and only want a long format, you could also use 'data.table(prop.table(table(my_dt), margin = 1))'. – A5C1D2H2I1M1N2O1R2T1

Or 'my_dt[, prop.table(table(category)), by = type]'
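
For the prop.table route mentioned in the comments, here is a minimal sketch of getting a long layout. It assumes a data.table version that provides an `as.data.table` method for `table` objects (which names the value column `N`), and it uses a small hand-made table rather than the question's random data; the names `dt`, `wide`, and `long` are made up for illustration.

```r
library(data.table)

# Hypothetical toy data, not the question's random sample
dt <- data.table(
  type     = c("a", "a", "a", "a", "b", "b"),
  category = c("small", "small", "med", "large", "med", "med")
)

# Wide layout: one row per type, one column per category, rows sum to 1
wide <- prop.table(table(dt), margin = 1)

# Long layout: one row per (type, category) pair; the values land in column N
long <- as.data.table(wide)
setnames(long, "N", "prop")
long[order(type, category)]
```

Note that this keeps every type/category combination, so pairs that never occur show up with a proportion of 0.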

Answers

You could try:

my_dt[, .N, by=.(type,category)][, prop:=N/sum(N), by=type][] 

    type category N      prop
 1:    a    small 4 0.5000000
 2:    a      med 2 0.2500000
 3:    a    large 2 0.2500000
 4:    b      med 3 0.4285714
 5:    b    large 3 0.4285714
 6:    b    small 1 0.1428571
 7:    c    large 3 0.3000000
 8:    c    small 6 0.6000000
 9:    c      med 1 0.1000000
10:    d      med 6 0.6666667
11:    d    large 2 0.2222222
12:    d    small 1 0.1111111
13:    e    small 2 0.2500000
14:    e      med 3 0.3750000
15:    e    large 3 0.3750000
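
As a rough illustration of why this works where '.N/type_n' returns repeated rows: within each by-group, 'type_n' is a vector with one element per row of the group, so dividing '.N' by it gives one value per input row. Counting first and normalising afterwards gives exactly one row per pair. The sketch below uses a small hand-made table, not the question's data; the names `dt`, `per_row`, `props`, and `wide` are made up for illustration.

```r
library(data.table)

# Hypothetical toy data, not the question's random sample
dt <- data.table(
  type     = c("a", "a", "a", "a", "b", "b"),
  category = c("small", "small", "med", "large", "med", "med")
)

# type_n holds one value per row, so .N / type_n is a vector of length .N
# in every (type, category) group: one output row per input row
dt[, type_n := .N, by = type]
per_row <- dt[, .N / type_n, by = .(type, category)]
nrow(per_row) == nrow(dt)

# Count each (type, category) pair once, then normalise within type
props <- dt[, .N, by = .(type, category)]
props[, prop := N / sum(N), by = type]
props <- props[order(type, category)]

# Cross-check one cell against prop.table
wide <- prop.table(table(dt$type, dt$category), margin = 1)
all.equal(props[type == "a" & category == "small", prop], wide[["a", "small"]])
```

One difference from prop.table: pairs that never occur are simply absent from the long result instead of appearing with a proportion of 0.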