2017-06-05

How to calculate the frequency/table of a categorical variable by group with R data.table?

I have the following data.table in R:

library(data.table) 
dt = data.table(ID = c("person1", "person1", "person1", "person2", "person2", "person2", "person2", "person2", ...), category = c("red", "red", "blue", "red", "red", "blue", "green", "green", ...)) 

dt 
ID   category 
person1 red 
person1 red 
person1 blue 
person2 red 
person2 red 
person2 blue 
person2 green 
person2 green 
person3 blue 
.... 

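I am looking for a way to compute the "frequency" of each value of the categorical variable (red, blue, green) for every unique ID, and then expand those values into columns that record each count. The resulting data.table should look like this: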

dt 
ID  red blue green 
person1 2  1  0 
person2 2  1  2  
... 

I thought (incorrectly) that the right way to start with data.table would be to compute table() by group, e.g.

dt[, counts :=table(category), by=ID] 

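However, this appears to count the total number of categorical values per group ID. It also doesn't solve my problem of "expanding" the data.table into one column per category.

What is the correct way to do this?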

Answers

1

Something like this?

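library(data.table) 
library(dplyr)  # attached only for the %>% pipe 
# Count rows per (ID, category), then spread the categories into columns. 
# ID/category combinations that never occur come out as NA. 
dt[, .N, by = .(ID, category)] %>% dcast(ID ~ category, value.var = "N") 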

If you want to add these columns to the original data.table:

counts <- dt[, .N, by = .(ID, category)] %>% dcast(ID ~ category) 
counts[is.na(counts)] <- 0 
output <- merge(dt, counts, by = "ID") 

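This works! One question (since I'm not familiar with 'dplyr'): suppose the original 'dt' has several more columns. What should I do if I want to keep another column? Currently, 'dcast(ID ~ category)' produces a data.table with only ID and the categories (as in my example). – ShanZhengYang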

See my edit. You can merge the tabulated counts back into the original data. –
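To make the merge suggestion concrete, here is a minimal sketch of keeping other columns. It assumes a hypothetical extra column `score` (not in the original question) and uses `dcast` with `fun.aggregate = length` instead of the pipe, so only data.table is needed.

```r
library(data.table)

# Sample data from the question, plus a hypothetical extra column `score`
dt <- data.table(
  ID       = c("person1", "person1", "person1",
               "person2", "person2", "person2", "person2", "person2"),
  category = c("red", "red", "blue", "red", "red", "blue", "green", "green"),
  score    = 1:8  # assumption: stands in for any other column you want to keep
)

# One row per ID, one column per category; length() counts the rows in each cell
counts <- dcast(dt, ID ~ category, fun.aggregate = length, value.var = "category")

# Join the counts back; every original row and column is kept
output <- merge(dt, counts, by = "ID")
output
```

Each original row then carries the per-ID counts alongside `score`, so nothing from `dt` is lost.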

1

This is written in an imperative style; there is probably a cleaner, more functional way to do it.

library(data.table) 
library(dplyr)  # attach dplyr for %>% and filter() 
dt = data.table(ID = c("person1", "person1", "person1", "person2", "person2", "person2", "person2", "person2"), 
                category = c("red", "red", "blue", "red", "red", "blue", "green", "green")) 

# One row per ID, one column per category 
ids <- unique(dt$ID) 
categories <- unique(dt$category) 
counts <- matrix(nrow=length(ids), ncol=length(categories)) 
rownames(counts) <- ids 
colnames(counts) <- categories 

# Fill each cell with the number of rows matching that ID and category 
for (i in seq_along(ids)) { 
    for (j in seq_along(categories)) { 
        count <- dt %>% 
            filter(ID == ids[i], category == categories[j]) %>% 
            nrow() 

        counts[i, j] <- count 
    } 
} 

Then:

>counts 
##   red blue green 
##person1 2 1  0 
##person2 2 1  2 
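
As a sketch of the more functional way hinted at above, the two nested loops can be replaced by nested `sapply` calls. This is one possible rewrite, not the answerer's own code, and it assumes there are at least two IDs (with a single ID, `sapply` would simplify to a vector instead of a matrix).

```r
library(data.table)

dt <- data.table(
  ID       = c("person1", "person1", "person1",
               "person2", "person2", "person2", "person2", "person2"),
  category = c("red", "red", "blue", "red", "red", "blue", "green", "green")
)

ids <- unique(dt$ID)
categories <- unique(dt$category)

# Outer sapply builds one column per category, inner sapply one entry per ID;
# sum() over the logical vector counts the matching rows
counts <- sapply(categories, function(cat) {
  sapply(ids, function(id) sum(dt$ID == id & dt$category == cat))
})
counts
```

The result is a matrix with IDs as row names and categories as column names, the same shape as the loop version.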
1

You can do this in one line with the reshape2 library.

library(reshape2) 
dcast(data=dt, 
     ID ~ category, 
     fun.aggregate = length, 
     value.var = "category") 

     ID blue green red 
1 person1 1  0 2 
2 person2 1  2 2 

Alternatively, if you only need a simple two-way table, you can use R's built-in table function.

table(dt$ID,dt$category)
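
If the `table` result needs to become a wide data.table again, something like the following should work. It is a sketch using the question's sample data; `tab`, `wide`, and `wide_dt` are illustrative names.

```r
library(data.table)

dt <- data.table(
  ID       = c("person1", "person1", "person1",
               "person2", "person2", "person2", "person2", "person2"),
  category = c("red", "red", "blue", "red", "red", "blue", "green", "green")
)

# Two-way contingency table: IDs in rows, categories in columns
tab <- table(dt$ID, dt$category)

# Keep the wide layout (plain as.data.frame() on a table would give long format)
wide <- as.data.frame.matrix(tab)

# Move the row names into a proper ID column
wide_dt <- as.data.table(wide, keep.rownames = "ID")
wide_dt
```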