2013-05-20
2

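Hello, I have the following data.frame (appended below). I want to add an extra column of normalized counts, N.norm = N/sum(N). I had a previous data.frame without a Date column and was able to do this in R using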

oo[, N.norm := N/sum(N), by=Operator]

I tried to add Date to the by argument with

oo[, N.norm := N/sum(N), by=Operator,Date] 

but I get an error message:

Error in `[.data.frame`(oo, , `:=`(N.norm, N/sum(N)), by = Operator, Date) : 
    unused argument(s) (by = Operator) 

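For example, for operator "A" in the month "Jan 2013", I have a count N for each ROI_Score = c("Crap", "Good", "OK", "Poor"). I want to sum N over that combination (A and Jan 2013) and divide each count N by sum(N).

On another note, can anyone point me to a decent introduction to manipulating data.frames in R?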

structure(list(Operator = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("A", 
"D", "J", "L", "M"), class = "factor"), ROI_Score = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 
4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 
3L, 3L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 
4L, 4L, 4L), .Label = c("Crap", "Good", "OK", "Poor"), class = "factor"), 
    Date = c("Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013"), N = c(0, 0, 0, 0, 0, 1, 2, 15, 1, 5, 3, 2, 3, 
    1, 0, 3, 0, 5, 5, 1, 0, 0, 0, 1, 0, 14, 17, 16, 8, 7, 5, 
    10, 6, 1, 5, 24, 27, 31, 16, 15, 0, 0, 0, 0, 0, 26, 24, 20, 
    11, 18, 3, 4, 17, 3, 2, 20, 36, 12, 21, 9, 0, 0, 0, 0, 0, 
    3, 12, 5, 12, 4, 0, 0, 3, 4, 0, 29, 37, 41, 25, 10, 0, 0, 
    0, 0, 0, 9, 9, 15, 17, 3, 6, 4, 5, 4, 1, 14, 13, 9, 15, 9 
    )), .Names = c("Operator", "ROI_Score", "Date", "N"), row.names = c(NA, 
100L), class = "data.frame") 

I'm not sure whether the data is in data.frame or data.table format. Here is my code, adapted from the solution given by Arun (reshape/remould data frame to create normalized bar chart and pie chart):

library(data.table)
df <- read.csv("/misc/jaguar_data/report/system/db_fs/roi_scores.csv")
# Get the date into a nice structure for faceting
df$Date <- strftime(strptime(df$Date, format = "%d/%m/%Y"), "%b %Y")
dt <- data.table(df)
ops <- as.character(unique(dt$Operator)) 
scr <- as.character(unique(dt$ROI_Score)) 
dts <- unique(dt$Date) 

oo <- setkey(dt[, .N, by="Operator,ROI_Score,Date"], Operator, 
ROI_Score,Date)[CJ(ops, scr,dts)][is.na(N), N:= 0L] 

oo[, N.norm := N/sum(N), by=Operator] 
+2

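About this extra column: should N.norm for row i be N[i]/sum(N[1...i]), but aggregated by operator and date? Do you really mean `data.table` rather than `data.frame`? The `:=` operator only works on a `data.table`. Please clarify which structure you are using: you gave us a data frame. –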

+0

@BryanHanson - I'm not sure. I've updated my question to explain how I arrived at the data structure oo. It started out as a data.frame, but I think it is now a data.table – moadeep

+0

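You are definitely using a `data.table`; your own code makes that clear (you start with a `data.frame` but then convert it to a `data.table`). These are typically used when the data set is very large and speed is critical. Otherwise a `data.frame` is usually fine. What are you trying to compute? –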

Answers

4

Your code is (almost) perfect. There are two slight issues.

1: You are using data.table syntax, so oo should be a data.table rather than a data.frame. Simply use:

library(data.table) 
oo <- data.table(oo) 

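2: When using by with more than one column, be sure to wrap the columns in list(..) or pass them as a single comma-separated string. For example: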

oo[, N.norm := N/sum(N), by=list(Operator,Date)] 

# - or - # 
oo[, N.norm := N/sum(N), by="Operator,Date"] 
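As a quick sanity check on the grouped form, here is a minimal sketch on a toy table. The object name `toy` and its values are made up for illustration and are not the asker's data. After normalizing within each Operator/Date group, the shares in every group should add up to 1.

library(data.table)

# Toy data (hypothetical values, not the data set from the question)
toy <- data.table(
  Operator  = rep(c("A", "D"), each = 4),
  Date      = rep(c("Jan 2013", "Jan 2013", "Feb 2013", "Feb 2013"), times = 2),
  ROI_Score = rep(c("Good", "Poor"), times = 4),
  N         = c(15, 5, 2, 0, 16, 31, 17, 27)
)

# Normalize N within each Operator/Date group (adds the column by reference)
toy[, N.norm := N / sum(N), by = list(Operator, Date)]

# Per-group totals of the normalized column; each should be 1
group_totals <- toy[, list(total = sum(N.norm)), by = list(Operator, Date)]
print(group_totals)

Note that a group whose counts are all zero would give NaN (0/0), so such groups may need separate handling.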

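Edit: If you want each count divided by the total of its Operator/Date group, then your code should be as above. If instead you want to divide by the total over the whole data set, then use

oo[, N.norm := N/sum(oo$N), by=list(Operator,Date)]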

Fixing those two things and using everything else exactly as you had it gives:

     Operator ROI_Score     Date  N    N.norm
  1:        A      Crap Apr 2013  0 0.0000000
  2:        A      Crap Feb 2013  0 0.0000000
  3:        A      Crap Jan 2013  0 0.0000000
  4:        A      Crap Mar 2013  0 0.0000000
  5:        A      Crap May 2013  0 0.0000000
 ---
 96:        M      Poor Apr 2013 14 0.4827586
 97:        M      Poor Feb 2013 13 0.5000000
 98:        M      Poor Jan 2013  9 0.3103448
 99:        M      Poor Mar 2013 15 0.4166667
100:        M      Poor May 2013  9 0.6923077

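Edit 2:

Just a note. In general, if you are using expressions inside the [brackets], and especially the assign-by-reference operator :=, then your object should be a data.table.

If you see an error such as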

Error in `[.data.frame`(_<your object name>_, ... 

then it is probably because either (a) your object is not a data.table or (b) you forgot to load the data.table package.
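A minimal way to check which case applies, sketched on a throwaway object (the names `df` and `dt` here are illustrative only):

library(data.table)

df <- data.frame(x = 1:3)        # a plain data.frame (toy example)
df_is_dt <- is.data.table(df)    # FALSE: := would fail on this object

dt <- as.data.table(df)          # convert a copy to a data.table
dt_is_dt <- is.data.table(dt)    # TRUE: := is now available

dt[, y := x * 2]                 # assignment by reference works on dt
print(class(dt))                 # "data.table" "data.frame"

Because a data.table also inherits from data.frame, code written for a data.frame generally still accepts it after conversion.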

+0

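Thank you very much. I knew it had to be a simple tweak to the code I already had – moadeep

+1

@moadeep, no problem. Please see the edit note at the bottom of the answer –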

1

I don't think you can do what you want with this data set. Here's why:

install.packages("plyr") 
library("plyr") 
str(tmp) # this is your data 
count(tmp, vars = c("Operator", "ROI_Score")) 

gives this:

Operator ROI_Score freq 
1   A  Crap 5 
2   A  Good 5 
3   A  OK 5 
4   A  Poor 5 
5   D  Crap 5 
6   D  Good 5 
7   D  OK 5 
8   D  Poor 5 
9   J  Crap 5 
10  J  Good 5 
11  J  OK 5 
12  J  Poor 5 
13  L  Crap 5 
14  L  Good 5 
15  L  OK 5 
16  L  Poor 5 
17  M  Crap 5 
18  M  Good 5 
19  M  OK 5 
20  M  Poor 5 

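And including Date makes every combination unique, so they all have a count of 1.

Working with the counts in the data.frame, you should in principle be able to get what you want with: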

ans <- aggregate(N ~ Operator + ROI_Score + Date, data = tmp, FUN = sum) 

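Then change the function to do what you want (divide by 100, the number of entries?). But I'm not sure this is what you want.

Edit

Since you want the percentage of each rating category by operator and date, I would subset first and then aggregate (shown here for operator "A", summed over all dates):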

tmp2 <- subset(tmp, Operator == "A") 
ans2 <- aggregate(N ~ ROI_Score, data = tmp2, FUN = sum) 
ans2$N.norm <- ans2$N/sum(ans2$N) 

gives:

ROI_Score N N.norm 
1  Crap 0 0.0000000 
2  Good 24 0.5106383 
3  OK 9 0.1914894 
4  Poor 14 0.2978723 
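If subsetting one operator at a time becomes tedious, one base R alternative is `ave()`, which applies a function within groups and returns a vector aligned with the original rows. This is only a sketch on toy data: the data frame `toy` and its values are made up for illustration, and this is not the approach used in the answer above.

# Toy data frame (hypothetical values); substitute your own data frame
toy <- data.frame(
  Operator  = rep(c("A", "D"), each = 4),
  Date      = rep(c("Jan 2013", "Jan 2013", "Feb 2013", "Feb 2013"), times = 2),
  ROI_Score = rep(c("Good", "Poor"), times = 4),
  N         = c(15, 5, 2, 0, 16, 31, 17, 27),
  stringsAsFactors = FALSE
)

# Divide each count by the total of its Operator/Date group
toy$N.norm <- ave(toy$N, toy$Operator, toy$Date,
                  FUN = function(x) x / sum(x))

# Per-group totals of the normalized column; each should be 1
group_totals <- tapply(toy$N.norm, list(toy$Operator, toy$Date), sum)
print(group_totals)

This handles every operator and date in one pass, at the cost of being a little less explicit than subset() followed by aggregate().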
+0

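It's not quite what I need, but I appreciate your help. In the example above, there are 4 possible scores for each operator and month. If each frequency is 5, the total is 5 + 5 + 5 + 5 = 20, so the percentages for that operator and month are Crap: 25%, Good: 25%, OK: 25%, Poor: 25% – moadeep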

+0

See my edit, which uses a different approach. –

+0

Very good. Thank you very much for your time and patience – moadeep