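R data.table: How to optimize the calculation of differences between two data tables for each corresponding group?
I have a lot of booking data (millions of rows) and want to calculate the change in booking amounts (difference = subtraction) between the same groups for different years, which are stored in two separate data tables.
I can do this with the great data.table package using the code shown below, but how can the code be optimized (in terms of performance and memory consumption)? I am copying data (tables), and there are several calculation steps that could perhaps be done in one go.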
# Calculate value differences for the same group of data in two different data.tables
cur <- data.table(company=c("A", "B", "New"), booking.date=seq(from=as.Date("2011/01/01"), by="week", length.out=12), sales.amount = 201:212, vat.amount = 11:22)
cur
prev <- data.table(company=c("A", "B"), booking.date=seq(from=as.Date("2010/01/01"), by="month", length.out=10), sales.amount = 101:110, vat.amount = 1:10)
prev
diff <- copy(prev) # copy to keep the original data.table unchanged
diff[, `:=`(sales.amount = -sales.amount, vat.amount = -vat.amount)] # negate the amounts so that the sum will be the difference
diff <- rbind(diff, cur) # combine negative previous amounts with positive current amounts so that the sum will be difference
diff # show raw data
diff[, .(last.booking.date=max(booking.date), sales.amount.diff=sum(sales.amount), vat.amount.diff=sum(vat.amount)), by=company] # calculate the difference
# Look at company "A" to verify the result:
cur[company=="A",]
prev[company=="A",]
The sample data and the expected output look like this:
Data table 1: bookings of the current year:
> cur
company booking.date sales.amount vat.amount
1: A 2011-01-01 201 11
2: B 2011-01-08 202 12
3: New 2011-01-15 203 13
4: A 2011-01-22 204 14
5: B 2011-01-29 205 15
6: New 2011-02-05 206 16
7: A 2011-02-12 207 17
8: B 2011-02-19 208 18
9: New 2011-02-26 209 19
10: A 2011-03-05 210 20
11: B 2011-03-12 211 21
12: New 2011-03-19 212 22
Data table 2: bookings of the previous year:
> prev
company booking.date sales.amount vat.amount
1: A 2010-01-01 101 1
2: B 2010-02-01 102 2
3: A 2010-03-01 103 3
4: B 2010-04-01 104 4
5: A 2010-05-01 105 5
6: B 2010-06-01 106 6
7: A 2010-07-01 107 7
8: B 2010-08-01 108 8
9: A 2010-09-01 109 9
10: B 2010-10-01 110 10
Expected result (per company: the current year's booking totals minus the previous year's):
company last.booking.date sales.amount.diff vat.amount.diff
1: A 2011-03-05 297 37
2: B 2011-03-12 296 36
3: New 2011-03-19 830 70
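One way to reduce copying, sketched here under assumptions: aggregate each table per company first, so that only the small per-company summaries are negated and combined, not the full row-level tables. The names `cur_co` and `prev_co` are illustrative, and whether this is actually faster or leaner on millions of rows would have to be measured on the real data.

```r
library(data.table)

# Sample data from the question
cur <- data.table(company = c("A", "B", "New"),
                  booking.date = seq(from = as.Date("2011/01/01"), by = "week", length.out = 12),
                  sales.amount = 201:212, vat.amount = 11:22)
prev <- data.table(company = c("A", "B"),
                   booking.date = seq(from = as.Date("2010/01/01"), by = "month", length.out = 10),
                   sales.amount = 101:110, vat.amount = 1:10)

# Aggregate the current year per company (one row per company)
cur_co <- cur[, .(last.booking.date = max(booking.date),
                  sales.amount = sum(sales.amount),
                  vat.amount = sum(vat.amount)), by = company]

# Aggregate the previous year per company and negate the totals,
# so that summing both summaries yields current minus previous
prev_co <- prev[, .(last.booking.date = max(booking.date),
                    sales.amount = -sum(sales.amount),
                    vat.amount = -sum(vat.amount)), by = company]

# Stack the two small summaries and sum once more per company
res <- rbindlist(list(cur_co, prev_co))[
  , .(last.booking.date = max(last.booking.date),
      sales.amount.diff = sum(sales.amount),
      vat.amount.diff = sum(vat.amount)), by = company]
res
```

Neither `cur` nor `prev` is modified, so no `copy()` is needed. The two summaries are still new objects, but they hold one row per company instead of one row per booking. A company that appears only in `prev` is kept, with a negative difference, just as in the original `rbind` approach.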
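Nice alternative! Two notes: 1) this only works when you have exactly two data tables; 2) the 'cur_co' and 'prev_co' data tables are new copies and therefore get new memory addresses – Jaap
1) True, but that is what the OP asked for. Your solution is more flexible, though. 2) Correct, these are copies, but they are probably much smaller because of the aggregation. Does 'rbindlist()' avoid allocating new memory? –
@Christian Borck Thx for your answer, aggregating first is a good approach. I guess 'rbindlist' will always allocate new memory, because it has to increase the length of the column vectors significantly. –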
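Guesses like these can be checked with data.table's `address()`, which returns the memory address of an object. A small sketch with a made-up table `dt` (not from the thread); it only inspects the address of the table object itself, not of its individual columns:

```r
library(data.table)

# Hypothetical toy table, used only to illustrate the check
dt <- data.table(company = c("A", "B"), sales.amount = c(101L, 102L))
addr_before <- address(dt)

dt[, sales.amount := -sales.amount]   # update by reference via `:=`
addr_after <- address(dt)

dt_copy  <- copy(dt)                  # explicit deep copy
combined <- rbindlist(list(dt, dt))   # builds a new, longer table

# Compare the table addresses
addr_before == addr_after             # is dt still the same object after `:=`?
address(dt_copy) == addr_before       # does copy() share dt's address?
address(combined) == addr_before      # does rbindlist() reuse dt's address?
```

For a finer-grained view, base R's `tracemem()` reports when an object is duplicated, provided the R build has memory profiling enabled.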