2016-02-20 19 views
1

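R data.table: How to optimize the calculation of differences between two data tables for each corresponding group?

I have a lot of booking data (millions of rows) and want to calculate the change (difference = subtraction) in the booked amounts between the same groups in different years, which are stored in two separate data tables.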

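I can do this with the great data.table package using the code shown below, but how can the code be optimized with regard to performance and memory consumption? I am copying data (tables) and use several calculation steps that could probably be done in one go.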

# Calculate value differences for the same group of data in two different data.tables
library(data.table)
cur <- data.table(company=c("A", "B", "New"), booking.date=seq(from=as.Date("2011/01/01"), by="week", length.out=12), sales.amount = 201:212, vat.amount = 11:22) 
cur 

prev <- data.table(company=c("A", "B"), booking.date=seq(from=as.Date("2010/01/01"), by="month", length.out=10), sales.amount = 101:110, vat.amount = 1:10) 
prev 

diff <- copy(prev) # copy to keep the original data.table unchanged 
diff[, `:=`(sales.amount = -sales.amount, vat.amount = -vat.amount)] # negate the amounts so that the sum will be the difference 
diff <- rbind(diff, cur) # combine negative previous amounts with positive current amounts so that the sum will be difference 
diff # show raw data 
diff[, .(last.booking.date=max(booking.date), sales.amount.diff=sum(sales.amount), vat.amount.diff=sum(vat.amount)), by=company] # calculate the difference 

# Look at company "A" to verify the result: 
cur[company=="A",] 
prev[company=="A",] 

The example data and the expected output look like this:

Data table 1: bookings of the current year:

> cur
    company booking.date sales.amount vat.amount
 1:       A   2011-01-01          201         11
 2:       B   2011-01-08          202         12
 3:     New   2011-01-15          203         13
 4:       A   2011-01-22          204         14
 5:       B   2011-01-29          205         15
 6:     New   2011-02-05          206         16
 7:       A   2011-02-12          207         17
 8:       B   2011-02-19          208         18
 9:     New   2011-02-26          209         19
10:       A   2011-03-05          210         20
11:       B   2011-03-12          211         21
12:     New   2011-03-19          212         22

Data table 2: bookings of the previous year:

> prev
    company booking.date sales.amount vat.amount
 1:       A   2010-01-01          101          1
 2:       B   2010-02-01          102          2
 3:       A   2010-03-01          103          3
 4:       B   2010-04-01          104          4
 5:       A   2010-05-01          105          5
 6:       B   2010-06-01          106          6
 7:       A   2010-07-01          107          7
 8:       B   2010-08-01          108          8
 9:       A   2010-09-01          109          9
10:       B   2010-10-01          110         10

Expected result (difference between the per-year booking sums for each company):

   company last.booking.date sales.amount.diff vat.amount.diff
1:       A        2011-03-05               297              37
2:       B        2011-03-12               296              36
3:     New        2011-03-19               830              70

Answers

5

An alternative to @Jaap's nice approach that does not bind the original tables together could be:

# aggregate tables by company 
cur_co <- cur[, .(last.booking.date = max(booking.date), 
        sales.amount = sum(sales.amount), 
        vat.amount = sum(vat.amount)), 
       by=company] 

prev_co <- prev[, .(sales.amount = sum(sales.amount), 
        vat.amount = sum(vat.amount)), 
       by=company] 


# join & get difference 
cur_co[prev_co, c("sales.amount.diff", "vat.amount.diff") := 
      .(sales.amount - i.sales.amount, vat.amount - i.vat.amount), 
     on="company"] 

# fill NA's (companies missing in previous year) 
cur_co[is.na(sales.amount.diff), 
     c("sales.amount.diff", "vat.amount.diff") := 
      .(sales.amount, vat.amount)] 

# drop unused columns 
cur_co[, c("sales.amount", "vat.amount") := NULL] 

This gives exactly the same output:

   company last.booking.date sales.amount.diff vat.amount.diff
1:       A        2011-03-05               297              37
2:       B        2011-03-12               296              36
3:     New        2011-03-19               830              70
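
One caveat of the update join: it only updates rows of 'cur_co', so a company that has bookings only in the previous year is silently dropped. A minimal sketch of a full outer merge that keeps such companies, assuming a missing year should count as 0. The company "Gone", the table 'prev2', the helper 'na0' and the '.cur'/'.prev' column names are made up for illustration:

```r
library(data.table)

# Example data from the question
cur <- data.table(company = c("A", "B", "New"),
                  booking.date = seq(from = as.Date("2011/01/01"), by = "week", length.out = 12),
                  sales.amount = 201:212, vat.amount = 11:22)
prev <- data.table(company = c("A", "B"),
                   booking.date = seq(from = as.Date("2010/01/01"), by = "month", length.out = 10),
                   sales.amount = 101:110, vat.amount = 1:10)

# Hypothetical extra row: a company that only has bookings in the previous year
prev2 <- rbind(prev, data.table(company = "Gone", booking.date = as.Date("2010-11-01"),
                                sales.amount = 50L, vat.amount = 5L))

# Aggregate each table by company first (small intermediate tables)
cur_co <- cur[, .(last.booking.date = max(booking.date),
                  sales.cur = sum(sales.amount),
                  vat.cur = sum(vat.amount)),
              by = company]
prev_co <- prev2[, .(sales.prev = sum(sales.amount),
                     vat.prev = sum(vat.amount)),
                 by = company]

# Full outer join: keeps companies that appear in only one of the two tables
res <- merge(cur_co, prev_co, by = "company", all = TRUE)

# Assumption: a year without bookings counts as 0
na0 <- function(x) { x[is.na(x)] <- 0L; x }

res[, `:=`(sales.amount.diff = na0(sales.cur) - na0(sales.prev),
           vat.amount.diff = na0(vat.cur) - na0(vat.prev))]
res[, c("sales.cur", "vat.cur", "sales.prev", "vat.prev") := NULL]
res[]
```

For "Gone" the differences are negative and 'last.booking.date' is NA, because there is no booking in the current year.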
+0

Nice alternative! Two notes: 1) this only works when you have two data tables; 2) the 'cur_co' and 'prev_co' data tables are new copies and therefore get a new memory address – Jaap

+0

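1) True, but that is what the OP asked for. Your solution is more flexible, though. 2) Correct, these are copies, but they are probably much smaller because of the aggregation. Does 'rbindlist()' avoid allocating new memory? –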

+0

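@Christian Borck Thx for your answer, aggregating first is a good approach. I guess 'rbindlist' will always allocate new memory, since it has to increase the vector size of the columns significantly. –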

4

Probably the easiest way is to bind the original data tables together and then do the calculation:

# bind the data.table's together into one 
dt.all <- rbindlist(list(cur,prev)) 
# set the key to 'company' and 'booking.date' 
# the data.table is now also ordered by these two columns 
setkey(dt.all, company, booking.date) 

dt.all[, .(last.booking.date = booking.date[.N], 
      sales.amount.diff = sum(sales.amount[year(booking.date)==2011]) - sum(sales.amount[year(booking.date)==2010]), 
      vat.amount.diff = sum(vat.amount[year(booking.date)==2011]) - sum(vat.amount[year(booking.date)==2010])), 
     company] 

which gives:

   company last.booking.date sales.amount.diff vat.amount.diff
1:       A        2011-03-05               297              37
2:       B        2011-03-12               296              36
3:     New        2011-03-19               830              70
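
If filtering on the calendar year is not convenient, the origin of each row can be tagged while binding instead. A minimal sketch using the 'idcol' argument of 'rbindlist'; the column names 'period' and 'sgn' are made up for illustration. It also avoids the explicit copy and negation from the question:

```r
library(data.table)

# Example data from the question
cur <- data.table(company = c("A", "B", "New"),
                  booking.date = seq(from = as.Date("2011/01/01"), by = "week", length.out = 12),
                  sales.amount = 201:212, vat.amount = 11:22)
prev <- data.table(company = c("A", "B"),
                   booking.date = seq(from = as.Date("2010/01/01"), by = "month", length.out = 10),
                   sales.amount = 101:110, vat.amount = 1:10)

# idcol adds a column holding the list names ("cur" / "prev") for each row
dt.all <- rbindlist(list(cur = cur, prev = prev), idcol = "period")

# +1 for current-year rows, -1 for previous-year rows
dt.all[, sgn := ifelse(period == "cur", 1L, -1L)]

# the signed sum per company is the difference between the two years
res <- dt.all[, .(last.booking.date = max(booking.date),
                  sales.amount.diff = sum(sgn * sales.amount),
                  vat.amount.diff = sum(sgn * vat.amount)),
              by = company]
res
```

This still allocates one combined table, so it trades memory for a single grouping step.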

When you have multiple years, a better approach might be:

dt.all[, .(last.booking.date = booking.date[.N], 
      sum.sales = sum(sales.amount), 
      sum.vat = sum(vat.amount)), 
     .(company, year(booking.date)) 
     ][, `:=` (last.booking.date = last.booking.date[.N], 
       sales.amount.diff = sum.sales - shift(sum.sales), 
       vat.amount.diff = sum.vat - shift(sum.vat)), 
     company][] 

This gives:

   company year last.booking.date sum.sales sum.vat sales.amount.diff vat.amount.diff
1:       A 2010        2011-03-05       525      25                NA              NA
2:       A 2011        2011-03-05       822      62               297              37
3:       B 2010        2011-03-12       530      30                NA              NA
4:       B 2011        2011-03-12       826      66               296              36
5:     New 2011        2011-03-19       830      70                NA              NA

Adding the argument 'fill = 0' to 'shift' results in:

   company year last.booking.date sum.sales sum.vat sales.amount.diff vat.amount.diff
1:       A 2010        2011-03-05       525      25               525              25
2:       A 2011        2011-03-05       822      62               297              37
3:       B 2010        2011-03-12       530      30               530              30
4:       B 2011        2011-03-12       826      66               296              36
5:     New 2011        2011-03-19       830      70               830              70
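
To see where the NA values and the effect of 'fill' come from, here is a small sketch of 'shift' on the two yearly sums of company "A" (the variable names are made up for illustration):

```r
library(data.table)

# yearly sales sums of company "A" (2010, 2011), taken from the output above
sum.sales <- c(525L, 822L)

# default: lag by one position, the first element has no predecessor and becomes NA
lagged <- shift(sum.sales)

# with fill = 0L the missing predecessor is treated as 0 instead
lagged0 <- shift(sum.sales, fill = 0L)

diff.na <- sum.sales - lagged    # NA 297
diff.0  <- sum.sales - lagged0   # 525 297
```

With 'fill = 0' the first year of every company shows its full sum as the difference, which is what the "New" row needs, but it also turns the first-year rows of "A" and "B" into 525 and 530.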
+0

Binding the original data tables together seems to be a necessary prerequisite to me too! I did not know 'shift', thanks for your answer, I will try it! –

+0

@RYoda Let me know whether it works. 'shift' is *data.table*'s fast implementation of 'lag'/'lead' functionality. For more details, see the documentation of 'shift'. – Jaap

+0

Regarding your first line: 'rbind(cur, prev)' calls 'rbindlist' if both are data.tables. – Frank
