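ffbase provides the function ffdfdply to split and aggregate data rows. This answer (https://stackoverflow.com/a/20954315/336311) explains how this basically works. I still cannot figure out how to split by multiple columns. How can I split/aggregate a large data frame (ffdf) by multiple columns?
My challenge is that a split variable is required. It must be unique for each combination of the two variables I want to split by. However, in my 4-column data frame (about 50M rows), creating a character vector via paste() requires a lot of memory.
This is where I got stuck...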
require("ff")
require("ffbase")
load.ffdf(dir="ffdf.shares.02")
# Aggregation by articleID/measure
levels(ffshares$measure) # "comments", "likes", "shares", "totals", "tw"
splitBy = paste(as.character(ffshares$articleID), ffshares$measure, sep="")
tmp = ffdfdply(ffshares, split=splitBy, FUN=function(x) {
  return(list(
    "articleID" = x[1,"articleID"],
    "measure" = x[1,"measure"],
    # I need vectors for each entry
    "sx" = unlist(x$value),
    "st" = unlist(x$time)
  ))
})
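Of course, I could use shorter levels for ffshares$measure or simply use the numeric codes, but that still would not solve the underlying problem: splitBy grows very large.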
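One idea for avoiding the character vector, sketched here in base R on a small in-memory data frame: combine the two columns arithmetically into a single integer key, which costs 4 bytes per row instead of one string per row. The names `toy`, `n_levels` and `key` are made up for illustration. The sketch assumes articleID is an integer and that articleID times the number of levels stays below `.Machine$integer.max`. Whether the same arithmetic can be applied directly to ff vectors is not tested here, so treat this as an illustration of the key, not a finished ffdf solution.

```r
# Toy stand-in for the ffdf; column names follow the question.
toy <- data.frame(
  articleID = c(41L, 41L, 41L, 42L, 42L),
  measure   = factor(c("shares", "tw", "shares", "shares", "tw"),
                     levels = c("comments", "likes", "shares", "totals", "tw"))
)

# One integer per (articleID, measure) pair:
# articleID * number_of_levels + zero-based level code.
n_levels <- nlevels(toy$measure)
key <- toy$articleID * n_levels + (as.integer(toy$measure) - 1L)

# The key is unique per combination and can be decoded again.
decoded_article <- key %/% n_levels
decoded_measure <- levels(toy$measure)[key %% n_levels + 1L]
```

Because the key is reversible, the two original columns do not have to be carried along separately just to label each group afterwards.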
Sample data
     articleID  measure                time value
100         41   shares 2015-01-03 23:20:34     4
101         41       tw 2015-01-03 23:30:30    24
102         41   totals 2015-01-03 23:30:38     6
103         41    likes 2015-01-03 23:30:38     2
104         41 comments 2015-01-03 23:30:38     0
105         41   shares 2015-01-03 23:30:38     4
106         41       tw 2015-01-03 23:40:24    24
107         41   totals 2015-01-03 23:40:35     6
108         41    likes 2015-01-03 23:40:35     2
...
1000        42   shares 2015-01-04 20:10:50     0
1001        42       tw 2015-01-04 21:10:45    24
1002        42   totals 2015-01-04 21:10:35     0
1003        42    likes 2015-01-04 21:10:35     0
1004        42 comments 2015-01-04 21:10:35     0
1005        42   shares 2015-01-04 21:10:35     0
1006        42       tw 2015-01-04 22:10:45    24
1007        42   totals 2015-01-04 22:10:43     0
1008        42    likes 2015-01-04 22:10:43     0
...
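For anyone who wants to experiment, the first six rows above can be rebuilt as an ordinary data frame. The sketch below also shows, in plain base R and not with ffdfdply, the kind of per-group result I am after: one entry per articleID/measure pair that holds the vectors of values and times. The object names `sample_rows`, `groups` and `result` are illustrative only, and the UTC time zone is an assumption.

```r
# First six sample rows as an in-memory data frame.
sample_rows <- data.frame(
  articleID = rep(41L, 6),
  measure   = factor(c("shares", "tw", "totals", "likes", "comments", "shares"),
                     levels = c("comments", "likes", "shares", "totals", "tw")),
  time      = as.POSIXct(c("2015-01-03 23:20:34", "2015-01-03 23:30:30",
                           "2015-01-03 23:30:38", "2015-01-03 23:30:38",
                           "2015-01-03 23:30:38", "2015-01-03 23:30:38"),
                         tz = "UTC"),  # time zone assumed
  value     = c(4, 24, 6, 2, 0, 4)
)

# Split on both columns at once; drop = TRUE skips empty combinations.
groups <- split(sample_rows,
                list(sample_rows$articleID, sample_rows$measure),
                drop = TRUE)

# One list entry per group, holding the value and time vectors.
result <- lapply(groups, function(x) {
  list(articleID = x$articleID[1],
       measure   = as.character(x$measure[1]),
       sx        = x$value,
       st        = x$time)
})
```

This works only because six rows fit in memory; at 50M rows the grouping itself is the part that has to happen chunk by chunk.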
Can you provide sample data? –
You're welcome. It is very simple data, just a lot of it :) – BurninLeo