2016-03-03 51 views
3

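ffbase provides the function ffdfdply to split and aggregate data rows. This answer (https://stackoverflow.com/a/20954315/336311) explains how this basically works. I still cannot figure out how to split by multiple columns. How do I split/aggregate a large data frame (ffdf) by multiple columns?

My challenge is that a split variable is required. It must be unique for each combination of the two variables I want to split by. However, in my 4-column data frame (about 50M rows), creating such a character vector via paste() needs a lot of memory.

This is where I am stuck...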

require("ff") 
require("ffbase") 
load.ffdf(dir="ffdf.shares.02") 

# Aggregation by articleID/measure 
levels(ffshares$measure) # "comments", "likes", "shares", "totals", "tw" 
splitBy = paste(as.character(ffshares$articleID), ffshares$measure, sep="") 

tmp = ffdfdply(ffshares, split=splitBy, FUN=function(x) {
    return(list(
        "articleID" = x[1,"articleID"],
        "measure" = x[1,"measure"],
        # I need vectors for each entry
        "sx" = unlist(x$value),
        "st" = unlist(x$time)
    ))
})

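Of course, I could use shorter levels for ffshares$measure or simply use numeric codes, but that would not solve the underlying problem: splitBy still grows very large.

Sample data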
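The numeric-code idea mentioned above can be sketched in base R as follows. This is only an illustration on a small in-memory data frame (the toy object `shares` and the name `split_key` are made up for this sketch); on a real ffdf the same arithmetic would have to be applied chunk by chunk, and it does not by itself reduce the number of distinct groups.

```r
# Hypothetical toy stand-in for the ffdf, small enough to live in RAM
shares <- data.frame(
  articleID = c(41L, 41L, 41L, 42L, 42L),
  measure   = factor(c("shares", "tw", "shares", "shares", "tw"),
                     levels = c("comments", "likes", "shares", "totals", "tw")),
  value     = c(4, 24, 4, 0, 24)
)

n_levels <- nlevels(shares$measure)

# The factor code of measure lies in 1..n_levels, so subtracting 1 and adding
# it to articleID * n_levels gives one distinct number per
# (articleID, measure) pair. as.numeric() avoids integer overflow for large IDs.
split_key <- as.numeric(shares$articleID) * n_levels +
  (as.integer(shares$measure) - 1L)
```

A numeric key like this takes 8 bytes per row, whereas a pasted character key needs a pointer per row plus the distinct strings themselves.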

     articleID  measure                time value
 100        41   shares 2015-01-03 23:20:34     4
 101        41       tw 2015-01-03 23:30:30    24
 102        41   totals 2015-01-03 23:30:38     6
 103        41    likes 2015-01-03 23:30:38     2
 104        41 comments 2015-01-03 23:30:38     0
 105        41   shares 2015-01-03 23:30:38     4
 106        41       tw 2015-01-03 23:40:24    24
 107        41   totals 2015-01-03 23:40:35     6
 108        41    likes 2015-01-03 23:40:35     2
...
1000        42   shares 2015-01-04 20:10:50     0
1001        42       tw 2015-01-04 21:10:45    24
1002        42   totals 2015-01-04 21:10:35     0
1003        42    likes 2015-01-04 21:10:35     0
1004        42 comments 2015-01-04 21:10:35     0
1005        42   shares 2015-01-04 21:10:35     0
1006        42       tw 2015-01-04 22:10:45    24
1007        42   totals 2015-01-04 22:10:43     0
1008        42    likes 2015-01-04 22:10:43     0
...
+0

Can you provide some sample data? –

+0

You're welcome. It is very simple data, there is just a lot of it :) – BurninLeo

Answers

3
# Use this; it makes sure your data is not loaded into RAM completely but only in chunks of 100000 records
ffshares$splitBy <- with(ffshares[c("articleID", "measure")], paste(articleID, measure, sep=""),
                         by = 100000)
length(levels(ffshares$splitBy)) ## how many levels are in there - don't know from your question

tmp <- ffdfdply(ffshares, split=ffshares$splitBy, FUN=function(x) {
    ## In x you are getting a data.frame in RAM with all records of possibly several articleID/measure combinations
    ## You should write a function which returns a data.frame. E.g. the following returns the mean value by articleID/measure and the first and last timepoint
    x <- data.table::setDT(x)
    xagg <- x[, list(value = mean(value),
                     first.timepoint = min(time),
                     last.timepoint = max(time)), by = list(articleID, measure)]
    ## the function should return a data frame as indicated in the help of ffdfdply, not a list
    data.table::setDF(xagg)
})
## tmp is an ffdf
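
To try out what FUN is expected to do without ff or data.table installed, the same aggregation can be sketched in base R on a toy chunk. This swaps the data.table call for split() and lapply(); the names `chunk` and `agg_chunk` and the toy values are made up for illustration, and it is a sketch of the contract (data.frame in, data.frame out), not a tuned implementation.

```r
# Toy chunk as FUN might receive it: several articleID/measure groups at once
chunk <- data.frame(
  articleID = c(41L, 41L, 41L, 42L),
  measure   = c("shares", "tw", "shares", "shares"),
  time      = as.POSIXct(c("2015-01-03 23:20:34", "2015-01-03 23:30:30",
                           "2015-01-03 23:30:38", "2015-01-04 20:10:50"),
                         tz = "UTC"),
  value     = c(4, 24, 4, 0),
  stringsAsFactors = FALSE
)

agg_chunk <- function(x) {
  # One piece per articleID/measure combination that actually occurs
  groups <- split(x, list(x$articleID, x$measure), drop = TRUE)
  # One summary row per group: mean value, first and last timepoint
  out <- lapply(groups, function(g) {
    data.frame(articleID       = g$articleID[1],
               measure         = g$measure[1],
               value           = mean(g$value),
               first.timepoint = min(g$time),
               last.timepoint  = max(g$time),
               stringsAsFactors = FALSE)
  })
  res <- do.call(rbind, out)
  rownames(res) <- NULL
  res  # a plain data.frame, not a list
}

result <- agg_chunk(chunk)
```

A function of this shape could be passed as FUN to ffdfdply, though on millions of rows the data.table version above is likely to be considerably faster.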
+0

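Heh, both the paste() and the ffdfdply() command kept R busy for a while, probably because of the 400,000 splits in my data. Nevertheless, your solution works. Thank you very much! – BurninLeo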