Faster grpstats

I have a large dataset in MATLAB (1924014 by 5; ~73.4 MB):

Date   id   a   b   c 
... 
733234  1467   1.2656  1.2718  51.16  
733235  1467   1.2732  1.2794  51.16  
733236  1467   1.2781  1.2844  51.5  
733236  1467   1.26   NaN  NaN  
733237  1467   1.3084   NaN  NaN  
733237  1467   1.3205   NaN  NaN  
733238  1467   1.3125  1.3188  53.85  
733238  1467    1.3   NaN  NaN  
...

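Date is a date in datenum form.
I need to average (ignoring NaNs) the last three columns over each unique Date + id pair, because a given Date + id pair sometimes has more than one row.

The output I want is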

Date   id   mean_a  mean_b  mean_c 
... 
733234  1467   1.2656  1.2718  51.16
733235  1467   1.2732  1.2794  51.16
733236  1467   1.2691  1.2844  51.5
733237  1467   1.3144   NaN  NaN
733238  1467   1.3062  1.3188  53.85
...

I was hoping to be able to use

grpstats(myDataset, {'Date', 'id'}, 'mean')

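but it is slow. I expected this task to finish in under 60 seconds. I think grpstats is adding a GroupCount column and a name for each observation, neither of which I need.

How can I do this quickly? I am open to any solution, whether or not it uses grpstats.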

2013-06-19 Prashant Kumar

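Group by date and id with unique(...,'rows'), then build the accumulation subs for the multiple columns with meshgrid() (or explicitly with repmat()), and finally apply @nanmean with accumarray():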

% Group by date and id 
[un,~,pos] = unique(db(:,1:2),'rows'); 

% Produce row, col subs 
[col,row] = meshgrid(1:3,pos); 

% Accumulate 
[un accumarray([row(:), col(:)], reshape(db(:,3:5),[],1),[],@nanmean)]
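
To make the indexing concrete, here is a minimal sketch that runs the same steps on the eight sample rows from the question. It assumes db is a plain numeric matrix, and it uses mean(x,'omitnan') in place of nanmean so that it does not depend on the Statistics Toolbox (an assumption: this option needs R2015a or later). The variable names vals and out are only for illustration.

```matlab
% Sample rows from the question: [Date id a b c]
db = [733234 1467 1.2656 1.2718 51.16
      733235 1467 1.2732 1.2794 51.16
      733236 1467 1.2781 1.2844 51.5
      733236 1467 1.26   NaN    NaN
      733237 1467 1.3084 NaN    NaN
      733237 1467 1.3205 NaN    NaN
      733238 1467 1.3125 1.3188 53.85
      733238 1467 1.3    NaN    NaN];

% pos(k) is the row of un that db(k,1:2) belongs to
[un,~,pos] = unique(db(:,1:2),'rows');

% One (group, column) subscript pair for every value in db(:,3:5)
[col,row] = meshgrid(1:3,pos);

% NaN-ignoring mean for each (group, column) cell
vals = db(:,3:5);
out = [un accumarray([row(:), col(:)], vals(:), [], @(x) mean(x,'omitnan'))];
```

Here out has one row per unique Date + id pair (five rows for this sample), and a cell whose inputs are all NaN stays NaN.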

2013-06-19 16:42:00 Oleg

Very promising! Under 30 seconds on my machine. I really need to learn how to use meshgrid/reshape. Checking the output now...

Time the meshgrid call; if it takes long enough, e.g. 1/3 of the total time, I will post the repmat approach to creating the subs. – Oleg
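
The repmat approach mentioned in this comment was not posted in the thread. The following is only one possible sketch of it, not Oleg's code: it builds the same two subscript vectors as meshgrid and checks that they match. The small pos vector and the names nCols, rowR, colR and same are hypothetical.

```matlab
% Hypothetical group index, shaped like the third output of unique(...,'rows')
pos   = [1; 2; 2; 3];
nCols = 3;                       % number of value columns (a, b, c)
n     = numel(pos);

% repmat version: group index stacked once per column,
% and each column number repeated once per row
rowR = repmat(pos, nCols, 1);
colR = repmat(1:nCols, n, 1);
colR = colR(:);

% meshgrid version from the answer, for comparison
[col,row] = meshgrid(1:nCols,pos);

same = isequal(rowR, row(:)) && isequal(colR, col(:));
```

Wrapping each variant in tic/toc on the real data would show whether building the subs is a meaningful share of the run time.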

The data looks good to me! This is fast enough for my purposes. At the moment, 99% of the time is spent in the accumulation.
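
If the accumulation step ever becomes the bottleneck, one option that was not proposed in the thread is to avoid the function handle altogether: accumulate the sum and the count of the non-NaN values with accumarray's default summation, then divide. This is a sketch under that assumption and has not been timed on the asker's data; the names ok, subs, sz, s and n are illustrative.

```matlab
% Four of the sample rows from the question: [Date id a b c]
db = [733236 1467 1.2781 1.2844 51.5
      733236 1467 1.26   NaN    NaN
      733237 1467 1.3084 NaN    NaN
      733237 1467 1.3205 NaN    NaN];

[un,~,pos] = unique(db(:,1:2),'rows');
vals = db(:,3:5);
[col,row] = meshgrid(1:3,pos);
subs = [row(:), col(:)];

% Keep only the non-NaN values and their subscripts
ok = ~isnan(vals(:));
sz = [size(un,1), 3];

s = accumarray(subs(ok,:), vals(ok), sz);   % sum of non-NaN values per cell
n = accumarray(subs(ok,:), 1, sz);          % count of non-NaN values per cell

% Cells with no valid values give 0/0, which is NaN, matching nanmean
out = [un, s ./ n];
```

The result should agree with the @nanmean version up to floating-point rounding.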
