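Efficient way to compute a row-wise correlation/covariance matrix
I have around 3000 files. Each file has around 55000 rows (identifiers) and around 100 columns. I need to calculate the row-wise correlation or weighted covariance for each file (which one depends on the number of columns in the file). The number of rows is the same in all files. What is the most efficient way to calculate the correlation matrix for each file? I have tried Perl and C++, but processing a single file takes a very long time: Perl takes 6 days, and C++ takes more than a day. Ideally I don't want to spend more than 15-20 minutes per file.
Now I would like to know whether there is some trick I can use to process it faster. Here is my pseudocode: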
while (reading from the file handle)
    read the file line by line
    store the column values in hash1, keyed by identifier
    store the mean and ssxx (sum of squared deviations of x from its mean) in hash2 and hash3 respectively (I used a hash of hashes in Perl) by calling the mean and ssxx functions
end
close the file handle
for loop traversing the hash (a nested for loop, since I need the values of two different identifiers to calculate a correlation coefficient)
    calculate ssxy by calling the ssxy function, i.e. the sum of products of the deviations of x and y from their means
    calculate the correlation coefficient
end
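Note that I calculate the correlation coefficient for each pair only once, and I do not calculate it for an identifier with itself; my nested for loop takes care of both. Do you think there is a way to compute the correlation coefficients faster? Any hints or suggestions would be great. Thanks!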
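As a rough size check on the pair loop described above: with each unordered pair visited once and self-pairs skipped, n identifiers give n(n-1)/2 pairs. The Python sketch below works this out for the figures quoted in the question (about 55000 rows, about 100 columns). The function names and the 8-bytes-per-coefficient figure are illustrative assumptions, not part of the original post.

```python
def pair_count(n_rows: int) -> int:
    """Number of unordered pairs of distinct rows: n * (n - 1) / 2."""
    return n_rows * (n_rows - 1) // 2


def estimate(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> dict:
    """Rough work and storage estimate for all pairwise correlations.

    bytes_per_value=8 assumes each coefficient is stored as a double;
    that is an assumption, not something stated in the question.
    """
    pairs = pair_count(n_rows)
    return {
        "pairs": pairs,
        # one multiply-add per column for every pair's cross-product sum
        "multiply_adds": pairs * n_cols,
        "result_bytes": pairs * bytes_per_value,
    }


if __name__ == "__main__":
    for key, value in estimate(55_000, 100).items():
        print(f"{key}: {value:,}")
```

For these figures that is roughly 1.5 billion pairs and about 1.5e11 multiply-adds per file, and storing every coefficient as an 8-byte float would take about 12 GB, so how the results are written out may matter as much as the arithmetic.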
EDIT1: My input file looks like this (first 10 identifiers):
"Ident_01" 6453.07 8895.79 8145.31 6388.25 6779.12
"Ident_02" 449.803 367.757 302.633 318.037 331.55
"Ident_03" 16.4878 198.937 220.376 91.352 237.983
"Ident_04" 26.4878 398.937 130.376 92.352 177.983
"Ident_05" 36.4878 298.937 430.376 93.352 167.983
"Ident_06" 46.4878 498.937 560.376 94.352 157.983
"Ident_07" 56.4878 598.937 700.376 95.352 147.983
"Ident_08" 66.4878 698.937 990.376 96.352 137.983
"Ident_09" 76.4878 798.937 120.376 97.352 117.983
"Ident_10" 86.4878 898.937 450.376 98.352 127.983
EDIT2: Here are the subroutines (functions) that I wrote in Perl:
use List::Util qw(sum);   # sum() is used by mean()

## Pearson Correlation Coefficient
## Each argument is a hash ref holding the row values (string), mean and ssxx
sub correlation {
    my ($arr1, $arr2) = @_;
    my $ssxy = ssxy($arr1->{string}, $arr2->{string}, $arr1->{mean}, $arr2->{mean});
    my $cor = $ssxy / sqrt($arr1->{ssxx} * $arr2->{ssxx});
    return $cor;
}

## Mean
sub mean {
    my $arr1 = shift;
    my $mu_x = sum(@$arr1) / scalar(@$arr1);
    return $mu_x;
}

## Sum of Squared Deviations of x from the mean, i.e. ssxx
sub ssxx {
    my ($arr1, $mean_x) = @_;
    my $ssxx = 0;
    ## looping over all the samples
    for (my $i = 0; $i < @$arr1; $i++) {
        $ssxx = $ssxx + ($arr1->[$i] - $mean_x)**2;
    }
    return $ssxx;
}

## Sum of products of the deviations of x and y from their means, i.e. ssxy
sub ssxy {
    my ($arr1, $arr2, $mean_x, $mean_y) = @_;
    my $ssxy = 0;
    ## looping over all the samples
    for (my $i = 0; $i < @$arr1; $i++) {
        $ssxy = $ssxy + ($arr1->[$i] - $mean_x) * ($arr2->[$i] - $mean_y);
    }
    return $ssxy;
}
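For comparison, here is a minimal Python sketch of one common restructuring, which is not the code from the post: center each row and scale it to unit length once, after which the Pearson coefficient of any pair is a plain dot product, equal to ssxy / sqrt(ssxx * ssyy). The names parse_rows, normalize and correlation_pairs are hypothetical, and the parser assumes the whitespace-separated layout with a quoted identifier shown in EDIT1.

```python
import math


def parse_rows(lines):
    """Parse lines such as '"Ident_01" 6453.07 8895.79' into {id: [floats]}.

    Assumes whitespace-separated fields with the quoted identifier first.
    """
    rows = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        rows[fields[0].strip('"')] = [float(f) for f in fields[1:]]
    return rows


def normalize(values):
    """Center a row on its mean and scale it to unit Euclidean length.

    Returns None for a constant row, whose correlation is undefined.
    """
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))  # equals sqrt(ssxx)
    if norm == 0.0:
        return None
    return [c / norm for c in centered]


def correlation_pairs(rows):
    """Yield (id_a, id_b, r) once per unordered pair of distinct rows."""
    ids = list(rows)
    unit = [normalize(rows[k]) for k in ids]  # done once per row, not per pair
    for i in range(len(ids)):
        if unit[i] is None:
            continue
        for j in range(i + 1, len(ids)):
            if unit[j] is None:
                continue
            # dot product of two unit-length centered rows is Pearson's r
            yield ids[i], ids[j], sum(x * y for x, y in zip(unit[i], unit[j]))


if __name__ == "__main__":
    sample = [
        '"Ident_01" 6453.07 8895.79 8145.31 6388.25 6779.12',
        '"Ident_02" 449.803 367.757 302.633 318.037 331.55',
        '"Ident_03" 16.4878 198.937 220.376 91.352 237.983',
    ]
    for a, b, r in correlation_pairs(parse_rows(sample)):
        print(a, b, round(r, 4))
```

Stacking the normalized rows into a matrix Z makes the full set of coefficients the product of Z with its transpose, so in principle an optimized matrix-multiplication routine can stand in for the two inner loops.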
Could you provide an excerpt of a typical input file? – MBo 2014-09-28 05:44:44
I have added the first 10 lines of the file. – snape 2014-09-28 06:34:18
Apart from the performance problem, your calculation may be incorrect. – 2014-09-28 12:43:40