numpy.std on a memmapped ndarray fails with MemoryError

I have a huge (30 GB) memory-mapped ndarray:

arr = numpy.memmap(afile, dtype=numpy.float32, mode="w+", shape=(n, n,))

After filling it with some values (which goes fine; peak memory usage stays below 1 GB), I want to compute the standard deviation:

print('stdev: {0:4.4f}\n'.format(numpy.std(arr)))

This fails miserably with a MemoryError.

I do not understand why it fails. I would appreciate hints on how to compute the standard deviation in a memory-efficient way.

Environment: venv + Python 3.6.2 + NumPy 1.13.1

Answer

Edit: a numerically stable version that accumulates with math.fsum (see the comment below for why):
import math
import numpy as np

BLOCKSIZE = 1024**2
flat = arr.reshape(-1)  # flat 1-D view of the (n, n) memmap; no copy for C-contiguous data
# For numerical stability. The closer this is to mean(arr), the better.
PIVOT = float(flat[0])

n = flat.size
sum_ = 0.
sum_sq = 0.
for block_start in range(0, n, BLOCKSIZE):
    # Subtract into a new in-memory array; "-=" would write back into the memmap.
    block_data = flat[block_start:block_start + BLOCKSIZE] - PIVOT
    sum_ += math.fsum(block_data)
    sum_sq += math.fsum(block_data ** 2)

stdev = np.sqrt(sum_sq / n - (sum_ / n) ** 2)
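For reference, the corrected block above can be wrapped into a self-contained function and sanity-checked against numpy.std on data that fits in memory. This is a sketch of my own; the function name and the small test array are not from the original answer:

import math
import numpy as np

def blockwise_std(a, blocksize=1024**2):
    """Blockwise pivoted standard deviation; mirrors the corrected block above."""
    flat = a.reshape(-1)
    pivot = float(flat[0])  # should be close to the mean for best stability
    n = flat.size
    sum_ = 0.0
    sum_sq = 0.0
    for start in range(0, n, blocksize):
        block = flat[start:start + blocksize] - pivot
        sum_ += math.fsum(block)
        sum_sq += math.fsum(block ** 2)
    # max() guards against a tiny negative value caused by rounding
    return math.sqrt(max(0.0, sum_sq / n - (sum_ / n) ** 2))

small = np.random.rand(1000, 1000).astype(np.float32)
print(blockwise_std(small))  # should agree with the next line to within float32 rounding error
print(np.std(small))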

Indeed, numpy's implementations of std and mean make full copies of the array and are horribly memory inefficient. Here is a better implementation:

# Memory overhead is BLOCKSIZE * itemsize. Should be at least ~1 MB
# for efficient HDD access.
import numpy as np

BLOCKSIZE = 1024**2
flat = arr.reshape(-1)  # flat 1-D view of the (n, n) memmap
# For numerical stability. The closer this is to mean(arr), the better.
PIVOT = float(flat[0])

n = flat.size
sum_ = 0.
sum_sq = 0.
for block_start in range(0, n, BLOCKSIZE):
    # Subtract into a new in-memory array; "-=" would write back into the memmap.
    block_data = flat[block_start:block_start + BLOCKSIZE] - PIVOT
    sum_ += np.sum(block_data)
    sum_sq += np.sum(block_data ** 2)
stdev = np.sqrt(sum_sq / n - (sum_ / n) ** 2)
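The last line relies on the shift invariance of variance (my note, not part of the original answer): for any constant pivot p, var(x) = E[(x - p)^2] - (E[x - p])^2, and choosing p close to mean(x) keeps both running sums small, which limits cancellation error. A quick numeric check:

import numpy as np

x = np.array([10.0, 12.0, 14.0])
p = x[0]
d = x - p
print(np.mean(d**2) - np.mean(d)**2)  # 2.666..., via the shifted identity
print(np.var(x))                      # 2.666..., numpy's population variance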

Thanks for this, but numpy's `sum` implementation is numerically unstable, and it is precisely on large arrays that it fails badly ([see this SO question](https://stackoverflow.com/questions/33004029/is-numpy-sum-implemented-in-such-a-way-that-numerical-errors-are-avoided)). The corrected (albeit slower) version is shown above. – sophros
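A minimal illustration of the accumulation error the comment refers to, using plain Python floats (my example; numpy's pairwise summation mitigates, but does not eliminate, such effects):

import math

vals = [0.1] * 10
print(sum(vals))        # 0.9999999999999999: naive left-to-right accumulation drifts
print(math.fsum(vals))  # 1.0: fsum tracks exact partial sums and rounds only once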