numpy.std on a memmapped ndarray fails with MemoryError

I have a huge (30 GB) memory-mapped ndarray:

arr = numpy.memmap(afile, dtype=numpy.float32, mode="w+", shape=(n, n,))

After filling it with some values (which goes fine; peak memory usage stays below 1 GB), I want to compute the standard deviation:

print('stdev: {0:4.4f}\n'.format(numpy.std(arr)))

This fails miserably with a MemoryError.

I do not understand why it fails. I would appreciate hints on how to compute the standard deviation in a memory-efficient way.

Environment: venv + Python 3.6.2 + NumPy 1.13.1

Answer

Edit: a numerically stable version that accumulates with math.fsum (see the comment below for why):
import math
import numpy as np

BLOCKSIZE = 1024**2
flat = arr.reshape(-1)  # flat 1-D view of the (n, n) memmap; no copy for C-contiguous data
# For numerical stability. The closer this is to mean(arr), the better.
PIVOT = float(flat[0])

n = flat.size
sum_ = 0.
sum_sq = 0.
for block_start in range(0, n, BLOCKSIZE):
    # Subtract into a new in-memory array; "-=" would write back into the memmap.
    block_data = flat[block_start:block_start + BLOCKSIZE] - PIVOT
    sum_ += math.fsum(block_data)
    sum_sq += math.fsum(block_data ** 2)

stdev = np.sqrt(sum_sq / n - (sum_ / n) ** 2)
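For reference, the corrected block above can be wrapped into a self-contained function and sanity-checked against numpy.std on data that fits in memory. This is a sketch of my own; the function name and the small test array are not from the original answer:

import math
import numpy as np

def blockwise_std(a, blocksize=1024**2):
    """Blockwise pivoted standard deviation; mirrors the corrected block above."""
    flat = a.reshape(-1)
    pivot = float(flat[0])  # should be close to the mean for best stability
    n = flat.size
    sum_ = 0.0
    sum_sq = 0.0
    for start in range(0, n, blocksize):
        block = flat[start:start + blocksize] - pivot
        sum_ += math.fsum(block)
        sum_sq += math.fsum(block ** 2)
    # max() guards against a tiny negative value caused by rounding
    return math.sqrt(max(0.0, sum_sq / n - (sum_ / n) ** 2))

small = np.random.rand(1000, 1000).astype(np.float32)
print(blockwise_std(small))  # should agree with the next line to within float32 rounding error
print(np.std(small))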

Indeed, numpy's implementations of std and mean make full copies of the array and are horribly memory inefficient. Here is a better implementation:

# Memory overhead is BLOCKSIZE * itemsize. Should be at least ~1 MB
# for efficient HDD access.
import numpy as np

BLOCKSIZE = 1024**2
flat = arr.reshape(-1)  # flat 1-D view of the (n, n) memmap
# For numerical stability. The closer this is to mean(arr), the better.
PIVOT = float(flat[0])

n = flat.size
sum_ = 0.
sum_sq = 0.
for block_start in range(0, n, BLOCKSIZE):
    # Subtract into a new in-memory array; "-=" would write back into the memmap.
    block_data = flat[block_start:block_start + BLOCKSIZE] - PIVOT
    sum_ += np.sum(block_data)
    sum_sq += np.sum(block_data ** 2)
stdev = np.sqrt(sum_sq / n - (sum_ / n) ** 2)
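The last line relies on the shift invariance of variance (my note, not part of the original answer): for any constant pivot p, var(x) = E[(x - p)^2] - (E[x - p])^2, and choosing p close to mean(x) keeps both running sums small, which limits cancellation error. A quick numeric check:

import numpy as np

x = np.array([10.0, 12.0, 14.0])
p = x[0]
d = x - p
print(np.mean(d**2) - np.mean(d)**2)  # 2.666..., via the shifted identity
print(np.var(x))                      # 2.666..., numpy's population variance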

Thanks for this, but numpy's `sum` implementation is numerically unstable, and it is precisely on large arrays that it fails badly ([see this SO question](https://stackoverflow.com/questions/33004029/is-numpy-sum-implemented-in-such-a-way-that-numerical-errors-are-avoided)). The corrected (albeit slower) version is shown above. – sophros
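A minimal illustration of the accumulation error the comment refers to, using plain Python floats (my example; numpy's pairwise summation mitigates, but does not eliminate, such effects):

import math

vals = [0.1] * 10
print(sum(vals))        # 0.9999999999999999: naive left-to-right accumulation drifts
print(math.fsum(vals))  # 1.0: fsum tracks exact partial sums and rounds only once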