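Size-incrementing NumPy array in Python

I just came across the need for an incremental NumPy array in Python, and since I couldn't find anything that implements one, I wrote my own. I'm wondering whether my way is the best way, or whether you can come up with other ideas.

The problem is that I have a 2D array (the program handles nD arrays) whose size is not known in advance, and a variable amount of data has to be concatenated to it along one direction (let's say I have to call np.vstack many times). Every time I concatenate data, I need to sort the array along axis 0 and do other work on it, so I cannot build a long list of arrays and then np.vstack the list in one go. Since memory allocation is expensive, I turned to incremental arrays: I grow the allocated size by more than I currently need (I use 50% increments) in order to minimize the number of allocations.

I coded this up, and you can see it in the following code: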

import numpy as np

class ExpandingArray:

    __DEFAULT_ALLOC_INIT_DIM = 10  # default initial size of an axis if nothing is given by the user
    __DEFAULT_MAX_INCREMENT = 10   # default cap on the growth of an axis per allocation step

    def __init__(self, initData, allocInitDim=None, dtype=np.float64, maxIncrement=None):
        # Dimensions of the view with data on the allocated array (__DIMS <= __ALLOC_DIMS)
        self.__DIMS = np.array(initData.shape)

        self.__MAX_INCREMENT = maxIncrement
        if self.__MAX_INCREMENT is None:
            self.__MAX_INCREMENT = self.__DEFAULT_MAX_INCREMENT

        # Compute the allocation dimensions based on the user's input
        if allocInitDim is None:
            allocInitDim = self.__DIMS.copy()
        else:
            allocInitDim = np.array(allocInitDim, dtype=int)

        while np.any(allocInitDim < self.__DIMS) or np.any(allocInitDim == 0):
            for i in range(len(self.__DIMS)):
                if allocInitDim[i] == 0:
                    allocInitDim[i] = self.__DEFAULT_ALLOC_INIT_DIM
                if allocInitDim[i] < self.__DIMS[i]:
                    # Grow by 50%, capped at the max increment, but by at least 1
                    # (an increment of 0 would loop forever)
                    allocInitDim[i] += max(1, min(allocInitDim[i] // 2, self.__MAX_INCREMENT))

        # Allocate memory
        self.__ALLOC_DIMS = allocInitDim
        self.__ARRAY = np.zeros(tuple(self.__ALLOC_DIMS), dtype=dtype)

        # Set initData (index with a tuple of slices; a list is not accepted by current NumPy)
        sliceIdxs = tuple(slice(d) for d in self.__DIMS)
        self.__ARRAY[sliceIdxs] = initData

    def shape(self):
        return tuple(self.__DIMS)

    def getAllocArray(self):
        return self.__ARRAY

    def getDataArray(self):
        """
        Get the view of the array with data
        """
        sliceIdxs = tuple(slice(d) for d in self.__DIMS)
        return self.__ARRAY[sliceIdxs]

    def concatenate(self, X, axis=0):
        if axis >= len(self.__DIMS):
            raise ValueError("axis number exceeds the number of dimensions")

        # Check dimensions for the remaining axes
        for i in range(len(self.__DIMS)):
            if i != axis and X.shape[i] != self.shape()[i]:
                raise ValueError("Dimensions of the input array are not consistent in the axis %d" % i)

        # Check whether the allocated memory is enough
        needAlloc = False
        while self.__ALLOC_DIMS[axis] < self.__DIMS[axis] + X.shape[axis]:
            needAlloc = True
            # Increase __ALLOC_DIMS by 50%, capped at the max increment, but by at least 1
            self.__ALLOC_DIMS[axis] += max(1, min(self.__ALLOC_DIMS[axis] // 2, self.__MAX_INCREMENT))

        # Reallocate memory and copy the old data, keeping the original dtype
        if needAlloc:
            newArray = np.zeros(tuple(self.__ALLOC_DIMS), dtype=self.__ARRAY.dtype)
            sliceIdxs = tuple(slice(d) for d in self.__DIMS)
            newArray[sliceIdxs] = self.__ARRAY[sliceIdxs]
            self.__ARRAY = newArray

        # Concatenate the new data
        sliceIdxs = []
        for i in range(len(self.__DIMS)):
            if i != axis:
                sliceIdxs.append(slice(self.__DIMS[i]))
            else:
                sliceIdxs.append(slice(self.__DIMS[i], self.__DIMS[i] + X.shape[i]))

        self.__ARRAY[tuple(sliceIdxs)] = X
        self.__DIMS[axis] += X.shape[axis]

This code shows considerably better performance than vstack/hstack for a series of concatenations of random size.
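One way to see where the difference comes from is to count how often the buffer has to be reallocated under different growth policies. The following is a rough sketch in pure Python, not a benchmark: `count_reallocations`, `grow_by_half` and `grow_exact` are hypothetical helpers made up for this illustration, and the starting capacity of 10 and the cap of 1000 simply mirror the defaults and the `maxIncrement` used in this question.

```python
import random

def count_reallocations(chunk_sizes, next_capacity):
    """Count how many times a buffer must be reallocated while appending chunks."""
    capacity, used, reallocs = 10, 0, 0
    for n in chunk_sizes:
        needed = used + n
        if needed > capacity:
            reallocs += 1
            capacity = next_capacity(capacity, needed)
        used = needed
    return reallocs

def grow_by_half(capacity, needed, max_increment=None):
    """Grow in 50% steps (optionally capped per step) until `needed` rows fit."""
    while capacity < needed:
        step = max(1, capacity // 2)
        if max_increment is not None:
            step = min(step, max_increment)
        capacity += step
    return capacity

def grow_exact(capacity, needed):
    """Allocate exactly what is needed, as repeated vstack calls effectively do."""
    return needed

random.seed(0)
chunks = [random.randint(1, 100) for _ in range(10000)]

geometric = count_reallocations(chunks, grow_by_half)
capped = count_reallocations(chunks, lambda c, n: grow_by_half(c, n, max_increment=1000))
exact = count_reallocations(chunks, grow_exact)

print("50% growth:", geometric)
print("50% growth capped at 1000:", capped)
print("exact fit:", exact)
```

Note that once the cap is reached the growth becomes linear rather than geometric, so the capped policy reallocates more often than the uncapped one, though still far less often than exact-fit growth.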

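What I would like to know is: is this the best approach? Is there something in numpy that already does this?

Also, it would be nice to be able to overload the slice assignment operator of np.array, so that ExpandingArray.concatenate() is performed as soon as the user assigns anything outside the actual dimensions. How can such an overload be done?

Test code: here is the code I used to compare vstack with my method. I append random chunks of data with a maximum length of 100.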

import time
import numpy as np

N = 10000

def performEA(N):
    EA = ExpandingArray(np.zeros((0, 2)), maxIncrement=1000)
    for i in range(N):
        nNew = np.random.randint(1, 101)  # random chunk length in [1, 100]
        X = np.random.rand(nNew, 2)
        EA.concatenate(X, axis=0)
        # Perform operations on EA.getDataArray()
    return EA

def performVStack(N):
    A = np.zeros((0, 2))
    for i in range(N):
        nNew = np.random.randint(1, 101)  # random chunk length in [1, 100]
        X = np.random.rand(nNew, 2)
        A = np.vstack((A, X))
        # Perform operations on A
    return A

# time.clock() was removed in Python 3.8; perf_counter() is its replacement
start_EA = time.perf_counter()
EA = performEA(N)
stop_EA = time.perf_counter()

start_VS = time.perf_counter()
VS = performVStack(N)
stop_VS = time.perf_counter()

print("Elapsed Time EA: %.2f" % (stop_EA - start_EA))
print("Elapsed Time VS: %.2f" % (stop_VS - start_VS))

Source

2013-02-22 Daniele Bigoni

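Don't use triple-quoted strings for comments... that's not what they're for... – mgilson 2013-02-22 13:51:05

Good to know. I just saw it :) Thanks – 2013-02-22 13:51:59

@mgilson: hey, it's endorsed by Guido: [link](https://twitter.com/gvanrossum/status/112670605505077248). I do it myself, for what it's worth. :^) – DSM 2013-02-22 15:04:12

I think the most common design pattern for this kind of thing is to simply collect the small arrays in a list. Of course you could do something like dynamic resizing (and if you want to do crazy things, you can also try the array's resize method). I think the typical approach, when you really don't know how large things will get, is to always double the size. And if you do know how large the array will grow, preallocating the whole thing up front is simplest.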

def performVStack_fromlist(N):
    l = []
    for i in range(N):
        nNew = np.random.randint(1, 101)  # random chunk length in [1, 100]
        X = np.random.rand(nNew, 2)
        l.append(X)
    return np.vstack(l)  # all chunks are copied exactly once, at the end

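I'm sure there are use cases where an expanding array is useful (for example when the appended arrays are all very small), but this loop seems better handled with the pattern above. The optimization is mostly about how often you have to copy everything, and with a list like this (apart from the list itself) that happens exactly once. So it is usually much faster.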
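If the data has to be accessible between appends, the doubling strategy mentioned in this answer could look roughly like the following. This is a minimal sketch under stated assumptions: `DoublingBuffer`, its `append` method and its `view` property are hypothetical names, not part of NumPy, and the sketch only grows along axis 0 of a 2D array.

```python
import numpy as np

class DoublingBuffer:
    """Hypothetical helper: rows are appended along axis 0, capacity doubles on overflow."""

    def __init__(self, n_cols, dtype=np.float64, capacity=16):
        self._data = np.empty((capacity, n_cols), dtype=dtype)
        self._size = 0  # number of rows actually in use

    def append(self, rows):
        rows = np.atleast_2d(rows)
        needed = self._size + rows.shape[0]
        if needed > self._data.shape[0]:
            # Double the capacity until the new rows fit, then copy the old rows once
            capacity = max(1, self._data.shape[0])
            while capacity < needed:
                capacity *= 2
            grown = np.empty((capacity, self._data.shape[1]), dtype=self._data.dtype)
            grown[:self._size] = self._data[:self._size]
            self._data = grown
        self._data[self._size:needed] = rows
        self._size = needed

    @property
    def view(self):
        """View (not a copy) of the rows in use; it goes stale after a reallocation."""
        return self._data[:self._size]

buf = DoublingBuffer(n_cols=2)
rng = np.random.default_rng(0)
for _ in range(100):
    buf.append(rng.random((rng.integers(1, 101), 2)))
    buf.view.sort(axis=0)  # stand-in for the per-step work on the data so far

print(buf.view.shape)
```

Because the capacity doubles, the number of full copies grows only logarithmically with the final size, while the data stays available as a single array after every append.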

Source

2013-02-22 15:18:24 seberg

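I actually avoided the list approach because every time I concatenate something I also need to perform other operations on the array (sorting and many other things). I have edited the example and added comments where those additional operations take place. – 2013-02-22 15:40:24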

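When I faced a similar problem, I used ndarray.resize() (http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.resize.html#numpy.ndarray.resize). Most of the time it avoids the reallocation plus full copy. I can't guarantee that it will be faster (it probably will be), but it is so much simpler.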
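As a small illustration of that suggestion, the sketch below grows an array in place along axis 0. It assumes a C-contiguous array that owns its data; `refcheck=False` is used here only to keep the example independent of how many references exist, and it is unsafe if any other array or view still points at the old buffer.

```python
import numpy as np

# The array must own its data: a view (e.g. the result of reshape) cannot be resized
a = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])

# Grow in place along axis 0. For a C-contiguous array the existing rows keep
# their position, because resize works on the flat buffer.
a.resize((5, 2), refcheck=False)

print(a.shape)
print(a[:3])  # the original rows
print(a[3:])  # the newly added rows, filled with zeros
```

Growing along any axis other than the first would shift the existing elements, since resize does not reinterpret the old layout; in that case an explicit copy into a new array is still needed.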

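As for your second question, I don't think overriding slice assignment for the purpose of extending the array is a good idea. That operator is meant for assigning to existing items/slices. If you change that, it is not immediately clear how it should behave in some cases, for example: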

a = MyExtendableArray(np.arange(100)) 
a[200] = 6 # resize to 200? pad [100:200] with what? 
a[90:110] = 7 # assign to existing items AND automagically-allocated items? 
a[::-1][200] = 6 # ...

My suggestion is that slice assignment and appending data should remain separate operations.
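A minimal sketch of that separation, assuming a 1D wrapper: `AppendableArray` is a hypothetical class made up for this illustration. Indexing keeps plain NumPy semantics, so it never grows the array, and growth only happens through an explicit `append()` call.

```python
import numpy as np

class AppendableArray:
    """Hypothetical 1D wrapper: explicit append() grows, indexing never does."""

    def __init__(self, data):
        self._data = np.array(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        return self._data[index]

    def __setitem__(self, index, value):
        # Plain NumPy semantics: an out-of-range integer index raises IndexError
        self._data[index] = value

    def append(self, values):
        # Growth is always an explicit, separate call
        self._data = np.concatenate((self._data, np.atleast_1d(values)))

a = AppendableArray(np.arange(100))
a[90:100] = 7     # assigns to existing items only
a.append([6, 6])  # grows the array by two items
try:
    a[200] = 6    # not silently resized
except IndexError as exc:
    print("IndexError:", exc)
```

The `append()` here copies on every call for brevity; it could be backed by any of the over-allocating buffers discussed in this thread without changing the interface.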

Source

2013-02-22 15:36:13 shx2

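+1 for the advice on overriding. As for resizing, I like the suggestion, but "referencing an array prevents resizing...", and I may need to reference it from outside. – 2013-02-22 15:50:32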
