Efficiently count the number of occurrences of unique subarrays in NumPy?

I have an array of shape (128, 36, 8), and I would like to find the number of occurrences of each unique subarray of length 8 in the last dimension.

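I am aware of np.unique and np.bincount, but those seem to work on individual elements rather than subarrays. I have seen this question, but it is about finding the first occurrence of a particular subarray, not the counts of all unique subarrays.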

2015-06-16 Will

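I can't think of a way to do this within numpy, but would a [trie](https://en.wikipedia.org/wiki/Trie) be too slow? It only needs to visit each element once, and at the end you automatically get the number of unique subarrays as well as their locations (if you store them). – KobeJohn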

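Here is a closely related question: http://stackoverflow.com/questions/8560440/removing-duplicate-columns-and-rows-from-a-numpy-2d-array. The basic idea is to sort the subarrays lexicographically. Once identical subarrays are grouped together, identifying and counting them is trivial. –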

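The question states that the input array has shape (128, 36, 8) and that we are interested in the unique subarrays of length 8 along the last dimension. So I am assuming that uniqueness is evaluated across the first two dimensions merged together. Let A be the input 3D array.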

Getting the number of unique subarrays

import numpy as np

# Reshape the 3D array to a 2D array, merging the first two dimensions
Ar = A.reshape(-1, A.shape[2])

# Perform a lex sort and use the sorted indices to group identical rows
sorted_idx = np.lexsort(Ar.T)
sorted_Ar = Ar[sorted_idx, :]

# Count the rows that differ from the row before them (at least one
# nonzero difference); adding 1 accounts for the first row
unq_out = np.any(np.diff(sorted_Ar, axis=0), 1).sum() + 1

Sample run for the number of unique subarrays:

In [159]: A # A is (2,2,3)
Out[159]:
array([[[0, 0, 0],
        [0, 0, 2]],

       [[0, 0, 2],
        [2, 0, 1]]])

In [160]: unq_out
Out[160]: 3

Getting the number of occurrences of unique subarrays

import numpy as np

# Reshape the 3D array to a 2D array, merging the first two dimensions
Ar = A.reshape(-1, A.shape[2])

# Perform a lex sort and use the sorted indices to group identical rows
sorted_idx = np.lexsort(Ar.T)
sorted_Ar = Ar[sorted_idx, :]

# Assign an ID to each sorted row; the ID increases whenever a row differs
# from the one before it ("ids" avoids shadowing the builtin "id")
ids = np.append([0], np.any(np.diff(sorted_Ar, axis=0), 1).cumsum())

# Get the count for each ID as the final output
unq_count = np.bincount(ids)

Sample run for the number of occurrences:

In [64]: A
Out[64]:
array([[[0, 0, 2],
        [1, 1, 1]],

       [[1, 1, 1],
        [1, 2, 0]]])

In [65]: unq_count
Out[65]: array([1, 2, 1], dtype=int64)
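
As a cross-check on the sample above, here is a minimal sketch that swaps in a different technique: np.unique with its axis argument. That argument is assumed to be available (it was added in NumPy 1.13, after this thread was written), so this is not the method of the answer itself. Note that np.unique orders the unique rows by their first column, whereas np.lexsort(Ar.T) uses the last column as the primary key, so in general the counts can come back in a different order.

```python
import numpy as np

# Same sample array as in the run above
A = np.array([[[0, 0, 2], [1, 1, 1]],
              [[1, 1, 1], [1, 2, 0]]])

# Merge the first two dimensions so that every subarray becomes one row
Ar = A.reshape(-1, A.shape[-1])

# axis=0 treats each row as a single item; return_counts adds the tallies
unq_rows, unq_count = np.unique(Ar, axis=0, return_counts=True)

print(unq_rows)
print(unq_count)
```

Here unq_rows holds the unique subarrays themselves and unq_count[i] is the number of times unq_rows[i] occurs.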

2015-06-17 05:26:21 Divakar

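This is great. I had not thought of using 'np.lexsort', and I did not know about 'np.diff'. But what we really want is the number of *occurrences* of each unique subarray, not just the number of unique subarrays. As with @farhawa's answer, can this approach be adapted to return the unique subarrays together with their counts? – Will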

Great, thank you. By the way, my modification of the original answer seems to be slightly faster than your extension: ~668 μs vs ~685 μs. – Will

@Will Nice! If possible, test it on a larger dataset, such as '(1000, 1000, 8)'. – Divakar

I don't know whether this is the most efficient way, but it should work.

arr = arr.reshape(128*36, 8)
unique_ = []
occurence_ = []

for sub in arr:
    row = sub.tolist()
    if row not in unique_:
        # First time this subarray is seen
        unique_.append(row)
        occurence_.append(1)
    else:
        occurence_[unique_.index(row)] += 1

# Print each unique subarray with its count
for index_, u in enumerate(unique_):
    print(u, "occurrence: %s" % occurence_[index_])

2015-06-16 23:14:15 farhawa

This would work, but I was hoping to avoid native Python functions such as 'tolist' and 'index', which are expensive. Thanks for the answer, though. – Will

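By the way, one obvious optimization of your approach is to keep the counts in a dictionary whose keys are the subarrays as tuples, rather than in lists that we have to keep searching with 'unique_.index'. – Will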

@Will Even better, use 'collections.Counter': 'counts = Counter(tuple(row) for row in arr)' :) –
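
The dictionary and Counter suggestions in the comments above could look roughly like the sketch below. It reuses the small sample array from the earlier answer as assumed input; the name rows and the call to tolist() (so that the tuple keys hold plain Python ints) are illustrative choices, not something stated in the thread.

```python
from collections import Counter

import numpy as np

# Assumed sample input; the array in the question has shape (128, 36, 8)
arr = np.array([[[0, 0, 2], [1, 1, 1]],
                [[1, 1, 1], [1, 2, 0]]])

# Merge the first two dimensions so that every subarray becomes one row
rows = arr.reshape(-1, arr.shape[-1])

# Tuples are hashable, so each row can serve as a dictionary key
counts = Counter(tuple(row) for row in rows.tolist())

# (row, count) pairs, most frequent first
print(counts.most_common())
```

This still loops over the rows in pure Python, so it is unlikely to satisfy the wish to avoid native Python functions, but each lookup is a hash lookup instead of a list search.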

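Here I have modified @Divakar's very helpful answer to return the counts of the unique subarrays along with the subarrays themselves, so that the output matches that of collections.Counter.most_common():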

import numpy as np

# Get the array in 2D form.
arr = arr.reshape(-1, arr.shape[-1])

# Lexicographically sort
sorted_arr = arr[np.lexsort(arr.T), :]

# Get the indices of the last row in each run of identical rows
# (the final run is handled separately below)
diff_idx = np.where(np.any(np.diff(sorted_arr, axis=0), 1))[0]

# Get the unique rows
unique_rows = [sorted_arr[i] for i in diff_idx] + [sorted_arr[-1]]

# Get the number of occurrences of each unique array (the -1 is needed at
# the beginning, rather than 0, because of fencepost concerns)
counts = np.diff(
    np.append(np.insert(diff_idx, 0, -1), sorted_arr.shape[0] - 1))

# Collect the (row, count) pairs sorted by count, most frequent first
most_common = sorted(zip(unique_rows, counts), key=lambda x: x[1], reverse=True)
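
To check the sorting approach above against collections.Counter, here is a rough sketch that wraps the same steps in a function and runs it on random data of the shape given in the question. The function name unique_row_counts, the random test data and the conversion of rows to tuples are all assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def unique_row_counts(arr):
    """Hypothetical wrapper: (row, count) pairs, most frequent first."""
    rows = arr.reshape(-1, arr.shape[-1])
    sorted_rows = rows[np.lexsort(rows.T)]

    # Index of the last row in every run of identical rows
    last_idx = np.flatnonzero(np.any(np.diff(sorted_rows, axis=0), axis=1))
    last_idx = np.append(last_idx, len(sorted_rows) - 1)

    # Run lengths; the leading -1 makes the first run count from index 0
    counts = np.diff(np.insert(last_idx, 0, -1))

    pairs = zip(map(tuple, sorted_rows[last_idx].tolist()), counts.tolist())
    return sorted(pairs, key=lambda x: x[1], reverse=True)


# Random binary data with the shape from the question (assumed test input)
rng = np.random.default_rng(0)
arr = rng.integers(0, 2, size=(128, 36, 8))

result = unique_row_counts(arr)
expected = Counter(map(tuple, arr.reshape(-1, 8).tolist()))

print(dict(result) == dict(expected))
```

Rows with equal counts may appear in a different order than in Counter.most_common(), which is why the comparison goes through dictionaries rather than lists.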

2015-06-17 20:31:49 Will
