Get non-duplicate rows from a numpy array

Let's say I have a numpy array of the following form:

x = np.array([[2, 5], 
       [3, 4], 
       [1, 3], 
       [2, 5], 
       [4, 5], 
       [1, 3], 
       [1, 4], 
       [3, 4]])

What I'd like to get from this is an array containing only the rows that are NOT duplicated, i.e., for this example I expect:

array([[4, 5], 
     [1, 4]])

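I'm looking for a way to do this that is reasonably fast and scales well. The only way I can think of to do it is:

1. First find the set of unique rows in x, as a new array y.
2. Create a new array z by removing one instance of each row of y from x, so that z is a list of the rows that are duplicated in x.
3. Take the set difference between x and z.

This seems horribly inefficient, though. Does anyone have a better approach?

If it matters, I'm guaranteed that each of my rows is sorted from smallest to largest, so you will never have a row like [5, 2] or [3, 1].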

Source

2016-04-29 zephyr

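Why do you think this is inefficient? With a hash table this should be an O(n) time algorithm, which is quite reasonable. You can't do better than that, since you have to look at every element. –

I expected it to be inefficient because I would have to write the loop myself. I know of no native numpy method that performs step 2. – zephyr

Oh, I see, but I don't think numpy or pandas have this optimized in C code; you may also want to compare the run time against your own loop. –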
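The hash-table idea raised in the comments above can be sketched in plain Python. This is only an illustration, not code from the thread: the function name `rows_occurring_once` is made up here, and the sketch assumes x is a 2-D array whose rows can be converted to hashable tuples.

```python
from collections import Counter

import numpy as np


def rows_occurring_once(x):
    """Return the rows of the 2-D array x that appear exactly once (input order kept)."""
    # Tuples are hashable, so each row can be used as a dictionary key.
    keys = [tuple(row) for row in x.tolist()]
    counts = Counter(keys)                  # first pass: count every row
    keep = [counts[k] == 1 for k in keys]   # second pass: flag rows seen once
    return x[np.array(keep, dtype=bool)]


x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])
result = rows_occurring_once(x)
print(result)
```

Both passes are linear in the number of rows, but they run in the Python interpreter, so timing this against the vectorized answers is worthwhile before choosing it for large inputs.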

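Approach #1

Here's an approach based on np.unique that treats each row as an indexing tuple for efficiency (assuming the input array holds integers; np.ravel_multi_index additionally requires them to be non-negative):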

# Consider each row as indexing tuple & get linear indexing value    
lid = np.ravel_multi_index(x.T,x.max(0)+1) 

# Get counts and unique indices 
_,idx,count = np.unique(lid,return_index=True,return_counts=True) 

# See which counts are exactly 1 and select the corresponding unique indices 
# and thus the corresponding rows from input as the final output 
out = x[idx[count==1]]

Note: if the input array has a huge number of columns, you may want to compute the linear indices lid manually, for which you can use np.cumprod, like so:

lid = x.dot(np.append(1,(x.max(0)+1)[::-1][:-1].cumprod())[::-1])
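As a quick sanity check (a sketch on the question's example array, not part of the original answer), the manual expression can be compared against np.ravel_multi_index. The names `dims`, `weights`, `lid_builtin` and `lid_manual` are introduced here only for illustration.

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

# Extent of each column when every row is read as an index tuple.
dims = x.max(0) + 1

# Linear indices from the built-in helper used in the answer.
lid_builtin = np.ravel_multi_index(x.T, dims)

# Row-major weights: the last column has weight 1, and each earlier column
# is weighted by the product of the extents of all columns after it.
weights = np.append(1, dims[::-1][:-1].cumprod())[::-1]
lid_manual = x.dot(weights)

print(weights)
print(lid_manual)
```

For this array dims is (5, 6), so the weights are (6, 1) and each row [a, b] maps to 6*a + b.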

Approach #2

Here's an alternative that offloads the counting task to np.bincount, which might be more efficient for this purpose:

# Consider each row as indexing tuple & get linear indexing value    
lid = np.ravel_multi_index(x.T,x.max(0)+1) 

# Get unique indices and tagged indices for all elements 
_,unq_idx,tag_idx = np.unique(lid,return_index=True,return_inverse=True) 

# Use the tagged indices to count and look for count==1 and repeat like before 
out = x[unq_idx[np.bincount(tag_idx)==1]]

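Approach #3

Here's a different approach that uses convolution to catch such a pattern. The inline comments should help in understanding the underlying idea. Here goes: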

# Consider each row as indexing tuple & get linear indexing value    
lid = np.ravel_multi_index(x.T,x.max(0)+1) 

# Store sorted indices for lid 
sidx = lid.argsort() 

# Append 1s at either ends of sorted and differentiated version of lid 
mask = np.hstack((True,np.diff(lid[sidx])!=0,True)) 

# Perform convolution on it. Thus non duplicate elements would have 
# consecutive two True elements, which could be caught with convolution 
# kernel of [1,1]. Get the corresponding mask. 
# Index into sorted indices with it for final output 
out = x[sidx[(np.convolve(mask,[1,1])>1)[1:-1]]]
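To make the masking step easier to follow, here is a small walk-through on the linear indices of the question's array (a sketch, with the values worked out by hand as 6*a + b for each row [a, b]). The names `sorted_lid`, `once` and `once_conv` are introduced only for this illustration, and the mask is cast to int before convolving as a defensive choice.

```python
import numpy as np

# Linear indices of the eight example rows.
lid = np.array([17, 22, 9, 17, 29, 9, 10, 22])

sidx = lid.argsort()
sorted_lid = lid[sidx]          # [9, 9, 10, 17, 17, 22, 22, 29]

# True wherever a sorted value differs from its predecessor, padded with
# True at both ends so the first and last values also have two neighbours.
mask = np.hstack((True, np.diff(sorted_lid) != 0, True))

# A value occurs once if it differs from BOTH neighbours, i.e. two
# consecutive True entries in mask.
once = mask[:-1] & mask[1:]

# Same test written as a convolution with the kernel [1, 1].
once_conv = (np.convolve(mask.astype(int), [1, 1]) > 1)[1:-1]

print(sorted_lid[once])
```

The slicing form `mask[:-1] & mask[1:]` and the convolution form select the same positions; the convolution simply sums each pair of adjacent mask entries.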

Source

2016-04-29 21:06:17 Divakar

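I like these approaches. I actually found that your first approach is faster than the second. Great work, though! – zephyr

@zephyr Ah, I see. I think the performance difference between the two approaches depends on the format of the input array! – Divakar

@zephyr Could you test the third approach that was just added? Thanks! – Divakar

Here is a pandas approach: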

# as_matrix() was removed in pandas 1.0; .values works in old and new versions
pd.DataFrame(x.T).T.drop_duplicates(keep=False).values 

#array([[4, 5], 
#  [1, 4]])

Source

2016-04-29 20:50:41

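Is 'keep=False' necessary? I'm getting an unexpected keyword error. – zephyr

Sometimes it can be that easy. I wish I had thought of that 20 minutes ago :-( – MSeifert

Make sure to update your pandas version! Otherwise, for older versions, you can use a two-liner with the mask 'df.duplicated() | df.duplicated(take_last=True)' to remove all duplicates. –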
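The mask in the comment above uses take_last, a keyword from older pandas releases. Assuming a pandas version in which duplicated accepts keep (the same keyword the answer's drop_duplicates call relies on), the same mask could be sketched as follows; the names `df`, `dup_mask` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])
df = pd.DataFrame(x)

# keep='first' flags every repeat after the first occurrence,
# keep='last' flags every repeat before the last occurrence;
# OR-ing the two flags every row that has a duplicate anywhere.
dup_mask = df.duplicated(keep='first') | df.duplicated(keep='last')

result = df[~dup_mask].values
print(result)
```

With the keep keyword available, `df.duplicated(keep=False)` produces the same mask in a single call.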

One possibility (which needs a lot of memory for arrays with many rows) is to first count, for every pair of rows, how many elements are equal:

b = np.sum(x[:, None, :] == x, axis=2) 
b 
array([[2, 0, 0, 2, 1, 0, 0, 0], 
     [0, 2, 0, 0, 0, 0, 1, 2], 
     [0, 0, 2, 0, 0, 2, 1, 0], 
     [2, 0, 0, 2, 1, 0, 0, 0], 
     [1, 0, 0, 1, 2, 0, 0, 0], 
     [0, 0, 2, 0, 0, 2, 1, 0], 
     [0, 1, 1, 0, 0, 1, 2, 1], 
     [0, 2, 0, 0, 0, 0, 1, 2]])

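This array shows how many elements each row has in common with every other row. The diagonal compares each row with itself, so it needs to be set to zero: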

np.fill_diagonal(b, 0) 
b 
array([[0, 0, 0, 2, 1, 0, 0, 0], 
     [0, 0, 0, 0, 0, 0, 1, 2], 
     [0, 0, 0, 0, 0, 2, 1, 0], 
     [2, 0, 0, 0, 1, 0, 0, 0], 
     [1, 0, 0, 1, 0, 0, 0, 0], 
     [0, 0, 2, 0, 0, 0, 1, 0], 
     [0, 1, 1, 0, 0, 1, 0, 1], 
     [0, 2, 0, 0, 0, 0, 1, 0]])

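Now let's see what the maximum is for each row (b is symmetric, so taking the maximum along axis 0 gives the same result):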

c = np.max(b, axis=0) 
c 
array([2, 2, 2, 2, 1, 2, 1, 2])

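Then we need to find the positions where this maximum is != 2 (2 being the number of columns) and use them to index the original array: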

x[np.where([c != 2])[1]] 
array([[4, 5], 
     [1, 4]])
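The steps above hard-code the value 2 because the example has two columns. A minimal sketch that wraps the same steps in a function and reads the column count from the array could look like this; the function name `rows_without_twin` is hypothetical, and the memory cost is still quadratic in the number of rows.

```python
import numpy as np


def rows_without_twin(x):
    """Rows of the 2-D array x that are not equal to any other row."""
    n_cols = x.shape[1]
    # matches[i, j] = number of positions in which row i and row j agree.
    matches = np.sum(x[:, None, :] == x, axis=2)
    # Ignore the comparison of each row with itself.
    np.fill_diagonal(matches, 0)
    # A row has a twin if some other row agrees with it in every column.
    has_twin = (matches == n_cols).any(axis=0)
    return x[~has_twin]


x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])
print(rows_without_twin(x))

# The same function on a three-column input.
y = np.array([[1, 2, 3], [1, 2, 4], [1, 2, 3], [0, 0, 0]])
print(rows_without_twin(y))
```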

Source

2016-04-29 20:50:52 MSeifert

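I like that this keeps everything in numpy. Can you tell me how the 'x[:, None, :]' construct works? I've never seen it before. – zephyr

It adds another dimension. So the result is 3d, with a new axis of length one as the second axis and the original second axis as the third axis. Alternatively you could use 'np.expand_dims(x, axis=1)', which is a bit more intuitive but more to write :-) – MSeifert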
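The effect of the extra axis is easiest to see from the shapes involved. The following sketch uses a placeholder array of the same shape as the one in the question; the variable names are illustrative.

```python
import numpy as np

x = np.arange(16).reshape(8, 2)      # placeholder data, 8 rows and 2 columns

a = x[:, None, :]                    # None inserts a new axis of length 1
b = np.expand_dims(x, axis=1)        # equivalent, more explicit spelling

# Broadcasting (8, 1, 2) against (8, 2) compares every row with every row.
pairwise = a == x

print(a.shape)          # (8, 1, 2)
print(pairwise.shape)   # (8, 8, 2)
```

Summing `pairwise` over its last axis gives the 8-by-8 table of per-pair match counts used in the answer.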

For completeness, see also item 78 of http://www.labri.fr/perso/nrougier/teaching/numpy.100/

Source

2016-04-30 05:53:09

This problem can be solved efficiently with the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi 
x[npi.multiplicity(x) == 1]

Not only is this solution very readable, it is also very efficient and works with any number of columns or dtypes.

Source

2016-04-30 06:47:54
