2D numpy array search (equivalent to Matlab's intersect 'rows' option)

I have two 4-column numpy arrays (2D), cap and usp, each with 100 (float) rows. Considering a subset of 3 columns in each array (e.g. capind=cap[:,:3]):

There are many common rows between the two arrays.
Each row tuple/"triplet" is unique within each array.

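I am looking for an efficient way to identify the common three-value (row) subsets across the two arrays while somehow retaining the fourth column from both arrays for further processing. In essence, I am looking for a good way to do the equivalent of Matlab's intersect with the 'rows' option (i.e. [c, ia, ib]=intersect(capind, uspind, 'rows');), which returns the indices of the matching rows, so that it is then easy to get the matching triplets plus the 4th column values from the original arrays (matchcap=cap[ia,:]).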

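My current approach is based on a similar question on the forum, since I could not find a good match for my problem. However, this approach seems somewhat inefficient given my goal (and I have not fully solved my problem yet):

The arrays look like this: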

cap=array([[ 2.50000000e+01, 1.27000000e+02, 1.00000000e+00, 
     9.81997200e-06], 
    [ 2.60000000e+01, 1.27000000e+02, 1.00000000e+00, 
     9.14296800e+00], 
    [ 2.70000000e+01, 1.27000000e+02, 1.00000000e+00, 
     2.30137100e-04], 
    ..., 
    [ 6.10000000e+01, 1.80000000e+02, 1.06000000e+02, 
     8.44939900e-03], 
    [ 6.20000000e+01, 1.80000000e+02, 1.06000000e+02, 
     4.77729100e-03], 
    [ 6.30000000e+01, 1.80000000e+02, 1.06000000e+02, 
     1.40343500e-03]]) 

usp=array([[ 4.10000000e+01, 1.31000000e+02, 1.00000000e+00, 
     5.24197200e-06], 
    [ 4.20000000e+01, 1.31000000e+02, 1.00000000e+00, 
     8.39178800e-04], 
    [ 4.30000000e+01, 1.31000000e+02, 1.00000000e+00, 
     1.20279900e+01], 
    ..., 
    [ 4.70000000e+01, 1.80000000e+02, 1.06000000e+02, 
     2.48667700e-02], 
    [ 4.80000000e+01, 1.80000000e+02, 1.06000000e+02, 
     4.23304600e-03], 
    [ 4.90000000e+01, 1.80000000e+02, 1.06000000e+02, 
     1.02051300e-03]])

I then convert each 4-column array (usp and cap) into a three-column array (capind and uspind, shown below as integers for ease of viewing).

capind=array([[ 25, 127, 1], 
    [ 26, 127, 1], 
    [ 27, 127, 1], 
    ..., 
    [ 61, 180, 106], 
    [ 62, 180, 106], 
    [ 63, 180, 106]]) 
uspind=array([[ 41, 131, 1], 
    [ 42, 131, 1], 
    [ 43, 131, 1], 
    ..., 
    [ 47, 180, 106], 
    [ 48, 180, 106], 
    [ 49, 180, 106]])

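Using a set operation gives me the matching triplets: carray=np.array([x for x in set(tuple(x) for x in capind) & set(tuple(x) for x in uspind)]).

This seems to work well for finding the row values common to the uspind and capind arrays. I now need to get the 4th column values for the matching rows (i.e. compare carray against the first three columns of the original source arrays, cap and usp, and somehow pull the values from the 4th column).

Is there a better way to achieve this? Otherwise, any help on the best way to retrieve the fourth column values from the source arrays would be greatly appreciated.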

Source

2014-06-10 ith140

Try using dictionaries.

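capind = {tuple(row[:3]): row[3] for row in cap}
uspind = {tuple(row[:3]): row[3] for row in usp}

# In Python 3, dict.keys() supports set operations (Python 2 used viewkeys())
keys = capind.keys() & uspind.keys()
for key in keys:
    # capind[key] and uspind[key] are the fourth-column values
    print(key, capind[key], uspind[key])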

Source

2014-06-10 15:51:58 nneonneo

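That was almost it, with one small correction: capind = {tuple(row[:3]): row[3] for row in cap} and uspind = {tuple(row[:3]): row[3] for row in usp} – ith140

I would like to keep the array structure, since I don't want to iterate over a dictionary. I need to do some array arithmetic on the common elements of cap and usp later. – ith140

You can turn them back into arrays afterwards... – nneonneo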
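As a rough sketch of that last comment, the dictionary result can be packed back into NumPy arrays so that later arithmetic stays vectorized. The tiny cap and usp arrays here are made-up stand-ins for the real data, and the names keys, common, cap4 and usp4 are illustrative only.

```python
import numpy as np

# Made-up stand-ins for the real data: three key columns plus a value column
cap = np.array([[25., 127., 1., 0.10],
                [26., 127., 1., 0.20],
                [27., 127., 1., 0.30]])
usp = np.array([[26., 127., 1., 2.0],
                [27., 127., 1., 3.0],
                [41., 131., 1., 4.0]])

capind = {tuple(row[:3]): row[3] for row in cap}
uspind = {tuple(row[:3]): row[3] for row in usp}

# Sort the common keys so the output order is reproducible
keys = sorted(capind.keys() & uspind.keys())

# Back to arrays: the matching triplets and the fourth column from each source
common = np.array(keys)
cap4 = np.array([capind[k] for k in keys])
usp4 = np.array([uspind[k] for k in keys])

print(common)
print(cap4 * usp4)  # array arithmetic on the matched values
```

The only Python-level loops are over the dictionary keys; everything after that operates on plain arrays.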

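Assuming you already know that rows are unique within each matrix and that common rows exist, here is a solution. The basic idea is to concatenate the two arrays, sort the result so that identical rows end up next to each other, and then take the difference from row to row. If two rows are the same, the first three values of the difference should be near zero.

[ORIGINAL]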

import numpy as np

## Concatenate the matrices together
cu = np.concatenate((cap, usp), axis=0)
print(cu)

## Sort whole rows lexicographically on the first three columns
## (cu.sort(axis=0) would sort each column independently and scramble the rows)
cu = cu[np.lexsort((cu[:, 2], cu[:, 1], cu[:, 0]))]
print(cu)

## Do a forward difference from row to row
cu_diff = np.diff(cu, n=1, axis=0)

## Now calculate the sum of the first three columns
## as it should be zero (or near zero)
cu_diff_s = np.sum(np.abs(cu_diff[:, :-1]), axis=1)

## Find the indices where it is zero
## Change this to be <= eps if you are using float numbers
indices = np.flatnonzero(cu_diff_s == 0)
print(indices)

## And here are the rows...
print(cu[indices, :])

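I made up a data set based on the example above, and it seems to work. There may be faster ways to do this, but this way you don't have to loop over anything. (I don't like loops :-) ).

[UPDATED]

OK. So I added two columns to each matrix. The first added column is a code, 1 for cap and 2 for usp. The last column is simply the index into the original matrix.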

import numpy as np

## Store more info in the array
## The first 4 columns are the initial data
## The fifth column is a code of 1 or 2 (ie cap or usp)
## The sixth column is the index into the original matrix

cap_code = np.column_stack((np.ones(cap.shape[0]), np.arange(cap.shape[0])))
cap_info = np.concatenate((cap, cap_code), axis=1)

usp_code = np.column_stack((2 * np.ones(usp.shape[0]), np.arange(usp.shape[0])))
usp_info = np.concatenate((usp, usp_code), axis=1)

## Concatenate the matrices together
cu = np.concatenate((cap_info, usp_info), axis=0)
print(cu)

## Sort whole rows on the first three columns, then on the code column,
## so that a cap row comes right before its matching usp row
## (cu.sort(axis=0) would sort each column independently and scramble the rows)
cu = cu[np.lexsort((cu[:, 4], cu[:, 2], cu[:, 1], cu[:, 0]))]
print(cu)

## Do a forward difference from row to row
cu_diff = np.diff(cu, n=1, axis=0)

## Now calculate the sum of the first three columns
## as it should be zero (or near zero)
cu_diff_s = np.sum(np.abs(cu_diff[:, :3]), axis=1)

## Find the indices where it is zero
## Change this to be <= eps if you are using float numbers
indices = np.flatnonzero(cu_diff_s == 0)
print(indices)

## And here are the rows...
print(cu[indices, :])
print(cu[indices + 1, :])

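It seems to work on my contrived data. It is getting a bit convoluted, so I don't think I would want to pursue this direction much further.

Good luck!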
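For illustration, here is a small self-contained sketch of the sort-and-difference idea that also recovers row indices in the spirit of Matlab's ia and ib. The data are made up, the helper tag and the names same, first, ia and ib are hypothetical, and the sketch assumes exact equality of the key columns and unique triplets within each array.

```python
import numpy as np

# Made-up stand-ins for the real data
cap = np.array([[25., 127., 1., 0.10],
                [26., 127., 1., 0.20],
                [27., 127., 1., 0.30]])
usp = np.array([[27., 127., 1., 3.0],
                [41., 131., 1., 4.0],
                [26., 127., 1., 2.0]])

def tag(arr, code):
    """Append a source-code column and an original-row-index column."""
    n = arr.shape[0]
    return np.column_stack((arr, np.full(n, code), np.arange(n)))

cu = np.concatenate((tag(cap, 1), tag(usp, 2)), axis=0)

# Sort whole rows: by the triplet first, then by code so cap precedes usp
cu = cu[np.lexsort((cu[:, 4], cu[:, 2], cu[:, 1], cu[:, 0]))]

# Adjacent rows with identical triplets form a (cap, usp) pair
same = np.all(np.diff(cu[:, :3], axis=0) == 0, axis=1)
first = np.flatnonzero(same)

ia = cu[first, 5].astype(int)      # row indices into cap
ib = cu[first + 1, 5].astype(int)  # row indices into usp

print(ia, ib)
print(cap[ia, 3], usp[ib, 3])      # fourth-column values of the matches
```

Because the code column takes part in the sort, each matching pair is stored as a cap row followed by a usp row, which is what keeps track of which values came from which array.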

Source

2014-06-10 17:54:27 brechmos

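I think you should loop if it makes the code faster. NumPy usually saves you from having to loop, but not always. – nneonneo

@nneonneo. Of course. The point is that the underlying code (which has to loop at some level) is almost always faster than a Python loop. List comprehensions may be slightly different since they are optimized. – brechmos

This is close, but I need to know which values came from which array. Once I do this, I lose that. – ith140

An equivalent of Matlab's returned row indices using numpy is the following, which returns a boolean array that is 1 at the indices of the matching rows. For unique, non-repeated rows: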

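import numpy as np

def find_rows_in_array(arr, rows):
    '''
    Return a boolean mask over the rows of arr that is True at the
    first occurrence in arr of each row listed in rows
    '''
    # tmp[i, j] is 1 when arr[i] equals rows[j] in every column
    tmp = np.prod(np.swapaxes(arr[:, :, None], 1, 2) == rows, axis=2)
    return np.sum(np.cumsum(tmp, axis=0) * tmp == 1, axis=1) > 0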

The above returns the indices of the first occurrences only. If you want every matching row returned, then:

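def find_rows_in_array(arr, rows):
    '''
    Return a boolean mask over the rows of arr that is True wherever
    the row of arr equals any row listed in rows
    '''
    # tmp[i, j] is 1 when arr[i] equals rows[j] in every column
    tmp = np.prod(np.swapaxes(arr[:, :, None], 1, 2) == rows, axis=2)
    return np.sum(tmp, axis=1) > 0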

This one is faster. You can swap the arrays as inputs to find the corresponding indices for each array. Enjoy :D
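The same broadcasting comparison can also yield both index vectors at once. The following is only a sketch on made-up data, with match, ia, ib, matchcap and matchusp as illustrative names; note that it builds an intermediate array of shape (rows in cap, rows in usp, 3), so it is meant for arrays of modest size such as the ones in the question.

```python
import numpy as np

# Made-up stand-ins for the real data
cap = np.array([[25., 127., 1., 0.10],
                [26., 127., 1., 0.20],
                [27., 127., 1., 0.30]])
usp = np.array([[27., 127., 1., 3.0],
                [41., 131., 1., 4.0],
                [26., 127., 1., 2.0]])

capind = cap[:, :3]
uspind = usp[:, :3]

# match[i, j] is True when capind[i] equals uspind[j] in every column
match = np.all(capind[:, None, :] == uspind[None, :, :], axis=2)

# Row indices of the matching pairs, analogous to Matlab's ia and ib
ia, ib = np.nonzero(match)

matchcap = cap[ia, :]  # matching triplets plus cap's fourth column
matchusp = usp[ib, :]  # the same triplets plus usp's fourth column

print(ia, ib)
print(matchcap[:, 3], matchusp[:, 3])
```

The pairs come out ordered by their row position in cap rather than sorted by row value as Matlab's intersect does, so an extra sort would be needed if that ordering matters.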

Source

2016-11-21 16:31:00

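The numpy_indexed package (disclaimer: I am its author) contains all the functionality you need, implemented efficiently (i.e. fully vectorized, so no slow loops at the Python level):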

import numpy_indexed as npi 
c = npi.intersection(capind, uspind) 
ia = npi.indices(capind, c) 
ib = npi.indices(uspind, c)

Depending on how you value brevity vs. performance, you might prefer:

import numpy_indexed as npi 
a = npi.as_index(capind) 
b = npi.as_index(uspind) 
c = npi.intersection(a, b) 
ia = npi.indices(a, c) 
ib = npi.indices(b, c)

Source

2016-11-21 18:38:03
