关于在我的代码中加快/ if语句的建议？

我的代码需要大约两个小时来处理。瓶颈是for循环，如果声明（请参阅代码中的注释）。我是初学者与蟒蛇:)任何人都可以推荐一种高效的Python方法来替换嵌套和if语句？关于在我的代码中加快/ if语句的建议？

我有〜3000万行的表，每一行与（X，Y，Z）值：

20.0 11.3 7
21.0 11.3 0
22.0 11.3 3
...

我所需的输出是x，y，min（z），count（min（z））形式的表格。最后的列是该（x，y）处最小z值的最终计数。例如：

20.0 11.3 7 7
21.0 11.3 0 10
22.0 11.3 3 1
...

这里只有大约600独特的坐标，所以输出表将被600x4。我的代码：

import numpy as np 
file = open('input.txt','r'); 

coordset = set() 
data = np.zeros((600,4))*np.nan 
irow = 0 
ctr = 0 

for row in file: 
    item = row.split() 
    x = float(item[0]) 
    y = float(item[1]) 
    z = float(item[2]) 

    # build unique grid of coords 
    if ((x,y)) not in coordset: 
     data[irow][0] = x 
     data[irow][1] = y 
     data[irow][2] = z 
     irow = irow + 1  # grows up to 599 

    # lookup table of unique coords 
    coordset.add((x,y)) 

    # BOTTLENECK. replace ifs? for? 
    for i in range(0, irow): 
     if data[i][0]==x and data[i][1]==y: 
      if z > data[i][2]: 
       continue 
      elif z==data[i][2]: 
       ctr = ctr + 1 
       data[i][3]=ctr 
      if z < data[i][2]: 
       data[i][2] = z 
       ctr = 1 
       data[i][3]=ctr

编辑：上供@Joowani的方法计算在1m26s。我原来的方法，相同的电脑，相同的数据文件，106m23s。 编辑2： @Ophion和@Sibster感谢您的建议，我没有足够的信用+1 +1有用的答案。

来源

2013-08-19 3150

是3千万行真的很棒保存在TXT？你不应该看一些更复杂的格式来保存（并读入）你的数据吗？此外，我建议尽可能向量化（numpy），因为这会将for-loops推入numpy，这是C（因此更快） – usethedeathstar

您的解决方案似乎很慢，因为它会在您更新时通过每个遍历列表（即数据）。一个更好的方法是使用一个字典，它将O（1）与每个更新的O（n）相对照。

这里是使用字典我的解决方案：

file = open('input.txt', 'r') #coordinates c = {} for line in file: #items (x, y, z) = (float(n) for n in line.split()) if (x, y) not in c: c[(x, y)] = [z, 1] elif c[(x, y)][0] > z: c[(x, y)][0], c[(x, y)][1] = z, 1 elif c[(x, y)][0] == z: c[(x, y)][1] += 1 for key in c: print("{} {} {} {}".format(key[0], key[1], c[key][0], c[key][1]))

来源

2013-08-19 07:13:23 Joohwan

要添加一些，我找到了一个引用（http://wiki.python.org/moin/TimeComplexity）这解释了字典操作中的一些效率。 – 3150

为什么不改变最后如果elif？

就像现在这样做，您将在循环的每次迭代中评估z < data[i][2]:。

你甚至可以只用一个人换掉它，因为你已经检查if z>data[i][2]和z == data[i][2]所以剩下的唯一可能性是z < data[i][2]:

所以，下面的代码会做同样的，应该是更快：

 if z > data[i][2]: 
      continue 
     elif z==data[i][2]: 
      ctr = ctr + 1 
      data[i][3]=ctr 
     else: 
      data[i][2] = z 
      ctr = 1 
      data[i][3]=ctr

来源

2013-08-19 07:00:30 Sibster

要numpy的使用np.unique做到这一点。

def count_unique(arr): 
    row_view=np.ascontiguousarray(a).view(np.dtype((np.void,a.dtype.itemsize * a.shape[1]))) 
    ua, uind = np.unique(row_view,return_inverse=True) 
    unique_rows = ua.view(a.dtype).reshape(ua.shape + (-1,)) 
    count=np.bincount(uind) 
    return np.hstack((unique_rows,count[:,None]))

首先让检查了一小阵：

a=np.random.rand(10,3) 
a=np.around(a,0) 

print a 
[[ 0. 0. 0.] 
[ 0. 1. 1.] 
[ 0. 1. 0.] 
[ 1. 0. 0.] 
[ 0. 1. 1.] 
[ 1. 1. 0.] 
[ 1. 0. 1.] 
[ 1. 0. 1.] 
[ 1. 0. 0.] 
[ 0. 0. 0.]] 

print output 
[[ 0. 0. 0. 2.] 
[ 0. 1. 0. 1.] 
[ 0. 1. 1. 2.] 
[ 1. 0. 0. 2.] 
[ 1. 0. 1. 2.] 
[ 1. 1. 0. 1.]] 

print np.sum(output[:,-1]) 
10

看起来不错！现在，让我们检查一个大阵：

a=np.random.rand(3E7,3) 
a=np.around(a,1) 

output=count_unique(a) 
print output.shape 
(1331, 4) #Close as I can get to 600 unique elements. 

print np.sum(output[:,-1]) 
30000000.0

注意到约33第二个我的机器上的3GB内存，所有内存中做这个大型阵列很可能是你的瓶颈。对于参考@ Joowani的解决方案花了大约130秒，虽然这是一个苹果和橘子比较，因为我们开始与一个numpy阵列。你的milage可能会有所不同。

要作为numpy的阵列我想查看的问题here数据的读取，但它看起来应该像下面这样：

arr=np.genfromtxt("./input.txt", delimiter=" ")

加载在如此多的数据从一个txt文件，我真的建议使用该链接中的pandas示例。

来源

2013-08-19 13:11:50 Daniel

关于在我的代码中加快/ if语句的建议？

回答

相关问题