2016-02-11

numba guvectorize target='parallel' slower than target='cpu'

I've been trying to optimize a piece of Python code that involves computations over large multidimensional arrays, and I'm getting counterintuitive results with Numba. I'm running on a mid-2015 MBP (2.5 GHz quad-core i7, OS X 10.10.5, Python 2.7.11). Consider the following:

import numpy as np 
from numba import jit, vectorize, guvectorize 
import numexpr as ne 
import timeit 

def add_two_2ds_naive(A, B, res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i, j] = A[i, j] + B[i, j]

@jit
def add_two_2ds_jit(A, B, res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i, j] = A[i, j] + B[i, j]

@guvectorize(['float64[:,:],float64[:,:],float64[:,:]'],
             '(n,m),(n,m)->(n,m)', target='cpu')
def add_two_2ds_cpu(A, B, res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i, j] = A[i, j] + B[i, j]

@guvectorize(['float64[:,:],float64[:,:],float64[:,:]'],
             '(n,m),(n,m)->(n,m)', target='parallel')
def add_two_2ds_parallel(A, B, res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i, j] = A[i, j] + B[i, j]

def add_two_2ds_numexpr(A, B, res):
    # write into res rather than rebinding the local name
    ne.evaluate('A+B', out=res)

if __name__=="__main__": 
    np.random.seed(69) 
    A = np.random.rand(10000,100) 
    B = np.random.rand(10000,100) 
    res = np.zeros((10000,100)) 

I can now run timeit on the various functions:

%timeit add_two_2ds_jit(A,B,res) 
1000 loops, best of 3: 1.16 ms per loop 

%timeit add_two_2ds_cpu(A,B,res) 
1000 loops, best of 3: 1.19 ms per loop 

%timeit add_two_2ds_parallel(A,B,res) 
100 loops, best of 3: 6.9 ms per loop 

%timeit add_two_2ds_numexpr(A,B,res) 
1000 loops, best of 3: 1.62 ms per loop 

It seems the 'parallel' target isn't even making full use of a single core: while these run, top shows Python at ~40% CPU for 'parallel', ~100% for 'cpu', and ~300% for numexpr.


But the point of guvectorize is that the operation you define is applied over any _extra_ dimensions (and that is what gets done in parallel). The code you've written won't parallelize itself. So if you changed A, B and res to shape (10000,100,100), the 100 different iterations over the third dimension would be run in parallel. – DavidW


Thanks, I see that I misunderstood the usage. –

Answer


There are two problems with your @guvectorize implementations. The first is that you are doing all the looping inside the @guvectorize kernel, so there is actually nothing for the Numba parallel target to parallelize. In a ufunc/gufunc, both @vectorize and @guvectorize parallelize over the broadcast dimensions. Since the signature of your gufunc is 2D and your inputs are 2D, there is only a single call to the inner function, which explains the 100% CPU usage you saw.

The best way to write the function you have above is to use a regular ufunc:

@vectorize(['float64(float64, float64)'], target='parallel')
def add_ufunc(a, b):
    return a + b

Then on my system I see these speeds:

%timeit add_two_2ds_jit(A,B,res) 
1000 loops, best of 3: 1.87 ms per loop 

%timeit add_two_2ds_cpu(A,B,res) 
1000 loops, best of 3: 1.81 ms per loop 

%timeit add_two_2ds_parallel(A,B,res) 
The slowest run took 11.82 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 2.43 ms per loop 

%timeit add_two_2ds_numexpr(A,B,res) 
100 loops, best of 3: 2.79 ms per loop 

%timeit add_ufunc(A, B, res) 
The slowest run took 9.24 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 2.03 ms per loop 

(This is on a very similar OS X system to yours, but with OS X 10.11.)

Although Numba's parallel ufunc now beats numexpr (and I see add_ufunc using about 280% CPU), it doesn't beat the simple single-threaded CPU case. I suspect the bottleneck is memory (or cache) bandwidth, but I haven't done the measurements to check.

Generally speaking, you will see much more benefit from the parallel ufunc target if you are doing more math operations per memory element (like, say, a cosine).