2011-01-22 117 views

CUDA: embedded for loop kernel

I have some code that I want to turn into a CUDA kernel. Behold:

for (r = Y; r < Y + H; r+=2)
{
    ch1RowSum = ch2RowSum = ch3RowSum = 0;
    for (c = X; c < X + W; c+=2)
    {
        chan1Value = //some calc'd value
        chan3Value = //some calc'd value
        chan2Value = //some calc'd value
        ch2RowSum += chan2Value;
        ch3RowSum += chan3Value;
        ch1RowSum += chan1Value;
    }
    ch1Mean += ch1RowSum/W;
    ch2Mean += ch2RowSum/W;
    ch3Mean += ch3RowSum/W;
}

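I have seen this split into two kernels, one that computes the row sums and one that computes the means. But how should I handle the fact that my loop indices do not start at zero and end at N?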


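Try to pick one question; asking several makes it hard to choose the right answer. As for your second question... it is hard to answer specifically, but I think you will see further once you start developing your kernel. – jmilloy 2011-01-22 23:24:14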


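You should launch your kernel with a configuration of H blocks and W threads per block. You then compute r and c inside the kernel from the blockIdx and threadIdx values, in whatever way you need... I tried to show this in my answer below... – jmilloy 2011-01-22 23:26:18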

Answers


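Let's suppose you have a kernel that computes the three values. Each thread in the launch configuration computes the three values for one (r, c) pair.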

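// Launch as value_kernel<<<H, W>>>(Y, H, X, W): one block per row, one thread per column.
__global__ void value_kernel(int Y, int H, int X, int W)
{
    int r = blockIdx.x + Y;   // row index, offset so that it starts at Y
    int c = threadIdx.x + X;  // column index, offset so that it starts at X (not W)

    chan1value = ...
    chan2value = ...
    chan3value = ...
    // To use these values in a later kernel, write them to global memory here.
}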
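To make the index mapping concrete, here is a minimal sketch of how the offsets X and Y and the stride of 2 from the question's loops could be folded into the kernel. Everything beyond `blockIdx`, `threadIdx` and the CUDA runtime calls is an assumption for illustration: the names `pixel_value`, `offset_value_kernel` and `run_offset_value_kernel` are hypothetical, the elided per-pixel calculation is replaced by a plain image read, and only one channel is shown.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical stand-in for the elided per-pixel calculation: it just reads
// one value from a row-major image with imgW columns.
__device__ float pixel_value(const float *img, int imgW, int r, int c)
{
    return img[r * imgW + c];
}

// One block per sampled row, one thread per sampled column. Block and thread
// indices start at zero; the offsets Y, X and the stride of 2 are applied here.
__global__ void offset_value_kernel(const float *img, int imgW,
                                    int Y, int X, float *out)
{
    int r = Y + 2 * blockIdx.x;   // same rows as: for (r = Y; r < Y + H; r += 2)
    int c = X + 2 * threadIdx.x;  // same columns as: for (c = X; c < X + W; c += 2)
    out[blockIdx.x * blockDim.x + threadIdx.x] = pixel_value(img, imgW, r, c);
}

// Hypothetical host helper: copies the image to the device, launches the
// kernel and copies the (nRows x nCols) sampled values back into hOut.
void run_offset_value_kernel(const float *hImg, int imgW, int imgH,
                             int Y, int H, int X, int W, float *hOut)
{
    int nRows = (H + 1) / 2;  // number of r values visited by the original loop
    int nCols = (W + 1) / 2;  // number of c values visited by the original loop
    assert(nCols <= 1024);    // assumes a limit of 1024 threads per block

    float *dImg = nullptr, *dOut = nullptr;
    size_t imgBytes = sizeof(float) * imgW * imgH;
    size_t outBytes = sizeof(float) * nRows * nCols;
    cudaMalloc(&dImg, imgBytes);
    cudaMalloc(&dOut, outBytes);
    cudaMemcpy(dImg, hImg, imgBytes, cudaMemcpyHostToDevice);

    offset_value_kernel<<<nRows, nCols>>>(dImg, imgW, Y, X, dOut);

    cudaError_t err = cudaMemcpy(hOut, dOut, outBytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(dImg);
    cudaFree(dOut);
}
```

The grid is sized by the number of loop iterations rather than by H and W, so no thread falls outside the region and no bounds check is needed inside the kernel. If the region is wider than one block allows, the columns would have to be spread over several blocks instead.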

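I don't believe you can compute the sums inside value_kernel (not completely in parallel, at least). You won't be able to use += the way the serial loop does, because many threads would be updating the same variable at once. You could put everything in one kernel if only one thread in each block (row) computes the sum and the mean, like this...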

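// Launch as both_kernel<<<H, W, 3 * W * sizeof(float)>>>(...).
// Assumption: the per-row output arrays ch1Mean, ch2Mean, ch3Mean (H floats
// each) are added here because the snippet must store the means somewhere.
__global__ void both_kernel(int Y, int H, int X, int W,
                            float *ch1Mean, float *ch2Mean, float *ch3Mean)
{
    // Shared buffer so that thread 0 can read the values of the other threads.
    extern __shared__ float vals[];
    float *ch1 = vals;
    float *ch2 = vals + blockDim.x;
    float *ch3 = vals + 2 * blockDim.x;

    int r = blockIdx.x + Y;
    int c = threadIdx.x + X;

    ch1[threadIdx.x] = ...
    ch2[threadIdx.x] = ...
    ch3[threadIdx.x] = ...
    __syncthreads();  // all values must be written before they are summed

    if (threadIdx.x == 0)
    {
        float ch1RowSum = 0;
        float ch2RowSum = 0;
        float ch3RowSum = 0;

        for (int i = 0; i < blockDim.x; i++)
        {
            ch1RowSum += ch1[i];
            ch2RowSum += ch2[i];
            ch3RowSum += ch3[i];
        }

        ch1Mean[blockIdx.x] = ch1RowSum / blockDim.x;
        ch2Mean[blockIdx.x] = ch2RowSum / blockDim.x;
        ch3Mean[blockIdx.x] = ch3RowSum / blockDim.x;
    }
}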

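However, it may be better to use the first value kernel and then a second kernel for both the sums and the means... It may be possible to parallelize the kernel below further, and if the kernels are separate you can focus on that one when you are ready.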

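// Launch as sum_kernel<<<H, 1>>>(...): one thread per row.
// Assumption: value_kernel has stored its results in the H x W row-major
// arrays chan1, chan2, chan3, and the means go to per-row output arrays.
__global__ void sum_kernel(int Y, int W,
                           const float *chan1, const float *chan2, const float *chan3,
                           float *ch1Mean, float *ch2Mean, float *ch3Mean)
{
    int r = blockIdx.x + Y;    // image row (not needed to index the value arrays)
    int row = blockIdx.x * W;  // start of this row in the value arrays

    float ch1RowSum = 0;
    float ch2RowSum = 0;
    float ch3RowSum = 0;

    for (int i = 0; i < W; i++)
    {
        ch1RowSum += chan1[row + i];
        ch2RowSum += chan2[row + i];
        ch3RowSum += chan3[row + i];
    }

    ch1Mean[blockIdx.x] = ch1RowSum / W;
    ch2Mean[blockIdx.x] = ch2RowSum / W;
    ch3Mean[blockIdx.x] = ch3RowSum / W;
}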
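One way the summing kernel might be parallelized further is a shared-memory tree reduction, with one block per row cooperating on that row's sum. This is only a sketch under stated assumptions, not the method from the answer: the names `row_mean_kernel` and `run_row_mean` are hypothetical, the values are assumed to sit in a row-major nRows x nCols array already written by an earlier kernel, only one channel is shown, and the block size is assumed to be a power of two.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// One block per row. vals is an nRows x nCols row-major array (assumed to be
// filled by an earlier kernel). blockDim.x must be a power of two.
__global__ void row_mean_kernel(const float *vals, int nCols, float *rowMean)
{
    extern __shared__ float partial[];  // blockDim.x floats
    const float *row = vals + blockIdx.x * nCols;

    // Each thread first adds up a strided slice of the row.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < nCols; i += blockDim.x)
        sum += row[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction: halve the number of active threads at every step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    // Thread 0 holds the row sum; divide by the number of values summed.
    if (threadIdx.x == 0)
        rowMean[blockIdx.x] = partial[0] / nCols;
}

// Hypothetical host helper: copies the values to the device, launches the
// reduction and copies the nRows means back into hRowMean.
void run_row_mean(const float *hVals, int nRows, int nCols, float *hRowMean)
{
    const int threads = 128;  // power of two, as the reduction requires
    float *dVals = nullptr, *dMean = nullptr;
    size_t valBytes = sizeof(float) * nRows * nCols;
    size_t meanBytes = sizeof(float) * nRows;
    cudaMalloc(&dVals, valBytes);
    cudaMalloc(&dMean, meanBytes);
    cudaMemcpy(dVals, hVals, valBytes, cudaMemcpyHostToDevice);

    row_mean_kernel<<<nRows, threads, threads * sizeof(float)>>>(dVals, nCols, dMean);

    cudaError_t err = cudaMemcpy(hRowMean, dMean, meanBytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(dVals);
    cudaFree(dMean);
}
```

Note that this sketch divides by the number of values actually summed. The loop in the question steps by 2 but divides by W, so the divisor would need adjusting to reproduce that code exactly.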