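Parallel list reduction in CUDA

I am working through the CUDA parallel reduction whitepaper, but unfortunately my algorithm seems to repeatedly produce incorrect results, and I cannot figure out why (surely a textbook example must work? Surely I am just doing something very obviously wrong?). Here is my code: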

My definitions:

#define BLOCK_SIZE 512

My kernel function:

__global__ void total(float * inputList, float * outputList, int len) {
    __shared__ float sdata[2*BLOCK_SIZE];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    // Each thread adds two input elements while loading into shared memory.
    sdata[tid] = inputList[i]+inputList[i+blockDim.x];
    __syncthreads();
    // Tree reduction: halve the number of active threads at each step.
    for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // Thread 0 writes this block's partial sum.
    if (tid == 0)
        outputList[blockIdx.x] = sdata[0];
}

My memory allocation:

outputSize = inputSize/(BLOCK_SIZE<<1);
cudaMalloc((void**) &deviceInput, inputSize*sizeof(float));
cudaMalloc((void**) &deviceOutput, outputSize*sizeof(float));
cudaMemcpy(deviceInput, hostInput, inputSize*sizeof(float), cudaMemcpyHostToDevice);

My device call:

dim3 dimGrid((inputSize-1)/BLOCK_SIZE +1, 1, 1); 
dim3 dimBlock(BLOCK_SIZE,1,1); 

total<<<dimBlock, dimGrid>>>(deviceInput, deviceOutput, outputSize); 
cudaDeviceSynchronize();

My memory retrieval:

cudaMemcpy(hostOutput, deviceOutput, outputSize*sizeof(float), cudaMemcpyDeviceToHost);

And finally, my final computation:

for (int counter = 1; counter < outputSize; counter++) { 
    hostOutput[0] += hostOutput[counter]; 
}

Any help would be appreciated.

Source

2013-01-14 Abraham P

The kernel's input parameter 'int len' never seems to be used. – kangshiyin

The kernel launch configuration in the following line of your code is incorrect.

total<<<dimBlock, dimGrid>>>(deviceInput, deviceOutput, outputSize);

The first argument of the kernel configuration is the grid size, and the second argument is the block size.

You should be doing this:

total<<<dimGrid, dimBlock>>>(deviceInput, deviceOutput, outputSize);

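Please always perform error checking on CUDA Runtime function calls, and check the returned error codes to find the reason your program fails.

Your kernel launch should fail in the current code. Checking for errors on the cudaDeviceSynchronize call would have led you to the cause of the incorrect results.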
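As a minimal sketch of the error checking described above: `CUDA_CHECK`, `noop`, and `tryLaunch` are hypothetical names introduced here for illustration, not part of the original code. The sketch assumes a block size of 4096, which exceeds the per-block thread limit on current GPUs, to provoke a launch failure. Launch configuration errors are reported by `cudaGetLastError` right after the launch, so it is queried before synchronizing.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a readable message if a
// CUDA Runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Empty kernel, used only to exercise the launch configuration.
__global__ void noop() {}

// Hypothetical helper: launch one block with the given number of
// threads and return the first error reported, if any.
cudaError_t tryLaunch(unsigned int threadsPerBlock) {
    noop<<<1, threadsPerBlock>>>();
    // Errors in the launch configuration show up here.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // Errors raised while the kernel runs show up here.
        err = cudaDeviceSynchronize();
    }
    return err;
}
```

In the code from the question, the same pattern would be `CUDA_CHECK(cudaGetLastError());` immediately after the `total<<<...>>>` launch, followed by `CUDA_CHECK(cudaDeviceSynchronize());`, with the `cudaMalloc` and `cudaMemcpy` calls wrapped the same way.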

Source

2013-01-14 11:06:13 sgarizvi

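Good point - this will *always* lead to incorrect results. The assumption I mention in my answer would only cause problems in some cases... – Matt

The code assumes that the input size is a multiple of the block size. If inputSize is not a multiple of the block size, the kernel will read past the end of the inputList array.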
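One way to drop that assumption is to guard every load with the element count. The following is a sketch rather than the asker's code: `totalGuarded` and `numBlocks` are hypothetical names, `len` is assumed to carry the number of input elements (the question passes outputSize instead), and the block is assumed to be launched with exactly BLOCK_SIZE threads, a power of two.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 512

// Hypothetical variant of the kernel: each block sums up to
// 2*BLOCK_SIZE elements, and indices at or beyond len contribute 0.
__global__ void totalGuarded(const float *inputList, float *outputList, int len) {
    __shared__ float sdata[BLOCK_SIZE];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + tid;
    unsigned int n = (unsigned int)len;

    // Guarded loads: never read past the end of inputList.
    float sum = 0.0f;
    if (i < n) sum += inputList[i];
    if (i + blockDim.x < n) sum += inputList[i + blockDim.x];
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0) outputList[blockIdx.x] = sdata[0];
}

// Hypothetical helper: number of blocks, and therefore partial sums,
// needed when each block consumes 2*BLOCK_SIZE input elements.
int numBlocks(int inputSize) {
    return (inputSize + 2 * BLOCK_SIZE - 1) / (2 * BLOCK_SIZE);
}
```

Under these assumptions the launch would be `totalGuarded<<<numBlocks(inputSize), BLOCK_SIZE>>>(deviceInput, deviceOutput, inputSize);`, with deviceOutput allocated for `numBlocks(inputSize)` floats so that a trailing partial block still has a slot for its result.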

Source

2013-01-14 10:29:51 Matt
