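I am developing a simple radix-2 FFT algorithm on a Mac using OpenCL 1.2. I am trying to run it on the HD 5000 graphics card in my laptop, calling clEnqueueNDRangeKernel in a loop.
My host code looks like this: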
gws=4;
lws=1;
for (cur_iter = 0; cur_iter <= 2; cur_iter++) {
    ret = clSetKernelArg(r2kernel, 3, sizeof(cl_int), (void *)&cur_iter);
    printf("iter %d \n", cur_iter);
    ret = clEnqueueNDRangeKernel(command_queue, r2kernel, 1, NULL, &gws, &lws, 0, NULL, &kernelDone);
    // printf("ret %d \n", ret);
    ret = clWaitForEvents(1, &kernelDone);
    // printf("ret %d \n", ret);
}
cur_iter is the current stage of the FFT. My kernel code looks like this:
kernel void radix2(global float2 * x, global float2 * w, int iter, int cur_iter)
{
    int gid = get_global_id(0); // global index of this work-item
    int butterflySize = 1 << (iter-cur_iter-1);
    int butterflyGrpDist = 1 << (iter-cur_iter);
    int butterflyGrpBase = (gid >> (iter-cur_iter-1))*(butterflyGrpDist);
    int butterflyGrpOffset = gid & (butterflySize-1);
    int a = butterflyGrpBase + butterflyGrpOffset;
    int b = a + butterflySize;
    printf("gid %d pass %d, %d, %d ,total iter %d \n", gid, cur_iter, a, b, iter);
    float2 u0 = x[a];
    float2 u1 = x[b];
    float2 tmp;
    DFT2(u0, u1, tmp);
    int waddr = butterflyGrpOffset << cur_iter;
    float2 twiddle = w[waddr];
    MUL(u1, twiddle, tmp);
    x[a] = u0;
    x[b] = u1;
}
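I print gid and cur_iter inside the kernel. I expected to see 4 kernel invocations (work-items) per iteration, since this is an 8-point FFT, but this is what I get: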
iter 0
gid 0 pass 0, 0, 4 ,total iter 3
gid 1 pass 0, 1, 5 ,total iter 3
gid 2 pass 0, 2, 6 ,total iter 3
gid 3 pass 0, 3, 7 ,total iter 3
iter 1
gid 0 pass 0, 0, 4 ,total iter 3
gid 1 pass 0, 1, 5 ,total iter 3
gid 2 pass 0, 2, 6 ,total iter 3
gid 3 pass 0, 3, 7 ,total iter 3
gid 0 pass 0, 0, 4 ,total iter 3
gid 1 pass 0, 1, 5 ,total iter 3
gid 2 pass 0, 2, 6 ,total iter 3
gid 3 pass 0, 3, 7 ,total iter 3
gid 0 pass 1, 0, 2 ,total iter 3
gid 1 pass 1, 1, 3 ,total iter 3
gid 2 pass 1, 4, 6 ,total iter 3
gid 3 pass 1, 5, 7 ,total iter 3
iter 2
gid 0 pass 0, 0, 4 ,total iter 3
gid 1 pass 0, 1, 5 ,total iter 3
gid 2 pass 0, 2, 6 ,total iter 3
gid 3 pass 0, 3, 7 ,total iter 3
gid 0 pass 0, 0, 4 ,total iter 3
gid 1 pass 0, 1, 5 ,total iter 3
gid 2 pass 0, 2, 6 ,total iter 3
gid 3 pass 0, 3, 7 ,total iter 3
gid 0 pass 1, 0, 2 ,total iter 3
gid 1 pass 1, 1, 3 ,total iter 3
gid 2 pass 1, 4, 6 ,total iter 3
gid 3 pass 1, 5, 7 ,total iter 3
gid 0 pass 0, 0, 4 ,total iter 3
gid 1 pass 0, 1, 5 ,total iter 3
gid 2 pass 0, 2, 6 ,total iter 3
gid 3 pass 0, 3, 7 ,total iter 3
gid 0 pass 1, 0, 2 ,total iter 3
gid 1 pass 1, 1, 3 ,total iter 3
gid 2 pass 1, 4, 6 ,total iter 3
gid 3 pass 1, 5, 7 ,total iter 3
gid 0 pass 2, 0, 1 ,total iter 3
gid 2 pass 2, 4, 5 ,total iter 3
gid 3 pass 2, 6, 7 ,total iter 3
gid 1 pass 2, 2, 3 ,total iter 3
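This means that in each iteration the cur_iter passed to my kernel always starts from zero, and the number of kernel invocations is also wrong, even when its value is 2 or 3. I would like to know why. Any help would be greatly appreciated!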
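For comparison, the index pairs each pass should touch can be worked out on the host with plain C, independently of the device output. The following is a minimal sketch that copies the kernel's index arithmetic; the function names butterfly_indices and print_stage are hypothetical and not part of the code above, and an 8-point FFT (iter = 3) with 4 work-items is assumed.

```c
#include <assert.h>
#include <stdio.h>

/* Host-side copy of the kernel's index arithmetic (hypothetical helper).
   total_stages corresponds to the kernel argument `iter`,
   stage to `cur_iter`, and gid to get_global_id(0). */
void butterfly_indices(int total_stages, int stage, int gid, int *a, int *b)
{
    assert(stage >= 0 && stage < total_stages);
    int butterfly_size = 1 << (total_stages - stage - 1);
    int group_dist     = 1 << (total_stages - stage);
    int group_base     = (gid >> (total_stages - stage - 1)) * group_dist;
    int group_offset   = gid & (butterfly_size - 1);
    *a = group_base + group_offset;
    *b = *a + butterfly_size;
}

/* Print the pairs one stage should touch, using the same
   layout as the printf call in the kernel. */
void print_stage(int total_stages, int stage, int work_items)
{
    for (int gid = 0; gid < work_items; gid++) {
        int a, b;
        butterfly_indices(total_stages, stage, gid, &a, &b);
        printf("gid %d pass %d, %d, %d ,total iter %d \n",
               gid, stage, a, b, total_stages);
    }
}
```

Calling print_stage(3, s, 4) for s = 0, 1, 2 gives the pairs (0,4), (1,5), (2,6), (3,7) for pass 0, then (0,2), (1,3), (4,6), (5,7) for pass 1, then (0,1), (2,3), (4,5), (6,7) for pass 2. These are the same pairs that appear in the last four lines of each iteration in the log above.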
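Thank you very much for your reply. My problem is solved: I compared my FFT results against another FFT and they match. I think I should not have used the printf function in my kernel. But this raises another question: is there a way to monitor variables inside a kernel? FYI, both clFinish and clWaitForEvents work. – Jeff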
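On monitoring kernel variables without printf: one common approach is to have each work-item write the values of interest into an extra buffer at index gid, then read that buffer back on the host once the kernel has finished. The sketch below only emulates that pattern in plain C so it can run anywhere. The debug_record type and both function names are hypothetical. In a real kernel, dbg would presumably be an additional global buffer argument, and the loop over gid would be the NDRange itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record type: one slot per work-item. */
typedef struct {
    int gid;
    int cur_iter;
    int waddr;   /* twiddle index, computed as in the kernel above */
} debug_record;

/* Emulates what a single work-item would store in its own slot. */
void record_work_item(debug_record *dbg, int iter, int cur_iter, int gid)
{
    assert(cur_iter >= 0 && cur_iter < iter);
    int butterflySize = 1 << (iter - cur_iter - 1);
    int butterflyGrpOffset = gid & (butterflySize - 1);
    dbg[gid].gid = gid;
    dbg[gid].cur_iter = cur_iter;
    dbg[gid].waddr = butterflyGrpOffset << cur_iter;
}

/* Emulates one kernel launch with gws work-items; on a device
   this loop is replaced by the NDRange. */
void emulate_launch(debug_record *dbg, size_t gws, int iter, int cur_iter)
{
    for (size_t gid = 0; gid < gws; gid++)
        record_work_item(dbg, iter, cur_iter, (int)gid);
}
```

Because every work-item writes only to its own slot, the records do not depend on the order in which work-items run, so the host can inspect them after each pass without relying on interleaved console output.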