CUDA内核不能启动CudaDeviceSynchronize
之前我有一些麻烦并行CUDA。看看附带的图像。内核在0.395秒的标记点处启动。然后有一些绿色的CpuWork。最后,调用cudaDeviceSynchronize。在CpuWork之前启动的内核在同步调用之前不启动。理想情况下,它应该与CPU工作并行运行。
void KdTreeGpu::traceRaysOnGpuAsync(int firstRayIndex, int numRays, int rank, int buffer)
{
int per_block = 128;
int num_blocks = numRays/per_block + (numRays%per_block==0?0:1);
Ray* rays = &this->deviceRayPtr[firstRayIndex];
int* outputHitPanelIds = &this->deviceHitPanelIdPtr[firstRayIndex];
kdTreeTraversal<<<num_blocks, per_block, 0>>>(sceneBoundingBox, rays, deviceNodesPtr, deviceTrianglesListPtr,
firstRayIndex, numRays, rank, rootNodeIndex,
deviceTHitPtr, outputHitPanelIds, deviceReflectionPtr);
CUDA_VALIDATE(cudaMemcpyAsync(resultHitDistances[buffer], deviceTHitPtr, numRays*sizeof(double), cudaMemcpyDeviceToHost));
CUDA_VALIDATE(cudaMemcpyAsync(resultHitPanelIds[buffer], outputHitPanelIds, numRays*sizeof(int), cudaMemcpyDeviceToHost));
CUDA_VALIDATE(cudaMemcpyAsync(resultReflections[buffer], deviceReflectionPtr, numRays*sizeof(Vector3), cudaMemcpyDeviceToHost));
}
该memcopies是异步的。结果缓冲区像这样分配
unsigned int flag = cudaHostAllocPortable;
CUDA_VALIDATE(cudaHostAlloc(&resultHitPanelIds[0], MAX_RAYS_PER_ITERATION*sizeof(int), flag));
CUDA_VALIDATE(cudaHostAlloc(&resultHitPanelIds[1], MAX_RAYS_PER_ITERATION*sizeof(int), flag));
希望能找到解决方案。已经尝试了很多东西,包括没有在默认流中运行。当我添加cudaHostAlloc我认识到异步方法返回到CPU。但是,当内核在稍后的deviceSynchronize调用之前不启动时,这没有帮助。
resultHitDistances[2]
包含两个分配的内存区域,以便当CPU读取0时,GPU应该把结果在1
谢谢!
编辑:这是调用traceRaysAsync的代码。
int numIterations = ceil(float(this->numPrimaryRays)/MAX_RAYS_PER_ITERATION);
int numRaysPrevious = min(MAX_RAYS_PER_ITERATION, this->numPrimaryRays);
nvtxRangePushA("traceRaysOnGpuAsync First");
traceRaysOnGpuAsync(0, numRaysPrevious, rank, 0);
nvtxRangePop();
for(int iteration = 0; iteration < numIterations; iteration++)
{
int rayFrom = (iteration+1)*MAX_RAYS_PER_ITERATION;
int rayTo = min((iteration+2)*MAX_RAYS_PER_ITERATION, this->numPrimaryRays) - 1;
int numRaysIteration = rayTo-rayFrom+1;
// Wait for results to finish and get them
waitForGpu();
// Trace the next iteration asynchronously. This will have data prepared for next iteration
if(numRaysIteration > 0)
{
int nextBuffer = (iteration+1) % 2;
nvtxRangePushA("traceRaysOnGpuAsync Interior");
traceRaysOnGpuAsync(rayFrom, numRaysIteration, rank, nextBuffer);
nvtxRangePop();
}
nvtxRangePushA("CpuWork");
// Store results for current iteration
int rayOffset = iteration*MAX_RAYS_PER_ITERATION;
int buffer = iteration % 2;
for(int i = 0; i < numRaysPrevious; i++)
{
if(this->activeRays[rayOffset+i] && resultHitPanelIds[buffer][i] >= 0)
{
this->activeRays[rayOffset+i] = false;
const TrianglePanelPair & t = this->getTriangle(resultHitPanelIds[buffer][i]);
double hitT = resultHitDistances[buffer][i];
Vector3 reflectedDirection = resultReflections[buffer][i];
Result res = Result(rays[rayOffset+i], hitT, t.panel);
results[rank].push_back(res);
t.panel->incrementIntensity(1.0);
if (t.panel->getParent().absorbtion < 1)
{
numberOfRaysGenerated++;
Ray reflected (res.endPoint() + 0.00001*reflectedDirection, reflectedDirection);
this->newRays[rayOffset+i] = reflected;
this->activeRays[rayOffset+i] = true;
numNewRays++;
}
}
}
numRaysPrevious = numRaysIteration;
nvtxRangePop();
}
您在KdTreeGpu :: traceRaysOnGpuAsync调用后没有显示代码,但这可能很有用,例如查看您在何处以及为什么使用cudaDeviceSynchronize()调用?我认为你在调用KdTreeGpu :: traceRaysOnGpuAsync后立即发出devicesync,但这会消除你的重叠。这是您想要重叠的区域,并假设第二个绿色CpuWork栏不依赖于kdTreeTraversal的结果,那么您希望移动或消除您的内核函数调用后的deviceync。在设备同步之前重新考虑一些CpuWork *。 –
我添加了一些已经清除了一些定时器的代码,所以应该更容易遵循。这两个缓冲区应该使CpuWork独立于内核启动。 – apartridge