我在使用AMD Radeon HD 6850使用LDS内存时遇到性能问题。OpenCL中LDS内存的性能问题
我有两个内核作为N粒子模拟的一部分。每个工作单元必须根据与其他粒子的相对位置计算作用于相应粒子的力。该问题的核心是:
#define UNROLL_FACTOR 8
//Vernet velocity part kernel
__kernel void kernel_velocity(const float deltaTime,
__global const float4 *pos,
__global float4 *vel,
__global float4 *accel,
__local float4 *pblock,
const float bound)
{
const int gid = get_global_id(0); //global id of work item
const int id = get_local_id(0); //local id of work item within work group
const int s_wg = get_local_size(0); //work group size
const int n_wg = get_num_groups(0); //number of work groups
const float4 myPos = pos[gid];
const float4 myVel = vel[gid];
const float4 dt = (float4)(deltaTime, deltaTime, 0.0f, 0.0f);
float4 acc = (float4)0.0f;
for (int jw = 0; jw < n_wg; ++jw)
{
pblock[id] = pos[jw * s_wg + id]; //cache a particle position; position in array: workgroup no. * size of workgroup + local id
barrier (CLK_LOCAL_MEM_FENCE); //wait for others in the work group
for (int i = 0; i < s_wg;)
{
#pragma unroll UNROLL_FACTOR
for (int j = 0; j < UNROLL_FACTOR; ++j, ++i)
{
float4 r = myPos - pblock[i];
float rSizeSquareInv = native_recip (r.x*r.x + r.y*r.y + 0.0001f);
float rSizeSquareInvDouble = rSizeSquareInv * rSizeSquareInv;
float rSizeSquareInvQuadr = rSizeSquareInvDouble * rSizeSquareInvDouble;
float rSizeSquareInvHept = rSizeSquareInvQuadr * rSizeSquareInvDouble * rSizeSquareInv;
acc += r * (2.0f * rSizeSquareInvHept - rSizeSquareInvQuadr);
}
}
barrier(CLK_LOCAL_MEM_FENCE);
}
acc *= 24.0f/myPos.w;
//update velocity only
float4 newVel = myVel + 0.5f * dt * (accel[gid] + acc);
//write to global memory
vel[gid] = newVel;
accel[gid] = acc;
}
模拟运行在成果方面很好,但是使用本地内存来缓存颗粒的位置,以减轻从全局内存读取大量时的问题是性能。其实如果线路
float4 r = myPos - pblock[i];
由
float4 r = myPos - pos[jw * s_wg + i];
更换内核的运行速度更快。我真的不明白,因为从全球读书应该比从当地读书慢得多。
此外,当线
float4 r = myPos - pblock[i];
被完全去除和所有以下的r
OCCURENCES由myPos - pblock[i]
替换,速度和以前一样,就好像线是不存在的。这并不是因为访问r
中的私有内存应该是最快的,但编译器以某种方式“优化”了这条线。
全局工作大小为4608,本地工作大小为192.它在Ubuntu 12.04中使用AMD APP SDK v2.9和Catalyst驱动程序13.12运行。
任何人都可以帮助我吗?这是我的错,还是GPU /驱动程序/ ...的问题?或者它是一个功能? :-)
障碍:这也是我的想法,但消除障碍并没有改善性能,并导致结果无意义。 –
至于基准:感谢链接。如果结果是正确的,那么,omg,这个GPU在内存延迟方面真的很糟糕! –