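I am very new to OpenCL and am working through the Altera OpenCL examples. In their matrix multiplication example they use the concept of blocks, where the dimensions of the input matrices are multiples of the block size. Here is the code (OpenCL matrix multiplication, Altera example):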
    // BLOCK_SIZE is assumed to be defined at compile time.
    __kernel void matrixMult(// Input and output matrices
                             __global float *restrict C,
                             __global float *A,
                             __global float *B,
                             // Widths of matrices.
                             int A_width, int B_width)
    {
        // Local storage for a block of input matrices A and B
        __local float A_local[BLOCK_SIZE][BLOCK_SIZE];
        __local float B_local[BLOCK_SIZE][BLOCK_SIZE];

        // Block index
        int block_x = get_group_id(0);
        int block_y = get_group_id(1);

        // Local ID index (offset within a block)
        int local_x = get_local_id(0);
        int local_y = get_local_id(1);

        // Compute loop bounds
        int a_start = A_width * BLOCK_SIZE * block_y;
        int a_end = a_start + A_width - 1;
        int b_start = BLOCK_SIZE * block_x;

        float running_sum = 0.0f;

        // Walk across a row of blocks in A and down a column of blocks in B
        for (int a = a_start, b = b_start; a <= a_end; a += BLOCK_SIZE, b += (BLOCK_SIZE * B_width))
        {
            // Each work-item loads one element of each block;
            // the block of B is stored transposed.
            A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
            B_local[local_x][local_y] = B[b + B_width * local_y + local_x];

            // Wait until every work-item in the group has stored its elements
            barrier(CLK_LOCAL_MEM_FENCE);

            #pragma unroll
            for (int k = 0; k < BLOCK_SIZE; ++k)
            {
                running_sum += A_local[local_y][k] * B_local[local_x][k];
            }

            // Wait until the block is fully consumed before it is overwritten
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        // Store result in matrix C
        C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = running_sum;
    }
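A sequential C sketch can make the kernel's indexing easier to follow. The version below is not Altera's code: `emulate_group`, `emulate_matrix_mult`, and `reference_mult` are hypothetical helper names, `BLOCK_SIZE` is fixed at 2 for illustration, and `get_global_size(0)` is assumed to equal `B_width` (the width of C). Each work-group is emulated by looping over every `(local_x, local_y)` pair, with the load phase finished for all work-items before the compute phase starts, which is the ordering a work-group barrier would guarantee.

```c
#include <assert.h>

#define BLOCK_SIZE 2 /* assumed for illustration only */

/* Hypothetical emulation of ONE work-group (block_x, block_y).
 * A is A_height x A_width, B is A_width x B_width, C is A_height x B_width,
 * all stored row-major. */
void emulate_group(float *C, const float *A, const float *B,
                   int A_width, int B_width, int block_x, int block_y)
{
    float A_local[BLOCK_SIZE][BLOCK_SIZE];
    float B_local[BLOCK_SIZE][BLOCK_SIZE];
    /* One private running_sum per emulated work-item. */
    float running_sum[BLOCK_SIZE][BLOCK_SIZE] = {{0.0f}};

    int a_start = A_width * BLOCK_SIZE * block_y;
    int a_end = a_start + A_width - 1;
    int b_start = BLOCK_SIZE * block_x;

    for (int a = a_start, b = b_start; a <= a_end;
         a += BLOCK_SIZE, b += BLOCK_SIZE * B_width) {
        /* Load phase: each work-item writes exactly one element per block. */
        for (int local_y = 0; local_y < BLOCK_SIZE; ++local_y)
            for (int local_x = 0; local_x < BLOCK_SIZE; ++local_x) {
                A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
                B_local[local_x][local_y] = B[b + B_width * local_y + local_x];
            }
        /* Compute phase: each work-item reads a full row of both blocks. */
        for (int local_y = 0; local_y < BLOCK_SIZE; ++local_y)
            for (int local_x = 0; local_x < BLOCK_SIZE; ++local_x)
                for (int k = 0; k < BLOCK_SIZE; ++k)
                    running_sum[local_y][local_x] +=
                        A_local[local_y][k] * B_local[local_x][k];
    }

    /* Store phase: global id = group id * BLOCK_SIZE + local id. */
    for (int local_y = 0; local_y < BLOCK_SIZE; ++local_y)
        for (int local_x = 0; local_x < BLOCK_SIZE; ++local_x) {
            int global_x = block_x * BLOCK_SIZE + local_x;
            int global_y = block_y * BLOCK_SIZE + local_y;
            C[global_y * B_width + global_x] = running_sum[local_y][local_x];
        }
}

/* Emulates the whole NDRange by visiting every work-group in turn. */
void emulate_matrix_mult(float *C, const float *A, const float *B,
                         int A_height, int A_width, int B_width)
{
    assert(A_height % BLOCK_SIZE == 0);
    assert(A_width % BLOCK_SIZE == 0);
    assert(B_width % BLOCK_SIZE == 0);
    for (int block_y = 0; block_y < A_height / BLOCK_SIZE; ++block_y)
        for (int block_x = 0; block_x < B_width / BLOCK_SIZE; ++block_x)
            emulate_group(C, A, B, A_width, B_width, block_x, block_y);
}

/* Plain triple-loop product, used only to cross-check the emulation. */
void reference_mult(float *C, const float *A, const float *B,
                    int A_height, int A_width, int B_width)
{
    for (int i = 0; i < A_height; ++i)
        for (int j = 0; j < B_width; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < A_width; ++k)
                sum += A[i * A_width + k] * B[k * B_width + j];
            C[i * B_width + j] = sum;
        }
}
```

As a usage example, with a 2x4 matrix A = {1, ..., 8} and a 4x2 matrix B = {1, ..., 8}, calling `emulate_matrix_mult(C, A, B, 2, 4, 2)` fills C with {50, 60, 114, 140}, the same values `reference_mult` produces.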
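Suppose the block size is 2. Then block_x and block_y are both 0, and local_x and local_y are both 0.
In that case A_local[0][0] would be A[0] and B_local[0][0] would be B[0].
A_local and B_local each hold 4 elements.
So how do A_local and B_local get the other elements of the block in that iteration?
Also, is a separate thread/core assigned to each (local_x, local_y) pair?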
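For the block-size-2 case above, the load indices can be tabulated per work-item. The sketch below is only illustrative: `LoadIndices`, `load_indices`, and `print_first_block` are hypothetical names, and the matrix widths passed in are assumptions. It reflects that each `(local_x, local_y)` pair is a distinct work-item with its own local ID, so the four work-items of a group fill the four slots of `A_local` and `B_local` between them. How work-items map onto threads or hardware is left to the OpenCL implementation, so the code says nothing about cores.

```c
#include <stdio.h>

#define BLOCK_SIZE 2 /* matches the block size assumed in the question */

/* Global indices one work-item reads while loading a block. */
typedef struct {
    int a_index; /* stored into A_local[local_y][local_x] */
    int b_index; /* stored into B_local[local_x][local_y] */
} LoadIndices;

/* Same index arithmetic as the two load statements in the kernel. */
LoadIndices load_indices(int a, int b, int A_width, int B_width,
                         int local_x, int local_y)
{
    LoadIndices result;
    result.a_index = a + A_width * local_y + local_x;
    result.b_index = b + B_width * local_y + local_x;
    return result;
}

/* Prints what each work-item of group (0, 0) loads in the first iteration,
 * where a = a_start = 0 and b = b_start = 0. */
void print_first_block(int A_width, int B_width)
{
    for (int local_y = 0; local_y < BLOCK_SIZE; ++local_y)
        for (int local_x = 0; local_x < BLOCK_SIZE; ++local_x) {
            LoadIndices idx =
                load_indices(0, 0, A_width, B_width, local_x, local_y);
            printf("work-item (local_x=%d, local_y=%d): "
                   "A_local[%d][%d] = A[%d], B_local[%d][%d] = B[%d]\n",
                   local_x, local_y,
                   local_y, local_x, idx.a_index,
                   local_x, local_y, idx.b_index);
        }
}
```

With `A_width = B_width = 4`, for instance, the work-item with `local_x = 1` and `local_y = 1` loads `A[5]` and `B[5]`, while the work-item with both IDs equal to 0 loads `A[0]` and `B[0]` as described in the question.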