How does the OpenCL transpose kernel use get_local_id?

The code below is taken from a sample. I built a project with it and it works, but there are some parts I don't understand.

For the purposes of this example, say I have a 32x32 matrix and 36 work-items, so I assume get_global_id(0) runs from 0 to 35, and size = MATRIX_DIM/4 = 8.

__kernel void transpose(__global float4 *g_mat, 
    __local float4 *l_mat, uint size) { 

    __global float4 *src, *dst; 

    /* Determine row and column location */ 
    int col = get_global_id(0); 
    int row = 0; 
    while(col >= size) { 
     col -= size--; 
     row++; 
    } 
    col += row; 
    size += row; 

    /* Read source block into local memory */ 
    src = g_mat + row * size * 4 + col; 
    l_mat += get_local_id(0)*8;

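In the clEnqueueNDRangeKernel call, the local_work_size argument is set to NULL. According to the manual, that leaves it to the OpenCL implementation to pick the work-group size: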

local_work_size can also be a NULL value in which case the OpenCL implementation will determine how to be break the global work-items into appropriate work-group instances.

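But I don't understand the multiplication by 8, which I think gives an address offset into the work-group's local memory. Can someone explain this? The kernel continues: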

    l_mat[0] = src[0]; 
    l_mat[1] = src[size]; 
    l_mat[2] = src[2*size]; 
    l_mat[3] = src[3*size]; 

    /* Process block on diagonal */ 
    if(row == col) { 
     src[0] = 
     (float4)(l_mat[0].x, l_mat[1].x, l_mat[2].x, l_mat[3].x); 
     src[size] = 
     (float4)(l_mat[0].y, l_mat[1].y, l_mat[2].y, l_mat[3].y); 
     src[2*size] = 
     (float4)(l_mat[0].z, l_mat[1].z, l_mat[2].z, l_mat[3].z); 
     src[3*size] = 
     (float4)(l_mat[0].w, l_mat[1].w, l_mat[2].w, l_mat[3].w); 
    } 
    /* Process block off diagonal */ 
    else { 
     /* Read destination block into local memory */ 
     dst = g_mat + col * size * 4 + row; 
     l_mat[4] = dst[0]; 
     l_mat[5] = dst[size]; 
     l_mat[6] = dst[2*size]; 
     l_mat[7] = dst[3*size]; 

     /* Set elements of destination block */ 
     dst[0] = 
     (float4)(l_mat[0].x, l_mat[1].x, l_mat[2].x, l_mat[3].x); 
     dst[size] = 
     (float4)(l_mat[0].y, l_mat[1].y, l_mat[2].y, l_mat[3].y); 
     dst[2*size] = 
     (float4)(l_mat[0].z, l_mat[1].z, l_mat[2].z, l_mat[3].z); 
     dst[3*size] = 
     (float4)(l_mat[0].w, l_mat[1].w, l_mat[2].w, l_mat[3].w); 

     /* Set elements of source block */ 
     src[0] = 
     (float4)(l_mat[4].x, l_mat[5].x, l_mat[6].x, l_mat[7].x); 
     src[size] = 
     (float4)(l_mat[4].y, l_mat[5].y, l_mat[6].y, l_mat[7].y); 
     src[2*size] = 
     (float4)(l_mat[4].z, l_mat[5].z, l_mat[6].z, l_mat[7].z); 
     src[3*size] = 
     (float4)(l_mat[4].w, l_mat[5].w, l_mat[6].w, l_mat[7].w); 
    } 
}
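
To see how the loop at the top of the kernel turns a global ID into block coordinates, it can help to trace it on the host. The following is only a sketch in plain C: block_coords is a hypothetical helper name, gid stands in for get_global_id(0), and the assertion encodes the assumption that there is one work-item per block on or above the diagonal.

```c
#include <assert.h>

/* Host-side copy of the kernel's row/column loop, for tracing only.
 * gid stands in for get_global_id(0); size is the matrix width in 4x4 blocks. */
static void block_coords(unsigned gid, unsigned size,
                         unsigned *row, unsigned *col) {
    /* Assumed launch shape: one work-item per upper-triangle block. */
    assert(gid < size * (size + 1) / 2);

    unsigned c = gid;
    unsigned r = 0;
    while (c >= size) {
        c -= size--;  /* step past one row; each row is one block shorter */
        r++;
    }
    *row = r;
    *col = c + r;     /* shift right so that col >= row */
}
```

With size = 8, IDs 0 to 7 map to blocks (0,0) through (0,7), ID 8 maps to (1,1), and ID 35 maps to (7,7). That gives 8 + 7 + ... + 1 = 36 blocks, which matches the 36 work-items in the question.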

2012-11-07 JDS

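l_mat is being used as local storage for the threads in the work-group. It is used specifically because access to local memory is typically much faster than access to global memory.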

Each thread needs 8 float4s. The following pointer arithmetic

l_mat += get_local_id(0)*8;

moves the l_mat pointer for each thread so that its region does not overlap with any other thread's data.
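
As a rough illustration of why the slices do not collide, here is a plain C sketch of the index ranges involved. The names slice_first, slice_last and check_slices_disjoint are made up for this example, and local_id stands in for get_local_id(0).

```c
#include <assert.h>
#include <stddef.h>

#define FLOAT4S_PER_ITEM 8  /* the 8 in get_local_id(0)*8 */

/* First float4 slot of a work-item's slice: where l_mat points after the +=. */
static size_t slice_first(size_t local_id) {
    return local_id * FLOAT4S_PER_ITEM;
}

/* Last float4 slot of the slice: what the kernel addresses as l_mat[7]. */
static size_t slice_last(size_t local_id) {
    return slice_first(local_id) + FLOAT4S_PER_ITEM - 1;
}

/* Each slice starts exactly one slot after the previous one ends,
 * so no two work-items in the group touch the same float4. */
static void check_slices_disjoint(size_t local_size) {
    for (size_t id = 0; id + 1 < local_size; id++) {
        assert(slice_first(id + 1) == slice_last(id) + 1);
    }
}
```

Work-item 0 uses slots 0 to 7, work-item 1 uses slots 8 to 15, and so on.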

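This could potentially cause an error: because local_size is not specified, we cannot be sure that l_mat is large enough to hold the values for every thread.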
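
A minimal sketch of the size calculation, assuming the buffer has to hold 8 float4s of 16 bytes each for every work-item in a group. local_bytes_needed is a hypothetical helper; the sample's actual host code is not shown here, so how it sizes the buffer is unknown.

```c
#include <assert.h>
#include <stddef.h>

#define FLOAT4_BYTES 16      /* an OpenCL float4 is 4 floats of 4 bytes */
#define FLOAT4S_PER_ITEM 8   /* slots each work-item uses in l_mat */

/* Bytes of local memory required for a work-group of local_size items.
 * On the host this value would be the arg_size passed to
 * clSetKernelArg(kernel, 1, bytes, NULL) for the __local argument. */
static size_t local_bytes_needed(size_t local_size) {
    assert(local_size > 0);
    return local_size * FLOAT4S_PER_ITEM * FLOAT4_BYTES;
}
```

For example, a work-group of 36 items would need 36 * 8 * 16 = 4608 bytes. Since a NULL local_work_size leaves the group size up to the implementation, one cautious option is to pass an explicit local_work_size, or to size the buffer for the value that clGetKernelWorkGroupInfo reports for CL_KERNEL_WORK_GROUP_SIZE.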

2012-11-07 18:25:32 KLee1

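"Each thread needs 8 float4s": you mean the way it goes from l_mat[0] to l_mat[7] when reading into local memory, right? That makes sense... but I thought each float4 is 16 bytes and our memory is byte-addressable. Wait a moment, l_mat is declared as a pointer to float4s, so I guess the alignment is taken care of. Well, if that's all there is to it then I get it, but this is tricky stuff! – JDS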

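@YoungMoney: There is no alignment issue here. This is just ordinary C pointer arithmetic, which completely abstracts away the size in bytes of the underlying element. – Pragmateek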
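
The point about pointer arithmetic can be checked in plain C. This is only a sketch: float4_t is a host-side stand-in for OpenCL's float4, and bytes_advanced is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side stand-in for OpenCL's float4 (an assumption for illustration). */
typedef struct { float x, y, z, w; } float4_t;

/* Byte distance covered by advancing a float4 pointer by n elements. */
static size_t bytes_advanced(size_t n) {
    static float4_t buf[64];
    assert(n <= 64);
    float4_t *p = buf + n;  /* counted in elements, like l_mat += n */
    return (size_t)((char *)p - (char *)buf);  /* measured in bytes */
}
```

bytes_advanced(8) returns 8 * sizeof(float4_t), which is 128 on platforms where float is 4 bytes. The compiler applies the element size, so the kernel never has to multiply by 16 itself.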

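l_mat is used as a temporary buffer that holds, for every work-item, the two matrix blocks to be swapped. Each work-item therefore needs to store 2 * 4 float4s, hence: offset = get_local_id(0) * 2 * 4 = get_local_id(0) * 8.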

2012-11-07 18:20:32 Pragmateek
