Contiguous memory allocation on the GPU

Does cudaMalloc allocate contiguous blocks of memory (i.e., physical bytes adjacent to each other)?

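I have a piece of CUDA code that uses 32 threads to copy 128 bytes from global device memory into shared memory. I am trying to find a way to guarantee that this transfer completes in a single 128-byte memory transaction. If cudaMalloc allocates contiguous blocks of memory, this would be easy to do.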

Here is the code:

#include <iostream>
#include <cuda_runtime.h>

using namespace std;

#define SIZE 32        // number of array elements to stage in shared memory
#define NUMTHREADS 32  // one warp, one thread per element

// Each thread copies one 4-byte word from global memory into shared memory.
__global__ void copy(unsigned int* memPointer) {
    extern __shared__ unsigned int bits[];
    int tid = threadIdx.x;

    bits[tid] = memPointer[tid];
}

int main() {
    unsigned int inputData[SIZE];
    unsigned int* storedData;
    for (int i = 0; i < SIZE; i++) {
        inputData[i] = i;
    }
    cudaError_t e1 = cudaMalloc((void**) &storedData, sizeof(unsigned int) * SIZE);
    if (e1 == cudaSuccess) {
        cudaError_t e3 = cudaMemcpy(storedData, inputData, sizeof(unsigned int) * SIZE, cudaMemcpyHostToDevice);
        if (e3 == cudaSuccess) {
            // Third launch parameter: bytes of dynamic shared memory per block.
            copy<<<1, NUMTHREADS, SIZE * sizeof(unsigned int)>>>(storedData);
        } else {
            cout << "Failed to copy" << " " << e3 << endl;
        }
        // Free the buffer whether or not the copy succeeded.
        cudaError_t e6 = cudaFree(storedData);
        if (e6 != cudaSuccess) {
            cout << "Error freeing memory storedData " << e6 << endl;
        }
    } else {
        cout << "Failed to allocate memory" << " " << e1 << endl;
    }
    return 0;
}

Source

2012-07-02 gmemon

What purpose is this kernel supposed to serve? – talonmies

It is part of a larger program in which I perform some operations on the data. I am trying to optimize the individual parts of the code. – gmemon

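If the 128-byte block is 128-byte aligned, then this will be done in one transaction. NVIDIA GPUs have an MMU that is independent of the CPU's MMU. All GPU memory operations go through the GPU virtual address space. There is no guarantee that a block larger than a cache line is physically contiguous. –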
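The alignment condition in the comment above can be checked directly on the pointer that cudaMalloc returns. The following is only a minimal sketch: isAligned is a hypothetical helper written for this illustration, not a CUDA API, and the 32-word buffer mirrors the question. As far as I know, the CUDA programming guide states that addresses returned by the allocation routines are aligned to at least 256 bytes, so the base pointer should pass the 128-byte check, while a pointer offset by one word should not.

```cuda
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): true if p is a multiple of `alignment` bytes.
static bool isAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

int main() {
    unsigned int* d = nullptr;
    cudaError_t err = cudaMalloc(reinterpret_cast<void**>(&d), 32 * sizeof(unsigned int));
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The base address of the allocation.
    std::printf("base pointer 128-byte aligned: %s\n", isAligned(d, 128) ? "yes" : "no");

    // The address is only inspected here, never dereferenced. A warp reading
    // 32 words starting at an address like this would straddle two
    // 128-byte segments.
    std::printf("base + 1 word 128-byte aligned: %s\n", isAligned(d + 1, 128) ? "yes" : "no");

    cudaFree(d);
    return 0;
}
```

If the kernel indexes from the start of the allocation, as in the question, the 32 four-byte reads of one warp fall inside a single aligned 128-byte segment.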

Yes, cudaMalloc allocates contiguous blocks of memory. The "Matrix Transpose" sample in the SDK (http://developer.nvidia.com/cuda-cc-sdk-code-samples) has a kernel called "copySharedMem" that does almost exactly what you describe.
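For readers without the SDK at hand, here is a rough sketch of the same idea. It is not the SDK's copySharedMem kernel; copyThroughShared and countMismatches are hypothetical names made up for this illustration. Each thread stages one word from global memory into shared memory and writes it back out, so the host can verify that the data survived the round trip.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE 32  // one warp, one 4-byte word per thread (128 bytes total)

// Hypothetical kernel: stage the input in shared memory, then write it back.
__global__ void copyThroughShared(const unsigned int* in, unsigned int* out) {
    extern __shared__ unsigned int tile[];
    int tid = threadIdx.x;
    tile[tid] = in[tid];   // consecutive threads read consecutive words
    __syncthreads();       // make the whole tile visible to the block
    out[tid] = tile[tid];
}

// Hypothetical host helper: returns the number of mismatched words, or -1 on a CUDA error.
static int countMismatches() {
    unsigned int h_in[SIZE], h_out[SIZE];
    for (int i = 0; i < SIZE; ++i) { h_in[i] = i; h_out[i] = 0; }

    const std::size_t bytes = SIZE * sizeof(unsigned int);
    unsigned int *d_in = nullptr, *d_out = nullptr;
    cudaError_t err = cudaMalloc(&d_in, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_out, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        // Third launch parameter: bytes of dynamic shared memory per block.
        copyThroughShared<<<1, SIZE, bytes>>>(d_in, d_out);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);   // cudaFree(nullptr) is a harmless no-op
    cudaFree(d_out);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return -1;
    }

    int mismatches = 0;
    for (int i = 0; i < SIZE; ++i) {
        if (h_out[i] != h_in[i]) ++mismatches;
    }
    return mismatches;
}

int main() {
    int mismatches = countMismatches();
    std::printf("mismatched words: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Writing the tile back to global memory also keeps the compiler from discarding the shared-memory store, which can happen when a kernel's result is never used.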

Source

2012-07-02 17:11:19 azaghal
