2016-12-22

Problem: For my two expected outputs in MATLAB + CUDA I get two computed arrays, and the secondary output contains random values:

  1. The correctly computed output
  2. Random numbers, old values, or values from the other array

I am using MATLAB R2016b with this CUDA version + GPU:

CUDADevice with properties: 

        Name: 'GeForce GT 525M' 
       Index: 1 
    ComputeCapability: '2.1' 
     SupportsDouble: 1 
     DriverVersion: 8 
     ToolkitVersion: 7.5000 
    MaxThreadsPerBlock: 1024 
     MaxShmemPerBlock: 49152 
    MaxThreadBlockSize: [1024 1024 64] 
      MaxGridSize: [65535 65535 65535] 
      SIMDWidth: 32 
      TotalMemory: 1.0737e+09 
     AvailableMemory: 947929088 
    MultiprocessorCount: 2 
      ClockRateKHz: 1200000 
      ComputeMode: 'Default' 
    GPUOverlapsTransfers: 1 
KernelExecutionTimeout: 1 
     CanMapHostMemory: 1 
     DeviceSupported: 1 
     DeviceSelected: 1 
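
For reference, the listing above looks like the output of MATLAB's gpuDevice command; a minimal sketch of how such a report can be reproduced (assuming the Parallel Computing Toolbox is installed):

% Query the currently selected GPU; displaying the returned object prints
% a CUDADevice properties report like the one above.
d = gpuDevice;
disp(d)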

I am now trying to add and subtract two different arrays on the GPU and return the results to MATLAB.

MATLAB code:

n = 10; 
as = [1,1,1]; 
bs = [10,10,10]; 

for i = 2:n+1 
    as(end+1,:) = [i,i,i]; 
    bs(end+1,:) = [10,10,10]; 
end 
as = as *1; 

% Load the kernel 
cudaFilename = 'add2.cu'; 
ptxFilename = ['add2.ptx']; 

% Check that the kernel files are present
if ~(exist(cudaFilename, 'file') == 2 && exist(ptxFilename, 'file') == 2)
    error('CUDA FILES ARE NOT HERE');
end
kernel = parallel.gpu.CUDAKernel(ptxFilename, cudaFilename); 

% Make sure we have sufficient blocks to cover all of the locations 
kernel.ThreadBlockSize = [kernel.MaxThreadsPerBlock,1,1]; 
kernel.GridSize = [ceil(n/kernel.MaxThreadsPerBlock),1]; 

% Call the kernel 
outadd = zeros(n,1, 'single'); 
outminus = zeros(n,1, 'single'); 
[outadd, outminus] = feval(kernel, outadd,outminus, as, bs); 
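
(As an aside: as and bs end up as (n+1)-by-3 double matrices while the outputs are n-by-1 single vectors; feval will cast the inputs to the kernel's float* arguments, but the sizes do not line up. Below is a minimal sketch of a size-matched call, using hypothetical vectors a and b that are not part of the original post.)

% Hypothetical size-matched inputs: n-by-1 single vectors, so that every
% kernel thread reads and writes an element that actually exists.
a = single((1:n)');           % analogous to the first column of as
b = single(10*ones(n,1));     % analogous to bs
[outadd, outminus] = feval(kernel, outadd, outminus, a, b);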

CUDA snippet:

#include "cuda_runtime.h" 
#include "add_wrapper.hpp" 
#include <stdio.h> 

__device__ size_t calculateGlobalIndex() { 
    // Which block are we? 
    size_t const globalBlockIndex = blockIdx.x + blockIdx.y * gridDim.x; 
    // Which thread are we within the block? 
    size_t const localThreadIdx = threadIdx.x + blockDim.x * threadIdx.y; 
    // How big is each block? 
    size_t const threadsPerBlock = blockDim.x*blockDim.y; 
    // Which thread are we overall? 
    return localThreadIdx + globalBlockIndex*threadsPerBlock; 
} 

__global__ void addKernel(float *c, float *d, const float *a, const float *b) 
{ 
    int i = calculateGlobalIndex(); 
    c[i] = a[i] + b[i]; 
    d[i] = a[i] - b[i]; 
} 

// C = A + B 
// D = A - B 
void addWithCUDA(float *cpuC, float *cpuD, const float *cpuA, const float *cpuB, const size_t sz)
{
    // TODO: add error checking

    // choose which GPU to run on
    cudaSetDevice(0);

    // allocate GPU buffers
    float *gpuA, *gpuB, *gpuC, *gpuD;
    cudaMalloc((void**)&gpuA, sz*sizeof(float));
    cudaMalloc((void**)&gpuB, sz*sizeof(float));
    cudaMalloc((void**)&gpuC, sz*sizeof(float));
    cudaMalloc((void**)&gpuD, sz*sizeof(float));
    cudaCheckErrors("cudaMalloc fail");

    // copy input vectors from host memory to GPU buffers
    cudaMemcpy(gpuA, cpuA, sz*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(gpuB, cpuB, sz*sizeof(float), cudaMemcpyHostToDevice);

    // launch kernel on the GPU with one thread per element
    addKernel<<<1, sz>>>(gpuC, gpuD, gpuA, gpuB);

    // wait for the kernel to finish
    cudaDeviceSynchronize();

    // copy output vectors from GPU buffers to host memory
    cudaMemcpy(cpuC, gpuC, sz*sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(cpuD, gpuD, sz*sizeof(float), cudaMemcpyDeviceToHost);

    // cleanup
    cudaFree(gpuA);
    cudaFree(gpuB);
    cudaFree(gpuC);
    cudaFree(gpuD);
}

void resetDevice() 
{ 
    cudaDeviceReset(); 
} 

After running the code, [outadd, outminus] are two gpuArray objects in MATLAB.

Outadd is always computed correctly; outminus is seldom correct and mostly contains random integers or floats, zeros, or sometimes even values from outadd.

If I swap the order of the arithmetic operations, it works the other way around, so shouldn't "outminus" be computed correctly as well?


Welcome to Stack Overflow. You seem to have forgotten to ask a question. Questions are indicated with a question mark (?) and can receive answers. Please [edit] your post to include one, because otherwise it looks quite good! – Adriaan


'kernel.MaxThreadsPerBlock' is 1024. Since 'n' is 10, your kernel launches one block of 1024 threads even though you only need 10. The extra threads will probably access your arrays out of bounds, so you should pass 'n' as a scalar parameter to the kernel and test 'i' against 'n' inside the kernel. You may want to study [the MATLAB example here](https://www.mathworks.com/help/distcomp/examples/illustrating-three-approaches-to-gpu-computing-the-mandelbrot-set.html) for this. –


@Robert Crovella I think I will stick with this setup for now and rework it with a limit on the threads. Thank you! – Jeahinator

Answer


Following @Robert Crovella's hint that the unnecessary threads may cause access errors, I simply added a bound check for the threads.

MATLAB

[outadd, outminus] = feval(kernel, outadd,outminus, as, bs, n); 

CUDA kernel:

__global__ void addKernel(float *c, float *d, const float *a, const float *b, const float n)
{
    int i = calculateGlobalIndex();
    if (i < n) {
        c[i] = a[i] + b[i];
        d[i] = a[i] - b[i];
    }
}

I don't think this is the optimal solution yet, because the GPU still launches all of the threads even though most of them do no work and should not be taking up resources.
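
One way to avoid most of the idle threads, sketched here only as an illustration (assuming a 1-D launch is enough for this problem), is to size the block to min(n, kernel.MaxThreadsPerBlock) on the MATLAB side instead of always using the maximum:

% Hypothetical launch configuration that matches the thread count to n
threadsPerBlock = min(n, kernel.MaxThreadsPerBlock);   % 10 instead of 1024 for n = 10
kernel.ThreadBlockSize = [threadsPerBlock, 1, 1];
kernel.GridSize = [ceil(n/threadsPerBlock), 1];
[outadd, outminus] = feval(kernel, outadd, outminus, as, bs, n);

The bound check in the kernel is still useful as a safety net, because GridSize*ThreadBlockSize can exceed n whenever n is not a multiple of the block size.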

Once I have reworked it properly, I will upload it here.