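CUDA large matrices and blocks/threads

I have some questions about how to handle large matrices. As explained in this other question, I have a program that works on big matrices (like 5k-10k). The computation part is correct (though still not 100% optimized); I tested it with smaller square matrices (like 256-512). Here is my code: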
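#include <stdio.h>
#include <stdlib.h>

#define N 10000
#define RADIUS 100
#define SQRADIUS (RADIUS * RADIUS)  // parenthesized so the macro expands safely inside larger expressions
#define THREADS 512

// many __device__ functions like this one are declared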
__device__ unsigned char avg(const unsigned char *src, const unsigned int row, const unsigned int col) {
    unsigned int sum = 0, c = 0;
    // some work with radius and stuff
    return sum;
}
__global__ void applyAvg(const unsigned char *src, unsigned char *dest) {
    // flat 1D index of this thread across the whole grid
    unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x, tmp = 0;
    // total number of threads launched; each thread jumps ahead by this much
    unsigned int stride = blockDim.x * gridDim.x;
    int col = tid % N, row = (int)tid / N;
    while (tid < N * N) {
        if (row * col < N * N) {
            // choose which of the __device__ functions needs to be launched
        }
        tid += stride;
        col = tid % N, row = (int)tid / N;
    }
    __syncthreads();
}
int main(void) {
    cudaError_t err;
    unsigned char *base, *thresh, *d_base, *d_thresh, *avg, *d_avg;
    int i, j;
    base = (unsigned char*)malloc((N * N) * sizeof(unsigned char));
    thresh = (unsigned char*)malloc((N * N) * sizeof(unsigned char));
    avg = (unsigned char*)malloc((N * N) * sizeof(unsigned char));
    err = cudaMalloc((void**)&d_base, (N * N) * sizeof(unsigned char));
    if (err != cudaSuccess) { printf("ERROR 1"); exit(-1); }
    err = cudaMalloc((void**)&d_thresh, (N * N) * sizeof(unsigned char));
    if (err != cudaSuccess) { printf("ERROR 2"); exit(-1); }
    err = cudaMalloc((void**)&d_avg, (N * N) * sizeof(unsigned char));
    if (err != cudaSuccess) { printf("ERROR 3"); exit(-1); }
    for (i = 0; i < N * N; i++) {
        base[i] = (unsigned char)(rand() % 256);
    }
    err = cudaMemcpy(d_base, base, (N * N) * sizeof(unsigned char), cudaMemcpyHostToDevice);
    if (err != cudaSuccess) { printf("ERROR 4"); exit(-1); }
    // more 'light' stuff to do before the 'heavy computation'
    applyAvg<<<(N + THREADS - 1)/THREADS, THREADS>>>(d_thresh, d_avg);
    err = cudaMemcpy(thresh, d_thresh, (N * N) * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) { printf("ERROR 5"); exit(-1); }
    err = cudaMemcpy(avg, d_avg, (N * N) * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) { printf("ERROR 6"); exit(-1); }
    getchar();
    return 0;
}
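When I launch the problem on a big matrix (like 10000 x 10000) with a radius of 100 (which is how "far" I look from every point of the matrix), it takes a very long time.

I believe the problem lies both in applyAvg<<<(N + THREADS - 1)/THREADS, THREADS>>> (how I decide how many blocks and threads to run) and in the applyAvg(...) method (the stride and tid). Can someone clarify which is the best way to decide how many blocks/threads to launch, given that the matrix can vary from 5k to 10k?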
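
For a sense of scale, the total work can be estimated on the host before any launch. This is only a rough sketch: it assumes the filter reads a full square window of side 2*RADIUS + 1 around every pixel (the elided `avg` body may read less, for example a circular window), and the helper names `windowTaps`, `totalReads` and `printWorkEstimate` are made up for illustration.

```cuda
#include <cstdio>

// Hypothetical helpers, not part of the original program.
// Assumes a full square window of side 2*radius + 1 around each pixel.
constexpr unsigned long long windowTaps(unsigned long long radius) {
    return (2 * radius + 1) * (2 * radius + 1);
}

// Total input reads for an n x n matrix under the same assumption.
// Border pixels would read fewer values, so this is an upper bound.
constexpr unsigned long long totalReads(unsigned long long n, unsigned long long radius) {
    return n * n * windowTaps(radius);
}

// Print the estimate for the sizes used in the question.
inline void printWorkEstimate() {
    std::printf("reads per pixel: %llu\n", windowTaps(100));
    std::printf("total reads:     %llu\n", totalReads(10000, 100));
}
```

Under that assumption each pixel costs 40,401 reads and the whole 10000 x 10000 matrix about 4 x 10^12, so a long run time would not be surprising even with a good launch configuration.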
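Thanks for your advice. Yes, it is a kind of filtering. I have really messed up how blocks/threads work... Say I have a 10000 x 10000 matrix and every pixel needs to be filtered: what is the best way to launch my kernel? At the moment I am launching about 20 blocks with 512 threads per block, right? Is there a better way to make use of the 65k blocks? Or a more performant use of threads? I am quite new to CUDA and everything looks so confusing xD – n0n4m3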
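
The "about 20 blocks" figure can be checked with the same integer arithmetic the launch uses. The sketch below is host-only and the helper names `blocksFor`, `iterationsPerThread` and `printLaunchNumbers` are hypothetical, introduced only for this check.

```cuda
#include <cstdio>

// Same integer arithmetic as the launch in the question:
// (N + THREADS - 1) / THREADS blocks of THREADS threads each.
constexpr unsigned int blocksFor(unsigned int n, unsigned int threads) {
    return (n + threads - 1) / threads;
}

// Grid-stride loop iterations for the busiest thread when `total`
// elements are shared among blocks * threads threads (rounded up).
constexpr unsigned int iterationsPerThread(unsigned int total,
                                           unsigned int blocks,
                                           unsigned int threads) {
    return (total + blocks * threads - 1) / (blocks * threads);
}

// Print the numbers for N = 10000 and THREADS = 512.
inline void printLaunchNumbers() {
    const unsigned int blocks = blocksFor(10000, 512);
    std::printf("blocks:  %u\n", blocks);
    std::printf("threads: %u\n", blocks * 512);
    std::printf("iterations per thread: %u\n",
                iterationsPerThread(10000u * 10000u, blocks, 512));
}
```

So the launch is indeed 20 blocks of 512 threads, 10,240 threads in total, and the grid-stride loop makes each of them visit up to 9,766 pixels one after another.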
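For a 2D task, I suggest you use 16x16 threads per block and 625x625 blocks per grid. There is no need to maximize the number of blocks. See here for a 1D task: http://stackoverflow.com/questions/19422993/how-to-chose-value-of-block-and-thread-in-cuda/19423751#19423751 – kangshiyin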
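
A minimal sketch of what that 2D configuration could look like, assuming one thread per pixel. The names `ceilDiv`, `applyAvg2D` and `launchApplyAvg2D` are hypothetical, the matrix size is passed as a parameter instead of the N macro, and the filter is replaced by a plain copy as a placeholder, so this shows only the indexing and the launch, not the original computation.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Ceiling division: number of blocks needed to cover n elements.
constexpr unsigned int ceilDiv(unsigned int n, unsigned int block) {
    return (n + block - 1) / block;
}

// Hypothetical 2D variant of applyAvg: one thread per pixel.
// The real filter is replaced by a copy, used here only as a placeholder.
__global__ void applyAvg2D(const unsigned char *src, unsigned char *dest,
                           unsigned int n) {
    unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
    // Guard against threads that fall outside the matrix when n is not
    // a multiple of the block size.
    if (row < n && col < n) {
        size_t idx = (size_t)row * n + col;
        dest[idx] = src[idx];
    }
}

// Launch with 16x16 threads per block; for n = 10000 the grid is 625x625.
inline cudaError_t launchApplyAvg2D(const unsigned char *d_src,
                                    unsigned char *d_dest, unsigned int n) {
    dim3 block(16, 16);
    dim3 grid(ceilDiv(n, block.x), ceilDiv(n, block.y));
    applyAvg2D<<<grid, block>>>(d_src, d_dest, n);
    // Reports launch-configuration errors; kernel execution is asynchronous.
    return cudaGetLastError();
}
```

Because the grid size is computed from n, the same launch code would cover the whole 5k to 10k range without retuning, and the bounds check handles sizes that are not a multiple of 16.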
Thanks again, you have been really helpful. I am going to read the post you linked... you deserve the green checkmark ;) – n0n4m3