CUDA - same algorithm works on the CPU but not on the GPU

I am currently working on my first project in CUDA and I have run into something strange that must be inherent to CUDA and that I either don't understand or am overlooking. The exact same algorithm, one that involves no parallel work at all, works on the CPU but not on the GPU.

Let me explain in more detail. I am determining a threshold with Otsu's method, computed repeatedly, which reduces transfer time. Long story short, this function:
__device__ double computeThreshold(unsigned int* histogram, int* nbPixels){
    double sum = 0;
    for (int i = 0; i < 256; i++){
        sum += i*histogram[i];
    }
    int sumB = 0, wB = 0, wF = 0;
    double mB, mF, max = 1, between = 0, threshold1 = 0, threshold2 = 0;
    for (int j = 0; j < 256 && !(wF == 0 && j != 0 && wB != 0); j++){
        wB += histogram[j];
        if (wB != 0) {
            wF = *nbPixels - wB;
            if (wF != 0){
                sumB += j*histogram[i];
                mB = sumB/wB;
                mF = (sum - sumB)/wF;
                between = wB * wF *(mB - mF) *(mB - mF);
                if (max < 2.0){
                    threshold1 = j;
                    if (between > max){
                        threshold2 = j;
                    }
                    max = between;
                }
            }
        }
    }
    return (threshold1 + threshold2)/2.0;
}
This works as expected as long as the image size (i.e. the number of pixels) is not too big, but fails otherwise. Interestingly, it still fails even if I don't use histogram and nbPixels inside the function and replace every occurrence of them with constants, and even if I remove the parameters from the function entirely. (What I mean by "fails" is that the first operation after the kernel call returns an unspecified launch failure.)
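To make the failure concrete, here is roughly how I observe it; a minimal error-checking sketch written for this post (the CHECK macro name is my own shorthand, not something from the CUDA toolkit or my actual project):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal CUDA runtime error check: print the error string and abort.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err = (call);                                        \
        if (err != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// Used around the launch (inside main, after the kernel call):
//     imageKernel<<<dimGrid, dimBlock>>>(dev_image, dev_histogram, dev_nbPixels, dev_threshold);
//     CHECK(cudaGetLastError());       // launch-configuration errors
//     CHECK(cudaDeviceSynchronize());  // the "unspecified launch failure" is reported here
//     CHECK(cudaMemcpy(threshold, dev_threshold, sizeof(double), cudaMemcpyDeviceToHost));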
Edit 3: OK, my earlier test code was spoiled by copy/paste errors. The following version compiles and reproduces the error:
// Headers needed for ceil() in main and for the CUDA runtime API.
#include <cmath>
#include <cuda_runtime.h>

__device__ double computeThreshold(unsigned int* histogram, long int* nbPixels){
    double sum = 0;
    for (int i = 0; i < 256; i++){
        sum += i*histogram[i];
    }
    int sumB = 0, wB = 0, wF = 0;
    double mB, mF, max = 1, between = 0, threshold1 = 0, threshold2 = 0;
    for (int j = 0; j < 256 && !(wF == 0 && j != 0 && wB != 0); j++){
        wB += histogram[j];
        if (wB != 0) {
            wF = *nbPixels - wB;
            if (wF != 0){
                sumB += j*histogram[j];
                mB = sumB/wB;
                mF = (sum - sumB)/wF;
                between = wB * wF *(mB - mF) *(mB - mF);
                if (max < 2.0){
                    threshold1 = j;
                    if (between > max){
                        threshold2 = j;
                    }
                    max = between;
                }
            }
        }
    }
    return (threshold1 + threshold2)/2.0;
}
__global__ void imageKernel(unsigned int* image, unsigned int* histogram, long int* nbPixels, double* t_threshold){
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i >= *nbPixels) return;
    double threshold = computeThreshold(histogram, nbPixels);
    unsigned int pixel = image[i];
    if (pixel >= threshold){
        pixel = 255;
    } else {
        pixel = 0;
    }
    image[i] = pixel;
    *t_threshold = threshold;
}
int main(){
    unsigned int histogram[256] = { 0 };
    const int width = 2048 * 4096;
    const int height = 1;
    unsigned int* myimage;
    myimage = new unsigned int[width*height];
    for (int i = 0; i < width*height; i++){
        myimage[i] = i % 256;
        histogram[i % 256]++;
    }
    const int threadPerBlock = 256;
    const int nbBlock = ceil((double)(width*height)/threadPerBlock);
    unsigned int* partial_histograms = new unsigned int[256 * nbBlock];
    dim3 dimBlock(threadPerBlock, 1);
    dim3 dimGrid(nbBlock, 1);
    unsigned int* dev_image;
    unsigned int* dev_histogram;
    unsigned int* dev_partial_histograms;
    double* dev_threshold;
    double x = 0;
    double* threshold = &x;
    long int* nbPixels;
    long int nb = width*height;
    nbPixels = &(nb);
    long int* dev_nbPixels;
    cudaSetDevice(0);
    cudaMalloc((void**)&dev_image, sizeof(unsigned int)*width*height);
    cudaMalloc((void**)&dev_histogram, sizeof(unsigned int)* 256);
    cudaMalloc((void**)&dev_partial_histograms, sizeof(unsigned int)* 256 * nbBlock);
    cudaMalloc((void**)&dev_threshold, sizeof(double));
    cudaMalloc((void**)&dev_nbPixels, sizeof(long int));
    cudaMemcpy(dev_image, myimage, sizeof(unsigned int)*width*height, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_histogram, histogram, sizeof(unsigned int)* 256, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_nbPixels, nbPixels, sizeof(long int), cudaMemcpyHostToDevice);
    imageKernel<<<dimGrid, dimBlock>>>(dev_image, dev_histogram, dev_nbPixels, dev_threshold);
    cudaMemcpy(histogram, dev_histogram, sizeof(unsigned int)* 256, cudaMemcpyDeviceToHost);
    cudaMemcpy(partial_histograms, dev_partial_histograms, sizeof(unsigned int)* 256 * nbBlock, cudaMemcpyDeviceToHost);
    cudaMemcpy(threshold, dev_threshold, sizeof(double), cudaMemcpyDeviceToHost);
    cudaDeviceReset();
    return 0;
}
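For comparison, the CPU version that does work is essentially the same loop executed on the host. This is a paraphrase rather than a verbatim copy of my project code (the name computeThresholdCPU is just for this post):

// Host-side copy of the same Otsu loop as the device function above.
double computeThresholdCPU(const unsigned int* histogram, long int nbPixels){
    double sum = 0;
    for (int i = 0; i < 256; i++){
        sum += i * histogram[i];
    }
    int sumB = 0, wB = 0, wF = 0;
    double mB, mF, max = 1, between = 0, threshold1 = 0, threshold2 = 0;
    for (int j = 0; j < 256 && !(wF == 0 && j != 0 && wB != 0); j++){
        wB += histogram[j];
        if (wB != 0) {
            wF = nbPixels - wB;
            if (wF != 0){
                sumB += j * histogram[j];
                mB = sumB / wB;
                mF = (sum - sumB) / wF;
                between = wB * wF * (mB - mF) * (mB - mF);
                if (max < 2.0){
                    threshold1 = j;
                    if (between > max){
                        threshold2 = j;
                    }
                    max = between;
                }
            }
        }
    }
    return (threshold1 + threshold2) / 2.0;
}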
Edit 4: the characteristics of my GPU:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GT 750M"
CUDA Driver Version/Runtime Version 7.5/7.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147483648 bytes)
(2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
GPU Max Clock rate: 1085 MHz (1.09 GHz)
Memory Clock rate: 900 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID/Bus ID/location ID: 0/1/0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GT 750M
Result = PASS
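Since that output reports "Run time limit on kernels: Yes" and the card runs in WDDM mode, here is a small check to confirm the same thing programmatically; a sketch using the standard cudaGetDeviceProperties API (written for this post, not part of my repro):

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0, the GT 750M
    printf("Device: %s\n", prop.name);
    printf("Run time limit on kernels: %s\n",
           prop.kernelExecTimeoutEnabled ? "Yes" : "No");
    return 0;
}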
Edit 5: I ran CUDA-MEMCHECK again and this time it did report errors. I don't know why it didn't the first time; I must have made another mistake. I hope you will forgive my hesitations and the time wasted. Here is the output:
========= CUDA-MEMCHECK
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xb7802) [0xdb1e2]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0x160f]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xc764]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xfe24]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15454]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xb7802) [0xdb1e2]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0x160f]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xc788]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xfe24]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15454]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xb7802) [0xdb1e2]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0x160f]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xc7a6]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xfe24]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15454]
=========
========= ERROR SUMMARY: 3 errors
Not very enlightening, is it?
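For completeness, this is roughly how I built and ran the repro under cuda-memcheck (reconstructed from memory, so the exact options may differ):

nvcc -arch=sm_30 -lineinfo -o test.exe test.cu
cuda-memcheck test.exe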
You need to provide the host code. – brano
With this kind of debugging question, unless you can provide the shortest complete code that someone else can copy and paste into an editor, compile, run, and reproduce your error with, we cannot help you. CUDA ships with tools for detecting memory access errors, such as cuda-memcheck. Have you tried using them? – talonmies
@talonmies I know it is hard, if not impossible, to find the error like this, but I thought there might be some relatively basic principle that I had overlooked. And yes, I have tried cuda-memcheck; it did not find any errors. – Nico