CUDA memory allocation performance

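I'm working on image filters in CUDA. The image processing itself is much faster than on the CPU. The problem is that allocating and setting up the image is really slow.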

This is how I allocate the memory and set up the image.

hr = cudaMalloc(&m_device.originalImage, size);                   
hr = cudaMalloc(&m_device.modifiedImage, size);                   
hr = cudaMalloc(&m_device.tempImage, size);                 
hr = cudaMemset(m_device.modifiedImage, 0, size);                   
hr = cudaMemcpy(m_device.originalImage, host.originalImage, size, cudaMemcpyHostToDevice);

Here is the output of the program.

C:\cpu_gpu_filters(GPU)\x64\Release>cpu_gpu_filters test-case.txt 
C:\Users\Max\Desktop\test_set\cheshire_cat_1280x720.jpg 
Init time: 519 ms 
Time spent: 2.35542 ms 
C:\Users\Max\Desktop\test_set\cheshire_cat_1366x768.jpg 
Init time: 31 ms 
Time spent: 2.68595 ms 
C:\Users\Max\Desktop\test_set\cheshire_cat_1600x900.jpg 
Init time: 44 ms 
Time spent: 3.54835 ms 
C:\Users\Max\Desktop\test_set\cheshire_cat_1920x1080.jpg 
Init time: 61 ms 
Time spent: 4.98131 ms 
C:\Users\Max\Desktop\test_set\cheshire_cat_2560x1440.jpg 
Init time: 107 ms 
Time spent: 9.0727 ms 
C:\Users\Max\Desktop\test_set\cheshire_cat_3840x2160.jpg 
Init time: 355 ms 
Time spent: 20.1453 ms 
C:\Users\Max\Desktop\test_set\cheshire_cat_5120x2880.jpg 
Init time: 449 ms 
Time spent: 35.815 ms 
C:\Users\Max\Desktop\test_set\cheshire_cat_7680x4320.jpg 
Init time: 908 ms 
Time spent: 75.4647 ms

UPD: the timing code:

start = high_resolution_clock::now(); 
Initialize(); 
stop = high_resolution_clock::now(); 
long long ms = duration_cast<milliseconds>(stop - start).count(); 
long long us = duration_cast<microseconds>(stop - start).count(); 
cout << "Init time: " << ms << " ms" << endl; 


start = high_resolution_clock::now(); 
GpuTimer gpuTimer; 
gpuTimer.Start(); 
RunGaussianBlurKernel(
    m_device.modifiedImage, 
    m_device.tempImage, 
    m_device.originalImage, 
    m_device.filter, 
    m_filter.width, 
    m_host.originalImage.rows, 
    m_host.originalImage.cols 
    ); 
gpuTimer.Stop();

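The first image is the smallest, yet its initialization takes 519 ms. Maybe that is because the driver or something else has to be loaded. After that, the initialization time grows with the image size. That looks logical, but I'm still not sure initialization should be this slow. Am I doing something wrong?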

2016-05-16 Max

Where in your code do you take the start and end time measurements? – Makketronix

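@Makketronix, I'm pretty sure the way I measure time is correct, but I've updated the question. The question is whether it is normal for initialization to take this long. – Max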

Hmm. Are you building in "Debug" or "Release"? I've run into performance problems with "Debug" builds before. – Makketronix

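In your initialization code you have a cudaMemset, whose execution time depends on the size. There is also a cudaMemcpy, whose execution time is approximately the size of the copy in bytes divided by the PCI-Express bandwidth. That part is very likely the reason the init time increases. Running it through NSIGHT would give you more precise execution time figures. However, without an MCVE it is hard to answer with certainty.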

2016-05-17 06:12:08

Spot on. Because of the memset call, the initialization time is almost linear in the image size. – talonmies

I know the initialization time depends on the image size. I just didn't expect initialization to take this long. – Max

@Max If a pixel is char[3], the init throughput is only about 100 MB/s. That looks like the H2D mem copy or loading from disk, rather than cudaMemset. – kangshiyin
