2014-07-17

What do I need to change in my program so that it can compute primes with a higher upper limit? How can I raise the limit on the primes generated with my Sieve of Eratosthenes?

  • Currently my algorithm only works for numbers up to 85 million. I believe it should handle numbers up to 3 billion.

I am writing my own implementation of the Sieve of Eratosthenes in CUDA and I have hit a wall. So far, the algorithm seems to work fine for small numbers (below 85 million). However, when I try to compute primes up to 100 million, 2 billion, or 3 billion, the system freezes (while the computation is running on the CUDA device); then, a few seconds later, my Linux machine returns to normal (unfrozen), but the CUDA program crashes with the following error message:

CUDA error at prime.cu:129 code=6(cudaErrorLaunchTimeout) "cudaDeviceSynchronize()" 

I have a GTX 780 (3 GB) and I allocate the sieve in a char array, so if I were computing primes up to 100,000, it would allocate 100,000 bytes on the device.

I assumed the GPU would allow numbers up to 3 billion, since it has 3 GB of memory; however, it only lets me go up to 85 million (85 million bytes = 0.08 GB).

Here is my prime.cu code:

#include <stdio.h> 
#include <helper_cuda.h> // checkCudaErrors() - NVIDIA_CUDA-6.0_Samples/common/inc 
// #include <cuda.h> 
// #include <cuda_runtime_api.h> 
// #include <cuda_runtime.h> 

typedef unsigned long long int uint64_t; 

/****************************************************************************** 
* kernel that initializes the 1st couple of values in the primes array. 
******************************************************************************/ 
__global__ static void sieveInitCUDA(char* primes) 
{ 
    primes[0] = 1; // value of 1 means the number is NOT prime 
    primes[1] = 1; // numbers "0" and "1" are not prime numbers 
} 

/****************************************************************************** 
* kernel for sieving the even numbers starting at 4. 
******************************************************************************/ 
__global__ static void sieveEvenNumbersCUDA(char* primes, uint64_t max) 
{ 
    // each thread marks one even number: 4, 6, 8, ... (the stride-2 mapping 
    // matches dimGridEvens, which launches roughly max/2 threads) 
    uint64_t index = (uint64_t)(blockIdx.x * blockDim.x + threadIdx.x) * 2 + 4; 
    if (index < max) 
     primes[index] = 1; 
} 

/****************************************************************************** 
* kernel for finding prime numbers using the sieve of eratosthenes 
* - primes: an array of bools. initially all numbers are set to "0". 
*   A "0" value means that the number at that index is prime. 
* - max: the max size of the primes array 
* - maxRoot: the sqrt of max (the other input). we don't wanna make all threads 
* compute this over and over again, so it's being passed in 
******************************************************************************/ 
__global__ static void sieveOfEratosthenesCUDA(char *primes, uint64_t max, 
               const uint64_t maxRoot) 
{ 
    // get the starting index; each thread takes one sieving candidate 
    // starting at 3 (the evens are handled by sieveEvenNumbersCUDA) 
    int index = blockIdx.x * blockDim.x + threadIdx.x + 3; 

    // make sure index won't go out of bounds, also don't start the execution 
    // on numbers that are already composite 
    if (index < maxRoot && primes[index] == 0) 
    { 
     // mark off the composite numbers; j must be 64-bit, since index*index 
     // and the sieve bound overflow a 32-bit int once max exceeds ~2.1 billion 
     for (uint64_t j = (uint64_t)index * index; j < max; j += index) 
     { 
     primes[j] = 1; 
     } 
    } 
} 

/****************************************************************************** 
* checkDevice() 
******************************************************************************/ 
__host__ int checkDevice() 
{ 
    // query the Device and decide on the block size 
    int devID = 0; // the default device ID 
    cudaError_t error; 
    cudaDeviceProp deviceProp; 
    error = cudaGetDevice(&devID); 
    if (error != cudaSuccess) 
    { 
     printf("CUDA Device not ready or not supported\n"); 
     printf("%s: cudaGetDevice returned error code %d, line(%d)\n", __FILE__, error, __LINE__); 
     exit(EXIT_FAILURE); 
    } 

    error = cudaGetDeviceProperties(&deviceProp, devID); 
    if (deviceProp.computeMode == cudaComputeModeProhibited || error != cudaSuccess) 
    { 
     printf("CUDA device ComputeMode is prohibited or failed to getDeviceProperties\n"); 
     return EXIT_FAILURE; 
    } 

    // Use a larger block size for Fermi and above (see compute capability) 
    return (deviceProp.major < 2) ? 16 : 32; 
} 

/****************************************************************************** 
* genPrimesOnDevice 
* - inputs: limit - the largest prime that should be computed 
*   primes - an array of size [limit], initialized to 0 
******************************************************************************/ 
__host__ void genPrimesOnDevice(char* primes, uint64_t max) 
{ 
    int blockSize = checkDevice(); 
    if (blockSize == EXIT_FAILURE) 
     return; 

    char* d_Primes = NULL; 
    // the size must be 64-bit: an int overflows once max exceeds ~2.1 billion 
    uint64_t sizePrimes = sizeof(char) * max; 
    uint64_t maxRoot = sqrt(max); 

    // allocate the primes on the device and set them to 0 
    checkCudaErrors(cudaMalloc(&d_Primes, sizePrimes)); 
    checkCudaErrors(cudaMemset(d_Primes, 0, sizePrimes)); 

    // make sure that there are no errors... 
    checkCudaErrors(cudaPeekAtLastError()); 

    // setup the execution configuration 
    dim3 dimBlock(blockSize); 
    dim3 dimGrid((maxRoot + dimBlock.x)/dimBlock.x); 
    dim3 dimGridEvens(((max + dimBlock.x)/dimBlock.x)/2); 

    //////// debug 
    #ifdef DEBUG 
    printf("dimBlock(%d, %d, %d)\n", dimBlock.x, dimBlock.y, dimBlock.z); 
    printf("dimGrid(%d, %d, %d)\n", dimGrid.x, dimGrid.y, dimGrid.z); 
    printf("dimGridEvens(%d, %d, %d)\n", dimGridEvens.x, dimGridEvens.y, dimGridEvens.z); 
    #endif 

    // call the kernel 
    // NOTE: no need to synchronize after each kernel 
    // http://stackoverflow.com/a/11889641/2261947 
    sieveInitCUDA<<<1, 1>>>(d_Primes); // launch a single thread to initialize 
    sieveEvenNumbersCUDA<<<dimGridEvens, dimBlock>>>(d_Primes, max); 
    sieveOfEratosthenesCUDA<<<dimGrid, dimBlock>>>(d_Primes, max, maxRoot); 

    // check for kernel errors 
    checkCudaErrors(cudaPeekAtLastError()); 
    checkCudaErrors(cudaDeviceSynchronize()); 

    // copy the results back 
    checkCudaErrors(cudaMemcpy(primes, d_Primes, sizePrimes, cudaMemcpyDeviceToHost)); 

    // no memory leaks 
    checkCudaErrors(cudaFree(d_Primes)); 
} 

To test the code:

#include <cstdio> 
#include <cstdlib> 
#include <cstring> 
#include <iostream> 

int main() 
{ 
    uint64_t max = 85000000; // 85 million; an int would overflow beyond ~2.1 billion 
    char* primes = (char*) malloc(max); 
    if (primes == NULL) // check that it allocated correctly 
     return EXIT_FAILURE; 
    memset(primes, 0, max); 

    genPrimesOnDevice(primes, max); 

    // if you wish to display results: 
    for (uint64_t i = 0; i < max; i++) 
    { 
     if (primes[i] == 0) // if the value is '0', then the number is prime 
     { 
      std::cout << i; // use printf if you are using c 
      if ((i + 1) != max) 
       std::cout << ", "; 
     } 
    } 

    free(primes); 
    return 0; 
} 

There are more efficient sieving techniques for computing primes in parallel. Have a look at the [Atkin-Bernstein](http://en.wikipedia.org/wiki/Sieve_of_Atkin) sieve, for example – talonmies

Answer

This error:

CUDA error at prime.cu:129 code=6(cudaErrorLaunchTimeout) "cudaDeviceSynchronize()" 

does not necessarily point to a numerical limit or a computational error in your code. It means the system killed the kernel because it exceeded the amount of time a kernel is allowed to run. Both Linux and Windows can have such watchdog timers.

If you want to work around it on Linux, review this document.

You didn't mention it, but I assume your GTX 780 is also hosting a display. In that case, kernels are subject to a time limit by default. If you can use another device for the display, reconfigure your machine so that X does not use the GTX 780, as described in the link. If you have no other GPU for the display, then the only option for running long kernels is to modify the interactivity setting indicated in the linked document. In that case the keyboard/mouse/display will become unresponsive while a kernel is running, and if a kernel runs for too long it can be difficult to recover the machine; a reboot may be required. (You can also SSH into the machine and kill the process that is using the GPU for CUDA.)
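As a sketch of what that setting looks like: the NVIDIA X driver exposes an "Interactive" option in the Device section of xorg.conf. The identifier below is made up, and the linked document remains the authoritative reference:

```
Section "Device"
    Identifier "Device0"               # hypothetical identifier
    Driver     "nvidia"
    Option     "Interactive" "false"   # disable the kernel watchdog on this GPU
EndSection
```

With the watchdog disabled, long-running kernels are no longer killed, but the display driven by that GPU freezes for the duration of each kernel, as described above.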

Yes, my graphics card is also driving my display – Unglued

That also explains why my screen froze :) – Unglued