2014-11-03 44 views
0

过这个高效的分段黄金筛来到在互联网上,请帮我了解工作尤其是使用下一个矢量分段总理筛

如何被分片大小的具体选择影响性能的?

const int L1D_CACHE_SIZE = 32768; 
void segmented_sieve(int64_t limit, int segment_size = L1D_CACHE_SIZE) 
{ 
    int sqrt = (int) std::sqrt((double) limit); 
    int64_t count = (limit < 2) ? 0 : 1; 
    int64_t s = 2; 
    int64_t n = 3; 

    // vector used for sieving 
    std::vector<char> sieve(segment_size); 

    // generate small primes <= sqrt 
    std::vector<char> is_prime(sqrt + 1, 1); 
    for (int i = 2; i * i <= sqrt; i++) 
     if (is_prime[i]) 
      for (int j = i * i; j <= sqrt; j += i) 
       is_prime[j] = 0; 

    std::vector<int> primes; 
    std::vector<int> next; 

    for (int64_t low = 0; low <= limit; low += segment_size) 
    { 
     std::fill(sieve.begin(), sieve.end(), 1); 

     // current segment = interval [low, high] 
     int64_t high = std::min(low + segment_size - 1, limit); 

     // store small primes needed to cross off multiples 
     for (; s * s <= high; s++) 
     { 
      if (is_prime[s]) 
      { 
       primes.push_back((int) s); 
       next.push_back((int)(s * s - low)); 
      } 
     } 
     // sieve the current segment 
     for (std::size_t i = 1; i < primes.size(); i++) 
     { 
      int j = next[i]; 
      for (int k = primes[i] * 2; j < segment_size; j += k) 
       sieve[j] = 0; 
      next[i] = j - segment_size; 
     } 

     for (; n <= high; n += 2) 
      if (sieve[n - low]) // n is a prime 
       count++; 
    } 

    std::cout << count << " primes found." << std::endl; 
} 
+0

保留你的载体,你push_back。 – 2014-11-03 18:27:29

回答

1

下面是相同算法的更简洁的表述,它应该使原理更清晰(full, runnable .cpp for segment size timings @ pastebin的一部分)。这初始化了一个包装(仅赔率)的筛子,而不是计数素数,但涉及的原理是相同的。下载并运行.cpp以查看分段大小的影响。基本上,最佳值应该在CPU的L1高速缓存大小附近。太小,由于轮数增加而导致的开销开始占主导地位;太大了,你会因L2和L3缓存的较慢时间而受到惩罚。另见How does segmentation improve the running time of Sieve of Eratosthenes?

void initialise_packed_sieve_4G (void *data, unsigned segment_bytes = 1 << 15, unsigned end_bit = 1u << 31) 
{ 
    typedef std::vector<prime_and_offset_t>::iterator prime_iter_t; 
    std::vector<prime_and_offset_t> small_factors; 

    initialise_odd_primes_and_offsets_64K(small_factors); 

    unsigned segment_bits = segment_bytes * CHAR_BIT; 
    unsigned partial_bits = end_bit % segment_bits; 
    unsigned segments  = end_bit/segment_bits + (partial_bits != 0); 

    unsigned char *segment = static_cast<unsigned char *>(data); 
    unsigned bytes = segment_bytes; 

    for (; segments--; segment += segment_bytes) 
    { 
     if (segments == 0 && partial_bits) 
     { 
     segment_bits = partial_bits; 
     bytes = (partial_bits + CHAR_BIT - 1)/CHAR_BIT; 
     } 

     std::memset(segment, 0, bytes); 

     for (prime_iter_t p = small_factors.begin(); p != small_factors.end(); ++p) 
     { 
     unsigned n = p->prime; 
     unsigned i = p->next_offset; 

     for (; i < segment_bits; i += n) 
     { 
      set_bit(segment, i); 
     } 

      p->next_offset = i - segment_bits; 
     } 
    } 
} 

如果偏移不是从段段记得那么就要使用至少一种划分,每个重计算指数一次乘法,加上条件语句或严重位弄虚作假每次重新计算。当筛选完整的2^32数字范围(每个32 KB的8192个段)至少有53,583,872个慢分割和相同数量的稍微更快的乘法;这大概需要一秒钟时间才能完成初始化2^32筛子所需的时间(2^31比特用于赔率Eratosthenes)。

下面是我的大筛子之一一些实际的代码,使用这种“reconstitutive的数学:

for (index_t k = 1; k <= max_factor_bit; ++k) 
{ 
    if (bitmap_t::traits::bt(bm.bm, k)) continue; 

    index_t n = (k << 1) + 1;  // == index_for_value(value_for_index(k) * 2) == n 
    index_t i = square(n) >> 1; // == index_for_value(square(n)) 

    if (i < offset) 
    { 
     i += ((offset - i)/n) * n; 
    } 

    for (; i <= new_max_bit; i += n) 
    { 
     bitmap_t::traits::bts(bm.bm, i); 
    } 
} 

这需要大约5.5秒的全筛(VC++);首先显示的代码只需要4.5秒,而使用gcc 4.8.1(MinGW64)则只需要3.5秒。

下面是GCC定时:

sieve bits = 2147483648 (equiv. number = 4294967295) 

segment size 4096 (2^12) bytes ... 4.091 s 1001.2 M/s 
segment size 8192 (2^13) bytes ... 3.723 s 1100.2 M/s 
segment size 16384 (2^14) bytes ... 3.534 s 1159.0 M/s 
segment size 32768 (2^15) bytes ... 3.418 s 1198.4 M/s 
segment size 65536 (2^16) bytes ... 3.894 s 1051.9 M/s 
segment size 131072 (2^17) bytes ... 4.265 s 960.4 M/s 
segment size 262144 (2^18) bytes ... 4.453 s 919.8 M/s 
segment size 524288 (2^19) bytes ... 5.002 s 818.9 M/s 
segment size 1048576 (2^20) bytes ... 5.176 s 791.3 M/s 
segment size 2097152 (2^21) bytes ... 5.135 s 797.7 M/s 
segment size 4194304 (2^22) bytes ... 5.251 s 780.0 M/s 
segment size 8388608 (2^23) bytes ... 7.412 s 552.6 M/s 

digest { 203280221, 0C903F86, 5B253F12, 774A3204 } 

注意:另外的第二可从那个时候通过被称为“presieving”特技,即爆破预先计算的图案成位映像,而不是零它的被剃在开始时。这将gcc时间降至2.1秒,完成筛选,其中this hacked copy of the earlier .cpp。这个技巧与缓存大小的分段筛分非常吻合。

0

我不是专家在这一点,但我的直觉告诉我:

  1. 限制筛检索表

    以适应CPU 的L1缓存获取的全部好处当前硬件架构的性能提升

  2. next vector

    ,如果你想segmentate筛 那么你必须记住已过筛每每一个素的最后一个索引,例如

    • 过筛素数:2,3,5
    • 段大小:8

      |0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7| // segments 
      ----------------------------------------------- 
      2|- x x x x x x x x x x x 
      3|-  x  x  x  x  x  x  x 
      5|-   x   x   x   x  
      ----------------------------------------------- 
      |    ^   ^   ^
                // next value offset for each prime 
      

    做馅下一个环节,如果您继续顺利...

+0

@PratikKumar btw出于某种目的(如非订购/统一使用IsPrime?测试)是周期筛更快的整体性能看这里(它可能会让你感兴趣)http://stackoverflow.com/a/22477240/2521214 – Spektre 2014-11-04 08:03:19