tbb::parallel_reduce and std::accumulate give different results

I am learning Intel's TBB library. When summing all values in a std::vector, the result of tbb::parallel_reduce differs from std::accumulate once the vector holds more than 16.777.220 elements (the error shows up at 16.777.320 elements). Here is my minimal working example:

#include <algorithm>   // std::fill
#include <functional>  // std::plus
#include <iostream>
#include <vector>
#include <numeric>
#include <limits>
#include "tbb/tbb.h"

int main(int argc, const char * argv[]) { 

    int count = std::numeric_limits<int>::max() * 0.0079 - 187800; // - 187900 works 

    std::vector<float> heights(count);
    std::fill(heights.begin(), heights.end(), 1.0f); 

    float ssum = std::accumulate(heights.begin(), heights.end(), 0); 
    float psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<float>::iterator>(heights.begin(), heights.end()), 0, 
             [](tbb::blocked_range<std::vector<float>::iterator> const& range, float init) { 
              return std::accumulate(range.begin(), range.end(), init); 
             }, std::plus<float>() 
            ); 

    std::cout << std::endl << " Heights serial sum: " << ssum << " parallel sum: " << psum; 
    return 0; 
}

On my OS X 10.10.3 with Xcode 6.3.1 and TBB stable 4.3-20141023 (installed via Brew), this prints:

Heights serial sum: 1.67772e+07 parallel sum: 1.67773e+07

Why is that? Should I report a bug to the TBB developers?

Additional tests, applying your answers:

correct value is: 1949700403 
cause we add 1.0f to zero 1949700403 times 

using (int) init values: 
Runtime: 17.407 sec. Heights serial sum: 16777216.000, wrong 
Runtime: 8.482 sec. Heights parallel sum: 131127368.000, wrong 

using (float) init values: 
Runtime: 12.594 sec. Heights serial sum: 16777216.000, wrong 
Runtime: 5.044 sec. Heights parallel sum: 303073632.000, wrong 

using (double) initial values: 
Runtime: 13.671 sec. Heights serial sum: 1949700352.000, wrong 
Runtime: 5.343 sec. Heights parallel sum: 263690016.000, wrong 

using (double) initial values and tbb::parallel_deterministic_reduce: 
Runtime: 13.463 sec. Heights serial sum: 1949700352.000, wrong 
Runtime: 99.031 sec. Heights parallel sum: 1949700352.000, wrong >>> almost 10x slower !

Why do all the reduce calls produce wrong sums? Is (double) not enough? Here is my test code:

    #include <iostream>
    #include <vector> 
    #include <numeric> 
    #include <limits> 
    #include <sys/time.h> 
    #include <iomanip> 
    #include "tbb/tbb.h" 
    #include <cmath> 

    class StopWatch { 
    private: 
     double elapsedTime; 
     timeval startTime, endTime; 
    public: 
     StopWatch() : elapsedTime(0) {} 
     void startTimer() { 
      elapsedTime = 0; 
      gettimeofday(&startTime, 0); 
     } 
     void stopNprintTimer() { 
      gettimeofday(&endTime, 0); 
      elapsedTime = (endTime.tv_sec - startTime.tv_sec) * 1000.0;    // compute sec to ms 
      elapsedTime += (endTime.tv_usec - startTime.tv_usec)/1000.0;   // compute us to ms and add 
      std::cout << " Runtime: " << std::right << std::setw(6) << elapsedTime/1000 << " sec.";    // show in sec 
     } 
    }; 

    int main(int argc, const char * argv[]) { 

     StopWatch watch; 
     std::cout << std::fixed << std::setprecision(3) << "" << std::endl; 
     size_t count = std::numeric_limits<int>::max() * 0.9079; 

     std::vector<float> heights(count); 
     std::cout << " Vector size: " << count << std::endl; 
     std::fill(heights.begin(), heights.end(), 1.0f); 

     watch.startTimer(); 
     float ssum = std::accumulate(heights.begin(), heights.end(), 0.0); // change type of initial value here 
     watch.stopNprintTimer(); 
     std::cout << " Heights serial sum: " << std::right << std::setw(8) << ssum << std::endl; 

     watch.startTimer(); 
     float psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<float>::iterator>(heights.begin(), heights.end()), 0.0, // change type of initial value here 
              [](tbb::blocked_range<std::vector<float>::iterator> const& range, float init) { 
               return std::accumulate(range.begin(), range.end(), init); 
              }, std::plus<float>() 
             ); 
     watch.stopNprintTimer(); 
     std::cout << " Heights parallel sum: " << std::right << std::setw(8) << psum << std::endl; 

     return 0; 
    }

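To answer my last question: they all produce wrong results because floating-point types are not made for integer addition with numbers this large. Switching to int solves it: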

[...] 
std::vector<int> heights(count); 
std::cout << " Vector size: " << count << std::endl; 
std::fill(heights.begin(), heights.end(), 1); 

watch.startTimer(); 
int ssum = std::accumulate(heights.begin(), heights.end(), (int)0); 
watch.stopNprintTimer(); 
std::cout << " Heights serial sum: " << std::right << std::setw(8) << ssum << std::endl; 

watch.startTimer(); 
int psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<int>::iterator>(heights.begin(), heights.end()), (int)0, 
            [](tbb::blocked_range<std::vector<int>::iterator> const& range, int init) { 
             return std::accumulate(range.begin(), range.end(), init); 
            }, std::plus<int>() 
           ); 
watch.stopNprintTimer(); 
std::cout << " Heights parallel sum: " << std::right << std::setw(8) << psum << std::endl; 
[...]

Result:

Vector size: 1949700403 
Runtime: 13.041 sec. Heights serial sum: 1949700403, correct 
Runtime: 4.728 sec. Heights parallel sum: 1949700403, correct and almost 4x faster

2015-05-05 VisorZ

Floating-point arithmetic is not real-number arithmetic. If you change the order of operations, you may get different rounding errors. – DanielKO

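Your call to std::accumulate is doing integer addition, then converting the result to float at the end of the computation. To accumulate floats, the accumulator should be a float [*].

float ssum = std::accumulate(heights.begin(), heights.end(), 0.0f);
                                                             ^^^^

[*] Or any other type that can accumulate float values correctly.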

2015-05-05 12:45:41 juanchopanza

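Thanks. As the std::accumulate template syntax shows, I used an int value here, and it was only by luck that I had filled my vector with 1.0f, which is 1 when converted to int. With float values the result is still incorrect, but this time because of the float data type's inaccuracy in the range of higher numbers. – VisorZ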

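This may clear up that part of the problem for you:

Your call to std::accumulate is doing integer addition, then converting the result to float at the end of the computation.

But floating-point addition is not an associative operation:

With accumulate: (...((s + a1) + a2) + ...) + an
With parallel_reduce: any arrangement of the parentheses is possible.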

http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

2015-05-05 14:04:09

Thanks for pointing out the floating-point precision issue and for linking to that great document (upvoted). – VisorZ

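In addition to the other correct answers to the 'why?' part, I would add that TBB provides parallel_deterministic_reduce, which guarantees reproducible results between two or more runs on the same data (though the result can still differ from std::accumulate). See the blog describing the problem and the deterministic algorithm.

So, regarding the 'Should I report a bug to the TBB developers?' part, the answer is clearly no (unless you find a shortcoming on the TBB side).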

2015-05-05 14:45:24 Anton

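Thanks for the hint. Unfortunately, on my 4-thread Intel i7 it takes more time than the serial std::accumulate() with a double initial value. – VisorZ

Having read the link carefully, I now understand that 'tbb::parallel_deterministic_reduce' does not produce the correct result, but it at least produces the same wrong result reproducibly, meaning every run yields the same error. To quote: _It is important to note that the repeatable results obtained with parallel_deterministic_reduce may still differ from the results obtained by serial execution. [..] Moreover, the algorithm is not intended to improve the accuracy of the computation._ – VisorZ

It is not an error. Otherwise the same could be said of std::accumulate – Anton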
