将工作划分为更多线程需要更多时间，为什么？

我有一个小型的C程序，它使用monte-carlo计算pi-这个模拟基本上只是测试一个随机点[x，y]，如果它位于一个圆内或圆外。将工作划分为更多线程需要更多时间，为什么？

为了接近PI我不得不使用大量的样品Ñ的，其具有的O（n）的一个成正比的复杂性。所以试图计算大量的样本n，我实现了POSIX threads api来平行计算能力。

我的代码如下所示：

pthread_t worker[nthreads]; /* creates workers for each thread */ 
struct param aparam[nthreads]; /* struct param{ long* hits; long rounds; }; */ 
long nrounds = nsamples/nthreads; /* divide samples to subsets of equal rounds per thread */ 

for (int i = 0; i < nthreads; ++i) { /* loop to create threads */ 
    aparam[i].hits = 0; 
    aparam[i].rounds = nrounds; 
    pthread_create(&worker[i], NULL, calc_pi, &aparam[i]); /* calls calc_pi(void* vparam){} */ 
} 

long nhits = 0; 
for (int j = 0; j < nthreads; ++j) { /* collects results */ 
    pthread_join(worker[j], NULL); 
    nhits += (long)aparam[j].hits; /* counts hits inside the cicrle */ 
}

而这就是每个线程正在做：

void* calc_pi(void* vparam) 
{ /* counts hits inside a circle */ 
    struct param *iparam; 
    iparam = (struct param *) vparam; 
    long hits = 0; 
    float x, y, z; 
    for (long i = 0; i < iparam->rounds; ++i) { 
     x = (float)rand()/RAND_MAX; 
     y = (float)rand()/RAND_MAX; 
     z = x * x + y * y; 
     if (z <= 1.f) /* circle radius of 1 */ 
      ++hits; 
    } 
    iparam->hits = (long*)hits; 
    return NULL; 
}

现在我有一个奇怪的观察。使用相同的样本集n并且具有越来越多的线程i该程序需要更多的时间而不是更少的时间。

这里有一些平均运行时间（可再现的）：

------------------------------------------------- 
| Threads[1] | Samples[1] | Rounds[1] | Time[s] | 
------------------------------------------------- 
|  32 | 268435456 | 8388608 | 118 | 
|  16 | 268435456 | 16777216 | 106 | 
|   8 | 268435456 | 33554432 | 125 | 
|   4 | 268435456 | 67108864 | 152 | 
|   2 | 268435456 | 134217728 |  36 | 
|   1 | 268435456 | 268435456 |  15 | 
-------------------------------------------------

为什么例如两个线程做同样的工作时间比两倍的时间比一个单独的线程吗？我的假设是，将工作划分的两个线程应该将时间减少至少50％。

与GCC 4.9.1编译和以下标志：

gcc -O2 -std=gnu11 -pthread pipa.c -lpthread -o pipa

我的硬件是一个双Intel Xeon E5520（2个处理器，每4个内核）@ 2.26 GHz的超线程禁用，运行Linux的科学与2.6 .18内核。

任何想法？

来源

2014-11-22 default

Linux 2.6.18是古老的。像，史前。我很肯定自那时起对多线程程序进行了很多改进。例如，你使用哪种pthreads-implementation？ LinuxThreads或NPTL？ – EOF 2014-11-22 11:16:47

你确定线程在不同的内核上运行吗？也许由于某些原因，他们共享相同的内核，所以上下文切换开销会增加rand（）可能导致争用的运行时间 – 2014-11-22 11:18:21

。 – 2501 2014-11-22 11:24:36

您的线程执行的最昂贵的操作是呼叫rand()。 rand()是一个朴素，简单，并且一般不具有MT可扩展功能（因为它保证了相同的种子产生相同的序列随机数字）。（*）

一个简单的技巧，以确认是否是问题，是在调试器下启动程序，然后，几次：暂停它，捕获线程的堆栈跟踪继续。无论堆栈轨迹中最常出现的是什么，很可能是瓶颈。（*）使它更慢的原因是锁定争用会导致额外的性能损失。此外，许多线程还会增加进程调度和上下文切换的额外开销。

来源

2014-11-22 11:52:16 Dummy00001

事实上，使用'rand_r（）'解决了这个问题。 – default 2014-11-22 12:03:21

将工作划分为更多线程需要更多时间，为什么？

回答

相关问题