Why is multithreading slower than sequential programming in my case?

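I am new to multithreading and am trying to learn it through a simple program that adds the integers from 1 to n and returns the sum. In the sequential case, main calls the sumFrom1 function twice, for n = 10000 and n = 20000; in the multithreaded case, two threads are created with pthread_create and the two sums are computed in separate threads. The multithreaded version is much slower than the sequential version (see the results below). I ran this on a 12-CPU machine, and there is no communication between the threads. Why is multithreading slower than sequential programming in my case?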

Multithreaded:

Thread 1 returns: 0 
Thread 2 returns: 0 
sum of 1..10000: 50005000 
sum of 1..20000: 200010000 
time: 156 seconds

Sequential:

sum of 1..10000: 50005000 
sum of 1..20000: 200010000 
time: 56 seconds

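When I compile with -O2, the multithreaded version (~8s) is faster than the sequential version (~11s), but not by as much as I expected. I can always use the -O2 flag, but I am curious about the low speed of the multithreaded version in the unoptimized case. Should it be slower than the sequential version? If not, what can I do to make it faster?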

Code:

#include <stdio.h> 
#include <pthread.h> 
#include <time.h> 

typedef struct my_struct 
{ 
    int n;                                        
    int sum;                                        
}my_struct_t;                                       

void *sumFrom1(void* sit)                                    
{                                          
    my_struct_t* local_sit = (my_struct_t*) sit;                               
    int i;                                        
    int nsim = 500000; // Loops for consuming time                                     
    int j;                                        

    for(j = 0; j < nsim; j++)                                   
    {                                         
    local_sit->sum = 0;                                     
    for(i = 0; i <= local_sit->n; i++)                                 
     local_sit->sum += i;                                    
    }  
} 

int main(int argc, char *argv[])                                  
{                                          
    pthread_t thread1;                                    
    pthread_t thread2;                                    
    my_struct_t si1;                                     
    my_struct_t si2;                                     
    int   iret1;                                     
    int   iret2;                                     
    time_t  t1;                                      
    time_t  t2;                                      


    si1.n = 10000;                                      
    si2.n = 20000;                                      

    if(argc == 2 && atoi(argv[1]) == 1) // Use "./prog 1" to test the time of multithreaded version                                 
    {                                         
    t1 = time(0);                                      
    iret1 = pthread_create(&thread1, NULL, sumFrom1, (void*)&si1);  
    iret2 = pthread_create(&thread2, NULL, sumFrom1, (void*)&si2);                          
    pthread_join(thread1, NULL);                                  
    pthread_join(thread2, NULL);                                  
    t2 = time(0);                                      

    printf("Thread 1 returns: %d\n",iret1);                               
    printf("Thread 2 returns: %d\n",iret2);                               
    printf("sum of 1..%d: %d\n", si1.n, si1.sum);                              
    printf("sum of 1..%d: %d\n", si2.n, si2.sum);                              
    printf("time: %d seconds", t2 - t1);                                

    }                                         
    else  // Use "./prog" to test the time of sequential version                                       
    {                                         
    t1 = time(0);                                      
    sumFrom1((void*)&si1);                                    
    sumFrom1((void*)&si2);                                    
    t2 = time(0);                                      

    printf("sum of 1..%d: %d\n", si1.n, si1.sum);                              
    printf("sum of 1..%d: %d\n", si2.n, si2.sum);                              
    printf("time: %d seconds", t2 - t1); 
    }                        
    return 0;                       
}

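UPDATE 1:

After a bit of googling for "false sharing" (thanks, @Martin James!), I think it is the main cause. There are (at least) two ways to fix it:

The first way is to insert a buffer zone between the two structures, so that si1 and si2 no longer sit in the same cache line (thanks, @dasblinkenlight):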

my_struct_t si1; 
char   memHolder[4096]; 
my_struct_t si2;

Without -O2, the run time drops from ~156s to ~38s.
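The 4096-byte buffer works, but a compiler is not obliged to lay out local variables in declaration order. As an alternative sketch (not what was measured above), C11 alignment specifiers can ask for each structure to start on its own cache line. The 64-byte line size, the `padded_sum_t` type, and the `on_separate_lines` helper are assumptions made up for this illustration; check the actual line size on your hardware.

```c
#include <stdalign.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines (common on x86-64, not guaranteed). */
#define CACHE_LINE_SIZE 64

/* Hypothetical variant of my_struct_t. Aligning the first member raises
   the alignment of the whole struct, and sizeof is rounded up to a
   multiple of that alignment, so two objects never share a line. */
typedef struct padded_sum
{
    alignas(CACHE_LINE_SIZE) int n;
    int sum;
} padded_sum_t;

/* Returns 1 if a and b start in different cache lines. */
static int on_separate_lines(const void *a, const void *b)
{
    uintptr_t line_a = (uintptr_t)a / CACHE_LINE_SIZE;
    uintptr_t line_b = (uintptr_t)b / CACHE_LINE_SIZE;
    return line_a != line_b;
}
```

Declaring si1 and si2 as padded_sum_t should then keep the two threads' writes on different lines without a manual buffer.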

The second way is to avoid updating sit->sum so frequently, which can be done with a temporary variable in sumFrom1 (as @Jens Gustedt answered):

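int sum = 0; // local accumulator, declared outside the loop so it is still in scope afterwards
for(j = 0; j < nsim; j++)
{
    sum = 0;
    for(i = 0; i <= local_sit->n; i++)
        sum += i;
}
local_sit->sum = sum; // single write to the shared struct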

Without -O2, the run time drops from ~156s to ~35s or ~109s (it has two peaks! I don't know why). With -O2, the run time stays at ~8s.

2012-04-11 cogitovita

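In tests like this, we need to average the results. How many times did you run the test with -O2 optimization? If you ran it several times, what was the average time? – 2012-04-11 09:23:53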

si1 and si2 are adjacent to each other. False sharing? – 2012-04-11 09:32:56

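@PavanManjunath Thanks for your suggestion. I ran it 10 times with -O2. The average time is 7.9s for the multithreaded version and 11.7s for the sequential version. The fluctuation is small. – cogitovita 2012-04-11 09:37:34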
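For anyone who wants to repeat the averaging, here is a rough sketch of a timing harness. It is not the code used for the numbers above. It assumes C11 `timespec_get` is available; `TIME_UTC` is wall-clock time and can jump if the system clock is adjusted, so POSIX `clock_gettime` with `CLOCK_MONOTONIC` may be preferable where it exists. The names `elapsed_seconds`, `average_runtime`, and `busy_loop` are invented for this example.

```c
#include <time.h>

/* Seconds between two timespec readings, with sub-second resolution. */
static double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Calls work(arg) `runs` times and returns the mean seconds per call.
   The return value of timespec_get is ignored to keep the sketch short. */
static double average_runtime(void (*work)(void *), void *arg, int runs)
{
    double total = 0.0;
    for (int r = 0; r < runs; r++)
    {
        struct timespec t1, t2;
        timespec_get(&t1, TIME_UTC);
        work(arg);
        timespec_get(&t2, TIME_UTC);
        total += elapsed_seconds(t1, t2);
    }
    return total / runs;
}

/* Placeholder workload: sums 0..n. The volatile accumulator keeps the
   loop from being optimized away. */
static void busy_loop(void *arg)
{
    int n = *(int *)arg;
    volatile unsigned long sum = 0;
    for (int i = 0; i <= n; i++)
        sum += (unsigned long)i;
}
```

Passing a function that creates and joins the two threads as `work` would give a per-run average comparable to the 10-run figures quoted in the comment.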

By modifying the code to

typedef struct my_struct 
{ 
    size_t n; 
    size_t sum; 
}my_struct_t; 

void *sumFrom1(void* sit) 
{ 
    my_struct_t* local_sit = sit; 
    size_t nsim = 500000; // Loops for consuming time 
    size_t n = local_sit->n; 
    size_t sum = 0; 
    for(size_t j = 0; j < nsim; j++) 
    { 
    for(size_t i = 0; i <= n; i++) 
     sum += i; 
    } 
    local_sit->sum = sum; 
    return 0; 
}

the phenomenon disappears. The problems you had:

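Using int as the data type is completely wrong for such a test. With your numbers, the sum overflowed. Overflow of signed types is undefined behavior. You are lucky that it didn't eat your lunch.
Going through the indirection, and summing into the variable behind it, buys you additional loads and stores. With -O0 these are really performed as written, with all the implications of false sharing and the like.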
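To see where int runs out, the totals can be checked with the closed form n(n+1)/2 in a 64-bit type. This is only a sketch; `sum_1_to_n` and `fits_in_int` are names made up here, and the comments assume a 32-bit int.

```c
#include <limits.h>
#include <stdint.h>

/* Closed form for 1 + 2 + ... + n, computed in 64 bits. */
static uint64_t sum_1_to_n(uint64_t n)
{
    return n * (n + 1) / 2;
}

/* Returns 1 if value can be stored in an int without overflow. */
static int fits_in_int(uint64_t value)
{
    return value <= (uint64_t)INT_MAX;
}

/* With a 32-bit int (INT_MAX = 2147483647):
   sum_1_to_n(20000)           = 200010000        fits
   sum_1_to_n(100000)          = 5000050000       does not fit
   500000 * sum_1_to_n(20000)  = 100005000000000  does not fit */
```

The last figure is the total the modified sumFrom1 accumulates, since it no longer resets sum on each of the nsim passes. That total needs a 64-bit type, which size_t provides on typical 64-bit platforms.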

Your code also showed other errors:

a missing include for atoi
superfluous casts to and from void*
printing a time_t as an int

Please compile your code with -Wall before posting.
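One way these three points could be addressed is sketched below. The helper functions `read_n`, `wants_threads`, and `print_elapsed` are hypothetical and exist only to isolate each fix; they are not part of the original program.

```c
#include <stdio.h>
#include <stdlib.h>   /* declares atoi */
#include <time.h>     /* declares time_t and difftime */

typedef struct my_struct
{
    int n;
    int sum;
} my_struct_t;

/* void* converts to any object pointer type implicitly in C,
   so no cast is needed here. */
static int read_n(void *sit)
{
    my_struct_t *local_sit = sit;
    return local_sit->n;
}

/* Same test as in main: "./prog 1" selects the multithreaded version. */
static int wants_threads(int argc, char *argv[])
{
    return argc == 2 && atoi(argv[1]) == 1;
}

/* time_t is not necessarily an int; difftime returns the difference
   in seconds as a double, which matches the %f conversion. */
static void print_elapsed(time_t t1, time_t t2)
{
    printf("time: %.0f seconds\n", difftime(t2, t1));
}
```

The same applies to the calls themselves: pthread_create(&thread1, NULL, sumFrom1, &si1) and sumFrom1(&si1) compile without the (void*) casts.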

2012-04-11 10:26:33

Using 'size_t sum = 0;' gave a significant performance gain, and then adding 'size_t n = local_sit->n;' slowed it down again. Any idea why? (All compiled with -O0) – alk 2012-04-11 11:04:40

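No, not really. I don't think it makes much sense to discuss unoptimized code at this level of detail. If you want to know what is really happening, the first step is to look at the assembler generated with '-S'. – 2012-04-11 11:18:24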
