我想知道是否有人可以提供解释。异常MPI行为
我的代码开始:
/*
Barrier implemented using tournament-style coding
*/
// Constraints: Number of processes must be a power of 2, e.g.
// 2,4,8,16,32,64,128,etc.
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>
void mybarrier(MPI_Comm);
// global debug bool
int verbose = 1;
int main(int argc, char * argv[]) {
int rank;
int size;
int i;
int sum = 0;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int check = size;
// check to make sure the number of processes is a power of 2
if (rank == 0){
while(check > 1){
if (check % 2 == 0){
check /= 2;
} else {
printf("ERROR: The number of processes must be a power of 2!\n");
MPI_Abort(MPI_COMM_WORLD, 1);
return 1;
}
}
}
// simple task, with barrier in the middle
for (i = 0; i < 500; i++){
sum ++;
}
mybarrier(MPI_COMM_WORLD);
for (i = 0; i < 500; i++){
sum ++;
}
if (verbose){
printf("process %d arrived at finalize\n", rank);
}
MPI_Finalize();
return 0;
}
void mybarrier(MPI_Comm comm){
// MPI variables
int rank;
int size;
int * data;
MPI_Status * status;
// Loop variables
int i;
int a;
int skip;
int complete = 0;
int currentCycle = 1;
// Initialize MPI vars
MPI_Comm_rank(comm, &rank);
MPI_Comm_size(comm, &size);
// step 1, gathering
while (!complete){
skip = currentCycle * 2;
// if currentCycle divides rank evenly, then it is a target
if ((rank % currentCycle) == 0){
// if skip divides rank evenly, then it needs to receive
if ((rank % skip) == 0){
MPI_Recv(data, 0, MPI_INT, rank + currentCycle, 99, comm, status);
if (verbose){
printf("1: %d from %d\n", rank, rank + currentCycle);
}
// otherwise, it needs to send. Once sent, the process is done
} else {
if (verbose){
printf("1: %d to %d\n", rank, rank - currentCycle);
}
MPI_Send(data, 0, MPI_INT, rank - currentCycle, 99, comm);
complete = 1;
}
}
currentCycle *= 2;
// main process will never send, so this code will allow it to complete
if (currentCycle >= size){
complete = 1;
}
}
complete = 0;
currentCycle = size/2;
// step 2, scattering
while (!complete){
// if currentCycle is 1, then this is the last loop
if (currentCycle == 1){
complete = 1;
}
skip = currentCycle * 2;
// if currentCycle divides rank evenly then it is a target
if ((rank % currentCycle) == 0){
// if skip divides rank evenly, then it needs to send
if ((rank % skip) == 0){
if (verbose){
printf("2: %d to %d\n", rank, rank + currentCycle);
}
MPI_Send(data, 0, MPI_INT, rank + currentCycle, 99, comm);
// otherwise, it needs to receive
} else {
if (verbose){
printf("2: %d waiting for %d\n", rank, rank - currentCycle);
}
MPI_Recv(data, 0, MPI_INT, rank - currentCycle, 99, comm, status);
if (verbose){
printf("2: %d from %d\n", rank, rank - currentCycle);
}
}
}
currentCycle /= 2;
}
}
预期的行为
的代码来增加了一笔500,等待所有其他进程使用阻塞MPI_SEND和MPI_RECV调用达到这一点,然后递增金额为1000
观察到的行为集群
集群行为与预期
在我的机器上观察到的异常行为
main函数中的所有进程都报告为99,我已经特别链接到mybarrier的第二个while循环的标记。
此外
我的第一稿书面for循环,并与之一,程序执行群集上像预期的那样好,但在我的机器上执行无法完成,即使所有进程调用MPI_Finalize(但没有一个超出它)。
MPI版本
我的机器运行OpenRTE 2.0.2 集群运行OpenRTE 1.6.3
的问题
我观察到我的机器似乎意外地运行所有的时间,而群集正常执行。在我编写的其他MPI代码中也是如此。我不知道1.6.3和2.0.2之间有重大变化吗?
无论如何,我很困惑,我想知道是否有人可以提供一些解释,为什么我的机器似乎没有正确运行MPI。我希望我已经提供了足够的细节,但如果没有,我会很乐意提供您需要的任何其他信息。
我相信固定它。谢谢你的帮助。我是MPI的新手,所以我仍然试图弄清楚一切是如何运作的。 – electr0sheep