2016-12-28

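MS MPI permission errors

I have two machines with MS MPI 7.1 installed, one named SERVER and the other named COMPUTE. The machines are on a LAN in a simple Windows workgroup (no domain), and both have an account with the same name and password.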

Both run the MSMPILaunchSvc service. Both machines can execute MPI jobs locally, which I verified by running the hostname command in a terminal on each machine:

SERVER> mpiexec -hosts 1 SERVER 1 hostname
SERVER
or
COMPUTE> mpiexec -hosts 1 COMPUTE 1 hostname
COMPUTE

I have disabled the firewall on both machines to make things easier.

My problem is that I cannot get MPI to run jobs on a remote host:

1: SERVER with MSMPILaunchSvc -> COMPUTE with MSMPILaunchSvc

SERVER> mpiexec -hosts 1 COMPUTE 1 hostname -pwd 
ERROR: Failed RpcCliCreateContext error 1722 

Aborting: mpiexec on SERVER is unable to connect to the smpd service on COMPUTE:8677 
Other MPI error, error stack: 
connect failed - The RPC server is unavailable. (errno 1722) 
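Error 1722 is the standard Windows code for "The RPC server is unavailable", which usually means the endpoint could not be reached at all. As a rough sanity check, the sketch below tests whether a plain TCP connection to the smpd port can be opened. It assumes Python is available on the machine (not part of the original setup); the function name `can_connect` is hypothetical. It says nothing about authentication, so it cannot explain errors 5 or 1726.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For the case above, the call would be `can_connect("COMPUTE", 8677)`, with the host name and port taken from the error message.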

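What is more frustrating is that I am only sometimes prompted for a password. It suggests SERVER\Maarten as the user for COMPUTE, which is the account I am logged in with on SERVER and which should not exist on COMPUTE (shouldn't it be COMPUTE\Maarten?). Either way, it also fails: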

SERVER>mpiexec -hosts 1 COMPUTE 1 hostname.exe -pwd 
Enter Password for SERVER\Maarten: 
Save Credentials[y|n]? n 
ERROR: Failed to connect to SMPD Manager Instance error 1726 

Aborting: mpiexec on SERVER is unable to connect to the 
smpd manager on COMPUTE:50915 error 1726 

2: COMPUTE with MSMPILaunchSvc -> SERVER with MSMPILaunchSvc

COMPUTE> mpiexec -hosts 1 SERVER 1 hostname -pwd 
ERROR: Failed RpcCliCreateContext error 5 

Aborting: mpiexec on COMPUTE is unable to connect to the smpd service on SERVER:8677 
Other MPI error, error stack: 
connect failed - Access is denied. (errno 5) 

3: COMPUTE with MSMPILaunchSvc -> SERVER with smpd daemon

Aborting: mpiexec on COMPUTE is unable to connect to the smpd service on SERVER:8677 
Other MPI error, error stack: 
connect failed - Access is denied. (errno 5) 

4: SERVER with MSMPILaunchSvc -> COMPUTE with smpd daemon

ERROR: Failed to connect to SMPD Manager Instance error 1726 

Aborting: mpiexec on SERVER is unable to connect to the smpd manager on 
COMPUTE:51022 error 1726 

Update:

Trying with the smpd daemon on both nodes, I get this error:

[-1:9796] Authentication completed. Successfully obtained Context for Client. 
[-1:9796] version check complete, using PMP version 3. 
[-1:9796] create manager process (using smpd daemon credentials) 
[-1:9796] smpd reading the port string from the manager 
[-1:9848] Launching smpd manager instance. 
[-1:9848] created set for manager listener, 376 
[-1:9848] smpd manager listening on port 51149 
[-1:9796] closing the pipe to the manager 
[-1:9848] Authentication completed. Successfully obtained Context for Client. 
[-1:9848] Authorization completed. 
[-1:9848] version check complete, using PMP version 3. 
[-1:9848] Received session header from parent id=1, parent=0, level=0 
[01:9848] Connecting back to parent using host SERVER and endpoint 17979 
[01:9848] Previous attempt failed with error 5, trying to authenticate without Kerberos 
[01:9848] Failed to connect back to parent error 5. 
[01:9848] ERROR: Failed to connect back to parent 'ncacn_ip_tcp:SERVER:17979' error 5 
[01:9848] smpd manager successfully stopped listening. 
[01:9848] SMPD exiting with error code 4294967293. 

and on the host:

[-1:12264] Launching SMPD service. 
[-1:12264] smpd listening on port 8677 
[-1:12264] Authentication completed. Successfully obtained Context for Client. 
[-1:12264] version check complete, using PMP version 3. 
[-1:12264] create manager process (using smpd daemon credentials) 
[-1:12264] smpd reading the port string from the manager 
[-1:16668] Launching smpd manager instance. 
[-1:16668] created set for manager listener, 364 
[-1:16668] smpd manager listening on port 18033 
[-1:12264] closing the pipe to the manager 
[-1:16668] Authentication completed. Successfully obtained Context for Client. 
[-1:16668] Authorization completed. 
[-1:16668] version check complete, using PMP version 3. 
[-1:16668] Received session header from parent id=1, parent=0, level=0 
[01:16668] Connecting back to parent using host SERVER and endpoint 18031 
[01:16668] Authentication completed. Successfully obtained Context for Client. 
[01:16668] Authorization completed. 
[01:16668] handling command SMPD_CONNECT src=0 
[01:16668] now connecting to COMPUTE 
[01:16668] 1 -> 2 : returning SMPD_CONTEXT_LEFT_CHILD 
[01:16668] using spn msmpi/COMPUTE to contact server 
[01:16668] SERVER posting a re-connect to COMPUTE:51161 in left child context. 
[01:16668] ERROR: Failed to connect to SMPD Manager Instance error 1726 
[01:16668] sending abort command to parent context. 
[01:16668] posting command SMPD_ABORT to parent, src=1, dest=0. 
[01:16668] ERROR: smpd running on SERVER is unable to connect to smpd service on COMPUTE:8677 
[01:16668] Handling cmd=SMPD_ABORT result 
[01:16668] cmd=SMPD_ABORT result will be handled locally 
[01:16668] parent terminated unexpectedly - initiating cleaning up. 
[01:16668] no child processes to kill - exiting with error code -1 
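The numeric codes in these logs are standard Windows error codes: 5 is ERROR_ACCESS_DENIED, 1722 is RPC_S_SERVER_UNAVAILABLE and 1726 is RPC_S_CALL_FAILED. The exit code 4294967293 is -3 printed as an unsigned 32-bit value. A minimal sketch for pulling the codes out of saved log lines follows; the helper names are hypothetical, and the pattern only covers the "error N" and "errno N" forms seen above.

```python
import re

# Standard Win32 / RPC error codes that appear in the output above.
KNOWN_ERRORS = {
    5: "ERROR_ACCESS_DENIED",
    1722: "RPC_S_SERVER_UNAVAILABLE",
    1726: "RPC_S_CALL_FAILED",
}

# Matches "error 5", "errno 1722", ... but not "error code 4294967293".
ERROR_RE = re.compile(r"\berr(?:or|no) (\d+)")


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed one."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def error_codes(lines):
    """Collect the distinct codes in order of first appearance."""
    found = []
    for line in lines:
        for match in ERROR_RE.finditer(line):
            code = int(match.group(1))
            if code not in found:
                found.append(code)
    return found
```

Running `error_codes` over both logs and looking each result up in `KNOWN_ERRORS` gives a quick summary of which side is refusing access and which side is losing the connection.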

Answer


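After some experimenting, I found that these and other non-specific errors occur when trying to run MS MPI in mixed configurations (in my case, a mix of HPC Cluster 2008 and HPC Cluster 2012 with MSMPI).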

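The solution was to downgrade all nodes to Windows Server 2008 R2 with HPC Cluster 2008. Because I do not use AD, I had to fall back to using the smpd daemon and add firewall rules for it (skipping the cluster management tools altogether).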
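The exact firewall rules are not given here. As a sketch of what such a rule might look like: the logs show the smpd manager listening on changing ports (51149, 18033), so a program-based rule is probably easier than a fixed port rule. The helper below only builds a `netsh advfirewall firewall add rule` command line to paste into an elevated prompt. The function name, the rule name and the install path are assumptions, not taken from the original setup.

```python
# Assumed default MS-MPI install location; adjust to the actual path.
SMPD_PATH = r"C:\Program Files\Microsoft MPI\Bin\smpd.exe"


def firewall_rule_command(rule_name, program):
    """Build a netsh command that allows inbound traffic to one program."""
    return (
        "netsh advfirewall firewall add rule "
        f'name="{rule_name}" dir=in action=allow '
        f'program="{program}" enable=yes'
    )


if __name__ == "__main__":
    # Prints the command instead of running it, so nothing is changed.
    print(firewall_rule_command("MS-MPI smpd", SMPD_PATH))
```

Since the log also shows the remote manager connecting back to the parent on SERVER, a similar rule for mpiexec.exe may be needed on the launching machine.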