2017-09-25

Using mpirun -np X with TensorFlow: is X limited by the number of GPUs?

I want to use MPI with TensorFlow. For an example of such code, see this OpenAI baselines PPO code. It says to run the following command:

$ mpirun -np 8 python -m baselines.ppo1.run_atari 

I have a machine with one GPU (with 12GB of RAM) and TensorFlow 1.3.0 installed, using Python 3.5.3. When I run this code, I get the following error:

2017-09-24 17:29:12.975967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: TITAN X (Pascal) 
major: 6 minor: 1 memoryClockRate (GHz) 1.531 
pciBusID 0000:01:00.0 
Total memory: 11.90GiB 
Free memory: 11.17GiB 
2017-09-24 17:29:12.975990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-09-24 17:29:12.975996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y 
2017-09-24 17:29:12.976011: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0) 
2017-09-24 17:29:12.987133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: TITAN X (Pascal) 
major: 6 minor: 1 memoryClockRate (GHz) 1.531 
pciBusID 0000:01:00.0 
Total memory: 11.90GiB 
Free memory: 11.17GiB 
2017-09-24 17:29:12.987159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-09-24 17:29:12.987165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y 
2017-09-24 17:29:12.987172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0) 
[2017-09-24 17:29:12,994] Making new env: PongNoFrameskip-v4 
2017-09-24 17:29:13.017845: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 
2017-09-24 17:29:13.022347: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: TITAN X (Pascal) 
major: 6 minor: 1 memoryClockRate (GHz) 1.531 
pciBusID 0000:01:00.0 
Total memory: 11.90GiB 
Free memory: 104.81MiB 
2017-09-24 17:29:13.022394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-09-24 17:29:13.022415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y 
2017-09-24 17:29:13.022933: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0) 
2017-09-24 17:29:13.026338: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 104.81M (109903872 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY 

(This is only the first part of the error message, which is very long, but I think this opening part is the most important thing to look at.)
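One way to read the log above is to pull out the "Free memory" value from each "Found device" report. The sketch below does that for the three values shown in the log; `LOG_EXCERPT`, `free_memory_mib`, and the unit table are hypothetical names introduced only for this illustration, and the reports are listed in log order, which is not necessarily MPI rank order.

```python
import re

# The three "Free memory" lines copied from the log above.
LOG_EXCERPT = """\
Free memory: 11.17GiB
Free memory: 11.17GiB
Free memory: 104.81MiB
"""

# Conversion factors to MiB for the two units that appear in the log.
_UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def free_memory_mib(log_text):
    """Return every 'Free memory' reading in log_text, converted to MiB."""
    pattern = re.compile(r"Free memory:\s*([0-9.]+)(MiB|GiB)")
    return [float(value) * _UNIT_TO_MIB[unit]
            for value, unit in pattern.findall(log_text)]


for index, mib in enumerate(free_memory_mib(LOG_EXCERPT)):
    print("device report %d: %.2f MiB free" % (index, mib))
```

Read this way, the first two reports see almost the whole card as free, while the third finds only 104.81MiB left, which is the amount the failed allocation in the last log line asks for.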

However, the command runs without this error if I use mpirun -np 1.

Searching online, I found a repository from Uber which says that "to run on a machine with 4 GPUs" I need to use:

$ mpirun -np 4 python train.py 

I just want to confirm that in mpirun -np X, X is limited by the number of GPUs on the machine, assuming we are running a TensorFlow program.

Answer


After reading more about MPI, I am fairly confident that the number of processes is indeed limited by the number of GPUs. Reasoning:

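  • The command mpirun -np X runs X "copies" of the code, each with its own rank. See the documentation here.
  • Each copy of the code needs a GPU.
  • By default, TensorFlow effectively lets only one program use a given GPU at a time, because the first process reserves nearly all of the GPU's memory (in the log above, the third process finds only 104.81MiB free). In other words, you cannot run python tf_program1.py and python tf_program2.py simultaneously if both use TensorFlow and each needs its own GPU on your machine.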
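The reasoning above can be sketched as a mapping from MPI ranks to GPU indices. This is a minimal illustration under stated assumptions, not what baselines or the Uber repository does: the helper names are hypothetical, the round-robin assignment is just one possible policy, and `OMPI_COMM_WORLD_LOCAL_RANK` is the variable set by Open MPI's mpirun (other MPI implementations use different names).

```python
import os


def local_rank(environ=os.environ):
    """Return this process's rank on the local machine (0 outside mpirun)."""
    # Assumption: Open MPI, which exports OMPI_COMM_WORLD_LOCAL_RANK.
    return int(environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))


def gpu_for_rank(rank, num_gpus):
    """Pick a GPU index for a rank using round-robin assignment."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return rank % num_gpus


def ranks_per_gpu(num_procs, num_gpus):
    """Map each GPU index to the list of ranks that would land on it."""
    mapping = {gpu: [] for gpu in range(num_gpus)}
    for rank in range(num_procs):
        mapping[gpu_for_rank(rank, num_gpus)].append(rank)
    return mapping


def pin_gpu(environ, num_gpus):
    """Restrict the current process to one GPU via CUDA_VISIBLE_DEVICES.

    This has to happen before TensorFlow is imported in that process.
    """
    gpu = gpu_for_rank(local_rank(environ), num_gpus)
    environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return gpu
```

With `ranks_per_gpu(8, 1)`, all eight ranks from the question's command land on GPU 0, whereas `ranks_per_gpu(4, 4)` gives each rank its own GPU, matching the "4 GPUs, -np 4" example.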

So it looks like I am forced to use a single process.