How to safely terminate a TensorFlow program running on multiple GPUs

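I have implemented a network using TensorFlow. The network is trained on 4 GPUs. When I press Ctrl+C, the program crashes the NVIDIA driver and leaves behind zombie processes named "python". I cannot kill the zombie processes, and I cannot even reboot the Ubuntu system with sudo reboot. How can I safely terminate a TensorFlow program running on multiple GPUs?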
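One possible way to avoid interrupting the process in the middle of a GPU step is to turn Ctrl+C into a stop request that the training loop checks between steps. The sketch below uses only the Python standard library. `install_sigint_handler`, `run_training` and `train_step` are hypothetical names, with `train_step` standing in for the real `sess.run(...)` call. Nothing here is confirmed to prevent the driver crash described above; it only illustrates the pattern.

```python
import signal
import threading


def install_sigint_handler(stop_event):
    """Make Ctrl+C set stop_event instead of raising KeyboardInterrupt.

    Must be called from the main thread. Returns the previous handler
    so the caller can restore it later with signal.signal().
    """
    def handler(signum, frame):
        # Only record the request; the training loop decides when to exit.
        stop_event.set()

    return signal.signal(signal.SIGINT, handler)


def run_training(train_step, max_steps, stop_event):
    """Call train_step(i) until max_steps is reached or a stop is requested.

    train_step is a hypothetical callable standing in for one training
    step. Returns the number of completed steps.
    """
    steps_done = 0
    while steps_done < max_steps and not stop_event.is_set():
        train_step(steps_done)
        steps_done += 1
    return steps_done
```

In use, one would create a `threading.Event()`, call `install_sigint_handler(event)` once before the loop, and only stop the loader threads and close the session after `run_training` has returned, so that no step is cut off halfway.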

I am using a FIFO queue and a thread to read data from binary files.

coord = tf.train.Coordinator() 
t = threading.Thread(target=load_and_enqueue, args=(sess,enqueue_op, coord)) 
t.start()

After I call sess.close(), the program does not stop, and I see:

I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=4033 evicted_count=3000 eviction_rate=0.743863 and unsatisfied allocation rate=0 
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=14033 evicted_count=13000 eviction_rate=0.926388 and unsatisfied allocation rate=0

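It seems the GPU resources are not released. If I open another terminal, the nvidia-smi command does not respond. I then have to force a hard reboot of the system with: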

#echo 1 > /proc/sys/kernel/sysrq 
#echo b > /proc/sysrq-trigger

I know that sess.close() may be too brutal, so I tried emptying the FIFO queue with a dequeue op first:

while iteration < 10000:
    ...  # GPU training step (elided in the original)
    iteration += 1

# training finished

coord.request_stop()
# Drain the queue one element at a time until it reports empty.
while sess.run(get_queue_size_op) > 0:
    sess.run(dequeue_one_element_op)
    print('queue_size=' + str(sess.run(get_queue_size_op)))
    time.sleep(1)
coord.join([t])
print('finished join t')

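This approach does not work either. Basically, the program cannot terminate after reaching the maximum number of training iterations.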
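One plausible reason the join never returns is that the loader thread is blocked inside an enqueue on a full queue, so it never gets back to checking the coordinator. The following is a standard-library analogy, not TensorFlow code: the producer enqueues with a timeout and re-checks a stop flag, so a shutdown request is always noticed. `producer`, `shutdown` and `make_item` are hypothetical names. In TensorFlow 1.x the closest counterpart would be closing the queue with `queue.close(cancel_pending_enqueues=True)` before joining, though whether that fixes this particular hang is an assumption.

```python
import queue
import threading


def producer(q, stop_event, make_item):
    """Enqueue items until stop_event is set, never blocking forever."""
    i = 0
    while not stop_event.is_set():
        item = make_item(i)
        # Retry the put with a short timeout so the stop flag is re-checked
        # even while the queue stays full.
        while not stop_event.is_set():
            try:
                q.put(item, timeout=0.1)
                i += 1
                break
            except queue.Full:
                continue


def shutdown(q, stop_event, thread, timeout=5.0):
    """Request a stop, drain the queue, and wait for the producer to exit.

    Returns True if the producer thread terminated within the timeout.
    """
    stop_event.set()
    while True:
        try:
            q.get_nowait()
        except queue.Empty:
            break
    thread.join(timeout)
    return not thread.is_alive()
```

The important ordering is that the stop flag is set before draining, so the producer cannot refill the queue indefinitely while it is being emptied.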

Source

2016-01-21 read Read

Did you find a solution to this problem? I don't even use a FIFO queue or a separate thread, and I still have this issue. – Adi

@Adi No. I ended up not using multiple GPUs. :( –

https://github.com/tensorflow/tensorflow/issues/658

This solved the problem:

export CUDA_VISIBLE_DEVICES=0
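For reference, `CUDA_VISIBLE_DEVICES` takes a comma-separated list of device indices, so it can expose several GPUs as well as a single one. Below is a small sketch of setting it from Python rather than the shell. The helper name `set_visible_gpus` is hypothetical, and the variable needs to be set before the CUDA runtime is initialised (in practice, before TensorFlow is imported). It is not claimed that any particular value avoids the hang.

```python
import os


def set_visible_gpus(gpu_ids):
    """Set CUDA_VISIBLE_DEVICES to the given device indices.

    gpu_ids is an iterable of integers such as [0] or [0, 1, 2, 3].
    Returns the string that was written to the environment.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Equivalent to `export CUDA_VISIBLE_DEVICES=0` in the shell.
set_visible_gpus([0])
```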

Source

2016-03-12 01:25:43

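Actually, this does not solve the problem. Your approach restricts the program to a single GPU, but I want to speed up training with multiple GPUs. –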
