
Tensorflow crash in mnist-deep.py

I am currently running training with the mnist-deep.py tutorial from TensorFlow on a GeForce 1080 (8 GB), in a machine with 16 GB of RAM. All the latest CUDA libraries and drivers are installed, and everything runs on TensorFlow 1.3. The mnist-deep.py script had always worked without any errors, until I ran some Keras VDSR training (https://github.com/jackie840129/VDSR-reduction_with-Keras). That training hung and the GPU was lost (no longer reachable via nvidia-smi). Since rebooting, every attempt to run mnist-deep.py produces the error below. I still don't know what could be causing the problem; rebooting and reinstalling CUDA do not fix it. Re-imaging the machine does seem to fix it, but that is hardly a practical workaround. Any ideas what might be causing this and how to resolve it?

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes. 
Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz 
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes. 
Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz 
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes. 
Extracting /tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz 
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes. 
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz 
Saving graph to: /tmp/tmpgb1l75z_ 
2017-10-18 15:36:28.098787: W 
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow 
library wasn't compiled to use SSE4.1 instructions, but these are 
available on your machine and could speed up CPU computations. 
2017-10-18 15:36:28.098807: W 
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow 
library wasn't compiled to use SSE4.2 instructions, but these are 
available on your machine and could speed up CPU computations. 
2017-10-18 15:36:28.098814: W 
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow 
library wasn't compiled to use AVX instructions, but these are 
available on your machine and could speed up CPU computations. 
2017-10-18 15:36:28.098820: W 
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow 
library wasn't compiled to use AVX2 instructions, but these are 
available on your machine and could speed up CPU computations. 
2017-10-18 15:36:28.098825: W 
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow 
library wasn't compiled to use FMA instructions, but these are 
available on your machine and could speed up CPU computations. 
2017-10-18 15:36:28.760202: I 
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful 
NUMA node read from SysFS had negative value (-1), but there must be 
at least one NUMA node, so returning NUMA node zero 
2017-10-18 15:36:28.760643: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 
with properties: 
name: GeForce GTX 1080 
major: 6 minor: 1 memoryClockRate (GHz) 1.7715 
pciBusID 0000:01:00.0 
Total memory: 7.92GiB 
Free memory: 7.81GiB 
2017-10-18 15:36:28.760657: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-10-18 15:36:28.760664: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y 
2017-10-18 15:36:28.760672: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating 
TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci 
bus id: 0000:01:00.0) 
2017-10-18 15:36:31.546892: E 
tensorflow/stream_executor/cuda/cuda_driver.cc:1073] failed to get 
elapsed time between events: CUDA_ERROR_NOT_READY 
2017-10-18 15:36:32.547035: E 
tensorflow/stream_executor/cuda/cuda_driver.cc:1073] failed to get 
elapsed time between events: CUDA_ERROR_NOT_READY 
2017-10-18 15:36:32.549299: E 
tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create 
cublas handle: CUBLAS_STATUS_NOT_INITIALIZED 
2017-10-18 15:36:32.549317: W 
tensorflow/stream_executor/stream.cc:1756] attempting to perform BLAS 
operation using StreamExecutor without BLAS support 
Traceback (most recent call last): 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/client/session.py", line 1327, in _do_call 
return fn(*args) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/client/session.py", line 1306, in _run_fn 
status, run_metadata) 
File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__ 
next(self.gen) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/framework/errors_impl.py", line 466, in 
raise_exception_on_not_ok_status 
pywrap_tensorflow.TF_GetCode(status)) 
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM 
launch failed : a.shape=(50, 3136), b.shape=(3136, 1024), m=50, 
n=1024, k=3136 
[[Node: fc1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, 
transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"] 
(fc1/Reshape, fc1/Variable/read)]] 
[[Node: Mean_1/_7 = _Recv[client_terminated=false, 
recv_device="/job:localhost/replica:0/task:0/cpu:0", 
send_device="/job:localhost/replica:0/task:0/gpu:0", 
send_device_incarnation=1, tensor_name="edge_79_Mean_1", 
tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"] 
()]] 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 
File "mnist_deep.py", line 178, in <module> 
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/platform/app.py", line 48, in run 
_sys.exit(main(_sys.argv[:1] + flags_passthrough)) 
File "mnist_deep.py", line 165, in main 
x: batch[0], y_: batch[1], keep_prob: 1.0}) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/framework/ops.py", line 541, in eval 
return _eval_using_default_session(self, feed_dict, self.graph, 
session) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/framework/ops.py", line 4085, in 
_eval_using_default_session 
return session.run(tensors, feed_dict) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/client/session.py", line 895, in run 
run_metadata_ptr) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/client/session.py", line 1124, in _run 
feed_dict_tensor, options, run_metadata) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/client/session.py", line 1321, in _do_run 
options, run_metadata) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/client/session.py", line 1340, in _do_call 
raise type(e)(node_def, op, message) 
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM 
launch failed : a.shape=(50, 3136), b.shape=(3136, 1024), m=50, 
n=1024, k=3136 
[[Node: fc1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, 
transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"] 
(fc1/Reshape, fc1/Variable/read)]] 
[[Node: Mean_1/_7 = _Recv[client_terminated=false, 
recv_device="/job:localhost/replica:0/task:0/cpu:0", 
send_device="/job:localhost/replica:0/task:0/gpu:0", 
send_device_incarnation=1, tensor_name="edge_79_Mean_1", 
tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"] 
()]] 

Caused by op 'fc1/MatMul', defined at: 
File "mnist_deep.py", line 178, in <module> 
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/platform/app.py", line 48, in run 
_sys.exit(main(_sys.argv[:1] + flags_passthrough)) 
File "mnist_deep.py", line 134, in main 
y_conv, keep_prob = deepnn(x) 
File "mnist_deep.py", line 83, in deepnn 
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/ops/math_ops.py", line 1844, in matmul 
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/ops/gen_math_ops.py", line 1289, in 
_mat_mul 
transpose_b=transpose_b, name=name) 
File "/home/nmh/env/lib/python3.6/site 
/tensorflow/python/framework/op_def_library.py", line 767, in apply_op 
op_def=op_def) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/framework/ops.py", line 2630, in create_op 
original_op=self._default_original_op, op_def=op_def) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/framework/ops.py", line 1204, in __init__ 
self._traceback = self._graph._extract_stack() # pylint: 
disable=protected-access 

InternalError (see above for traceback): Blas GEMM launch failed : 
a.shape=(50, 3136), b.shape=(3136, 1024), m=50, n=1024, k=3136 
[[Node: fc1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, 
transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"] 
(fc1/Reshape, fc1/Variable/read)]] 
[[Node: Mean_1/_7 = _Recv[client_terminated=false, 
recv_device="/job:localhost/replica:0/task:0/cpu:0", 
send_device="/job:localhost/replica:0/task:0/gpu:0", 
send_device_incarnation=1, tensor_name="edge_79_Mean_1", 
tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"] 
()]] 
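
For reference, any bare matrix multiply on the GPU should exercise the same cuBLAS GEMM path as the failing fc1/MatMul node. A minimal sketch I used to isolate the problem from the tutorial script (not part of mnist-deep.py itself):

import tensorflow as tf

# Same shapes as the failing node. On a healthy setup this prints
# (50, 1024); if cuBLAS cannot initialize, it should fail with the
# same "Blas GEMM launch failed" error as above.
a = tf.random_normal([50, 3136])
b = tf.random_normal([3136, 1024])
c = tf.matmul(a, b)

with tf.Session() as sess:
    print(sess.run(c).shape)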

Answer


I got the same error once. It was caused by an out-of-memory condition: the OS killed my training process, and because that kill is quite brutal I also lost contact with the GPU. A few reboots and removing my GPU settings got it back to a working state. (You can usually confirm an OOM kill in the kernel log via dmesg.)

You can look at this question to see whether your problem is the same. If so, you may have to use a smaller network.
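
If it is indeed memory pressure, one more thing worth trying before shrinking the network is to stop TensorFlow from reserving nearly all GPU memory at startup. A minimal sketch for the TF 1.x session API (the 0.7 fraction is just an example value to adjust for your card):

import tensorflow as tf

config = tf.ConfigProto()
# Allocate GPU memory on demand instead of claiming almost all of it up front.
config.gpu_options.allow_growth = True
# Alternatively, cap TensorFlow at a fixed share of the card:
# config.gpu_options.per_process_gpu_memory_fraction = 0.7

with tf.Session(config=config) as sess:
    pass  # build and run the training graph with this session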