2017-04-22 132 views
0

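TensorFlow Nvidia 1070 GPU memory allocation errors: how to troubleshoot?

Long shot: Ubuntu 16.04 with an Nvidia 1070 with 8 GB on board? The machine has 64 GB of RAM, the dataset is 1 million records, and it runs the current CUDA and cuDNN libraries with TensorFlow 1.0 on Python 3.6.

Not sure how to troubleshoot this?

I have been working on getting some models up and running in TensorFlow and have run into this behavior a number of times. As far as I know, nothing other than TensorFlow is using the GPU memory.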

I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.645
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.56GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 7.92G (8499298304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 1.50G (1614867456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 1.50G (1614867456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cu

I then get the following, which suggests that some kind of memory allocation is taking place, yet it still fails.

`I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 899200000 totalling 4.19GiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1649756928 totalling 1.54GiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.40GiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats: 
Limit:     8499298304 
InUse:     6875780608 
MaxInUse:    6878976000 
NumAllocs:      338 
MaxAllocSize:   1649756928 

W tensorflow/core/common_runtime/bfc_allocator.cc:274] ******************************************************************************************xxxxxxxxxx 
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 6.10MiB. See logs for memory state. 
W tensorflow/core/framework/op_kernel.cc:993] Internal: Dst tensor is not initialized. 
    [[Node: linear/linear/marital_status/marital_status_weights/embedding_lookup_sparse/strided_slice/_1055 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1643_linear/linear/marital_status/marital_status_weights/embedding_lookup_sparse/strided_slice", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"]()]] 

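`

Update: I reduced the record count from millions to 40,000 and got a basic model to run to completion. I still get an error message, but not continuously. The model output includes a lot of text suggesting that I restructure the model, and I suspect the data structure is a significant part of the problem. I could still use some better hints on how to debug the whole process. Below is the rest of the console output: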

I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX 1070 
major: 6 minor: 1 memoryClockRate (GHz) 1.645 
pciBusID 0000:01:00.0 
Total memory: 7.92GiB 
Free memory: 7.52GiB 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0) 
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 7.92G (8499298304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY 
[I 09:13:09.297 NotebookApp] Saving file at /Documents/InfluenceH/Working_copies/Cond_fcast_wkg/TensorFlow+DNNLinearCombinedClassifier+for+Influence.ipynb 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0) 
+0

This unanswered question is very similar: http://stackoverflow.com/questions/42495930/tensorflow-oom-on-gpu?rq=1 – dartdog

Answers

1

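I think the problem is that TensorFlow tries to allocate 7.92 GB of GPU memory while only 7.56 GB is actually free. I can't tell you what is occupying the rest of the GPU memory, but you may be able to avoid the problem by limiting the fraction of GPU memory that the program is allowed to allocate: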

sess_config = tf.ConfigProto() 
sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 
with tf.Session(config=sess_config, ...) as ...: 

With this, the program will allocate only 90% of the GPU memory, i.e. 7.13 GB.
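As a rough sanity check of that arithmetic, the sketch below compares what a given memory fraction would request against the free memory reported in the log. The byte count and the free-memory figure are taken from the question's console output; the helper name `fraction_fits` is hypothetical and is not part of TensorFlow. Working from the exact byte count gives a slightly lower figure (roughly 7.1 GiB) than multiplying the rounded 7.92 by 0.9.

```python
GIB = 2 ** 30  # bytes per GiB

# Values copied from the question's log output.
TOTAL_BYTES = 8499298304  # "failed to allocate 7.92G (8499298304 bytes)"
FREE_GIB = 7.56           # "Free memory: 7.56GiB"


def fraction_fits(total_bytes, free_gib, fraction):
    """Hypothetical helper: return (requested GiB, whether it fits in free memory)."""
    requested_gib = total_bytes * fraction / GIB
    return requested_gib, requested_gib <= free_gib


for fraction in (1.0, 0.9):
    requested, fits = fraction_fits(TOTAL_BYTES, FREE_GIB, fraction)
    print(f"fraction={fraction}: requests {requested:.2f} GiB, fits={fits}")
```

This only checks the initial reservation against the free memory at startup. It says nothing about whether the model and its data fit inside that reservation once training begins.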

+0

I don't get what should go in place of the ... in the last line? Also see my update... – dartdog

+1

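The dots between the parentheses can be replaced with any other options used to initialize tf.Session(). These would be options you have already specified, if any. If you have no further options, delete the comma and the dots. Before the ":" you define the name by which you will refer to the tf.Session(), for example 'with tf.Session(config=sess_config) as sess_name:' – ml4294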

+0

Big help! I think I still need to restructure the model, but I am past the initial error – dartdog