2017-09-21

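SVD in TensorFlow is slower than in numpy

I noticed that on my machine, SVD in TensorFlow runs significantly slower than in numpy. I have a GTX 1080 GPU and expected the SVD to be at least as fast as running the code on the CPU (numpy).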

Environment info

Operating system

lsb_release -a 
No LSB modules are available. 
Distributor ID: Ubuntu 
Description: Ubuntu 16.10 
Release: 16.10 
Codename: yakkety 

Installed versions of CUDA and cuDNN:

ls -l /usr/local/cuda-8.0/lib64/libcud* 
-rw-r--r-- 1 root  root 556000 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudadevrt.a 
lrwxrwxrwx 1 root  root  16 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart.so -> libcudart.so.8.0 
lrwxrwxrwx 1 root  root  19 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart.so.8.0 -> libcudart.so.8.0.61 
-rwxr-xr-x 1 root  root 415432 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart.so.8.0.61 
-rw-r--r-- 1 root  root 775162 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart_static.a 
lrwxrwxrwx 1 voldemaro users  13 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn.so -> libcudnn.so.5 
lrwxrwxrwx 1 voldemaro users  18 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn.so.5 -> libcudnn.so.5.1.10 
-rwxr-xr-x 1 voldemaro users 84163560 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn.so.5.1.10 
-rw-r--r-- 1 voldemaro users 70364814 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn_static.a 

TensorFlow setup

python -c "import tensorflow; print(tensorflow.__version__)" 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally 
1.0.0 

Code:

'''
Created on Sep 21, 2017

@author: voldemaro
'''
import time

import numpy as np
import numpy.linalg as NLA
import tensorflow as tf

N = 1534

svd_array = np.random.random_sample((N, N))
svd_array = svd_array.astype(complex)

specVar = tf.Variable(svd_array, dtype=tf.complex64)

# tf.svd returns the singular values first, then the left and right singular vectors
[D2, E1, E2] = tf.svd(specVar)

init_OP = tf.global_variables_initializer()

with tf.Session() as sess:
    # Initialize all tensorflow variables
    start = time.time()
    sess.run(init_OP)
    print('initializing variables: {} s'.format(time.time() - start))

    start_time = time.time()
    [d, e1, e2] = sess.run([D2, E1, E2])
    print('Tensorflow SVD ---: {} s'.format(time.time() - start_time))


# Equivalent numpy
start = time.time()

u, s, v = NLA.svd(svd_array)
print('numpy SVD ---: {} s'.format(time.time() - start))

Output trace:

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX 1080 
major: 6 minor: 1 memoryClockRate (GHz) 1.7335 
pciBusID 0000:01:00.0 
Total memory: 7.92GiB 
Free memory: 7.11GiB 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0) 
initializing variables: 0.230546951294 s 
Tensorflow SVD ---: 6.56117296219 s 
numpy SVD ---: 4.41714000702 s 

Answers

1

It looks like the TensorFlow op implements gesvd, whereas MKL-enabled numpy/scipy (that is, what you get if you use conda) defaults to the faster (but less numerically robust) gesdd.

You can try comparing against gesvd in scipy:

from scipy import linalg 
u0, s0, vt0 = linalg.svd(target0, lapack_driver="gesvd") 
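To make that comparison concrete, here is a minimal sketch that times both LAPACK drivers on the same matrix through scipy. The helper name `time_svd`, the matrix size, and the repeat count are assumptions chosen for illustration; set `n = 1534` to match the question. The absolute timings depend on which BLAS/LAPACK build scipy is linked against.

```python
import time

import numpy as np
from scipy import linalg


def time_svd(matrix, driver, repeats=3):
    """Return the best wall-clock time (seconds) over `repeats` SVD runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        linalg.svd(matrix, lapack_driver=driver)
        best = min(best, time.perf_counter() - start)
    return best


n = 256  # assumption: kept small so the sketch runs quickly
rng = np.random.RandomState(0)
# Same dtype as the TensorFlow variable in the question
matrix = rng.random_sample((n, n)).astype(np.complex64)

timings = {driver: time_svd(matrix, driver) for driver in ("gesdd", "gesvd")}
for driver, seconds in timings.items():
    print("{}: {:.4f} s".format(driver, seconds))

# Both drivers should agree on the singular values up to rounding error.
s_dd = linalg.svd(matrix, compute_uv=False, lapack_driver="gesdd")
s_vd = linalg.svd(matrix, compute_uv=False, lapack_driver="gesvd")
print("max singular-value difference:", np.max(np.abs(s_dd - s_vd)))
```

Taking the best of several runs reduces the effect of one-off warm-up costs, which also matter when timing the first `sess.run` call in TensorFlow.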

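I've also had better results with the MKL version, so I've been using this helper class to switch transparently between the TensorFlow and numpy versions of SVD, with tf.Variable used to store the results.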

You use it like this:

result = SvdWrapper(tensor) 
result.update() 
sess.run([result.u, result.s, result.v]) 

Issue with more details on the slowness: https://github.com/tensorflow/tensorflow/issues/13222

1

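GPU execution usually outperforms the CPU only when the parallelization is effective.

However, parallelizing the SVD algorithm is still an active research topic, which means that no parallel version has been found that is far superior to the serial implementations.

Possibly, the NumPy version is based on a highly optimized FORTRAN implementation, whereas I believe TensorFlow has its own C++ implementation, which apparently is not as optimized as the code NumPy calls.

Edit: you may not be the first to observe poorer performance of TensorFlow's SVD compared with the FORTRAN implementation.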
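One way to check which BLAS/LAPACK libraries a given NumPy build actually calls is `np.show_config()`. The sketch below prints that configuration and times `np.linalg.svd` on a random matrix. The matrix size is an assumption chosen to keep the run short, and what the configuration lists (MKL, OpenBLAS, or something else) depends entirely on how NumPy was installed.

```python
import time

import numpy as np

# Print the BLAS/LAPACK libraries this NumPy build was linked against.
np.show_config()

n = 256  # assumption: small size; use 1534 to match the question
rng = np.random.RandomState(0)
matrix = rng.random_sample((n, n))

start = time.perf_counter()
u, s, vt = np.linalg.svd(matrix)
elapsed = time.perf_counter() - start
print("numpy SVD ({}x{}): {:.4f} s".format(n, n, elapsed))

# Sanity check: multiplying the factors back together reproduces the input.
reconstruction_error = np.max(np.abs((u * s).dot(vt) - matrix))
print("max reconstruction error:", reconstruction_error)
```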


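When I profiled the code, I saw that numpy spreads the load across all 8 CPU cores (Intel Core i7), so I was somewhat expecting to see a benefit from having so many (2560) CUDA cores. – user2109066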


It looks like there was some earlier work that used the GPU and showed a 5x improvement over Intel MKL - https://s3.amazonaws.com/academia.edu.documents/30806706/Sheetal09Singular.pdf?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1506052362&Signature=gCpal%2Fk2dCnhAUXgYE4sgjqPNOo%3D&response-content-disposition=inline%3B%20filename%3DSingular_value_decomposition_on_GPU_usin.pdf – user2109066