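numpy on multicore hardware
What is the state of the art for getting numpy to use multiple cores (on Intel hardware) for operations such as inner and outer vector products, vector-matrix multiplication, and so on?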
I am happy to rebuild numpy if necessary, but for now I am looking for ways to speed things up without changing my code.
For reference, my show_config() output is below, and I have never seen numpy use more than one core:
atlas_threads_info:
    libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/local/atlas-3.9.16/lib']
    language = f77
    include_dirs = ['/usr/local/atlas-3.9.16/include']
blas_opt_info:
    libraries = ['ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/local/atlas-3.9.16/lib']
    define_macros = [('ATLAS_INFO', '"\\"3.9.16\\""')]
    language = c
    include_dirs = ['/usr/local/atlas-3.9.16/include']
atlas_blas_threads_info:
    libraries = ['ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/local/atlas-3.9.16/lib']
    language = c
    include_dirs = ['/usr/local/atlas-3.9.16/include']
lapack_opt_info:
    libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/local/atlas-3.9.16/lib']
    define_macros = [('ATLAS_INFO', '"\\"3.9.16\\""')]
    language = f77
    include_dirs = ['/usr/local/atlas-3.9.16/include']
lapack_mkl_info:
    NOT AVAILABLE
blas_mkl_info:
    NOT AVAILABLE
mkl_info:
    NOT AVAILABLE
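One rough way to check whether a given numpy build actually spreads a matrix product over several cores is to compare the CPU time the process consumes with the wall-clock time that elapses. The sketch below is only an illustration: the helper name cpu_to_wall_ratio and the matrix size are hypothetical choices, not something taken from the question. A ratio close to 1 suggests a single core was busy, while a ratio clearly above 1 suggests the BLAS behind np.dot ran on several threads.

```python
import time

import numpy as np


def cpu_to_wall_ratio(n=1500, repeats=3):
    """Return (process CPU time) / (wall-clock time) for n x n matrix products.

    Hypothetical helper for illustration; n and repeats are arbitrary.
    """
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))

    np.dot(a, b)  # warm-up call so one-time setup cost is not measured

    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time summed over all threads of this process
    for _ in range(repeats):
        np.dot(a, b)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start

    return cpu_used / wall_used


if __name__ == "__main__":
    np.show_config()  # shows which BLAS/LAPACK libraries numpy was built against
    print("CPU/wall ratio:", cpu_to_wall_ratio())
```

The ratio is only an indication: other threads in the same process, or a heavily loaded machine, can distort it.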
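I doubt you can get any speedup from multithreading the dot product of vectors of size 4000. Such a dot product takes only a few microseconds to compute. The overhead of handing the work to separate threads would probably cancel out whatever speed you might gain, even with a thread pool. – 2011-05-13 20:55:09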
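The claim about a length-4000 dot product can be checked on your own machine with a quick measurement. This is a minimal sketch using the standard timeit module; the helper name time_dot and the repetition count are hypothetical, and the result depends on the hardware and the BLAS in use.

```python
import timeit

import numpy as np


def time_dot(n=4000, number=10000):
    """Return the average time in seconds of one dot product of two length-n vectors.

    Hypothetical helper for illustration; number is an arbitrary repetition count.
    """
    rng = np.random.default_rng(0)
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)
    total = timeit.timeit(lambda: np.dot(x, y), number=number)
    return total / number


if __name__ == "__main__":
    per_call = time_dot()
    print(f"{per_call * 1e6:.2f} microseconds per dot product of length 4000")
```

If the per-call time is in the low microseconds, the cost of dispatching work to threads is likely to be of the same order, which is the point the comment makes.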
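I am multiplying a (4k ... 1.5M) x matrix by a 32M x (4k ... 1.5M) matrix and tried to do it with the multiprocessing toolbox, but that seems to incur a large memory overhead because the data is copied into the new processes (thanks, GIL). It would be great if ATLAS used all 8 cores. – Herbert 2015-06-30 14:07:16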