I assume you can load the whole dataset into RAM as a single numpy array, and that you're working on Linux or a Mac. (If you're on Windows, or the array won't fit into RAM, you should instead copy the array to a file on disk and use numpy.memmap to access it. Your computer will cache the data from disk into RAM as well as it can, and those caches will be shared between processes, so it's not a terrible solution.)
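For reference, here is a minimal sketch of that memmap fallback; the file name, dtype, and shape are made up for illustration:

import numpy as np

# One-time setup: dump the dataset to a file on disk
# (file name, dtype, and shape are illustrative).
shape = (10000, 1000)
mm = np.memmap('big_data.dat', dtype='float64', mode='w+', shape=shape)
mm[:] = np.random.rand(*shape)
mm.flush()

# Each process opens the same file read-only; the OS page cache keeps
# the hot parts in RAM and shares them between processes.
mm_ro = np.memmap('big_data.dat', dtype='float64', mode='r', shape=shape)
print(mm_ro[0, :5])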
Under the assumptions above, if the other processes created via multiprocessing only need read-only access to the dataset, you can simply create the dataset and then launch the other processes. They will have access to the data from the original namespace. They can modify that data, but the changes won't be visible to the other processes (the memory manager copies each page of memory they alter into their own local memory map).
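For example, a minimal sketch of the read-only case (the array and worker function are made up for illustration, and it assumes the default fork start method on Linux):

import multiprocessing
import numpy as np

# Created in the parent before any workers are started.
big_data = np.arange(12, dtype=np.float64).reshape(3, 4)

def reader():
    # With fork, the child sees the parent's array without copying it.
    print("child sees sum:", big_data.sum())
    # This write only touches the child's copy-on-write pages;
    # the parent's array is unaffected.
    big_data[:] = 0

if __name__ == '__main__':
    p = multiprocessing.Process(target=reader)
    p.start()
    p.join()
    print("parent still has sum:", big_data.sum())  # unchanged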
If the other processes need to alter the original dataset, and those changes need to be visible to the parent process or to the other processes, you can use something like this:
import multiprocessing
import numpy as np
# create your big dataset
big_data = np.zeros((3, 3))
# create a shared-memory wrapper for big_data's underlying data
# (it doesn't matter what datatype we use, and 'c' is easiest)
# With lock=True you would get a synchronized wrapper that serializes every
# access through a lock, which you don't want here.
# Note: you will need to setup your own method to synchronize access to big_data.
buf = multiprocessing.Array('c', big_data.data, lock=False)
# at this point, buf holds a shared-memory copy of big_data's bytes; the two
# buffers are separate, so changes to one aren't seen in the other until you
# repoint big_data at the shared buffer:
big_data.data = buf
# now you can update big_data from any process:
def add_one_direct():
    big_data[:] = big_data + 1

def add_one(a):
    # People say this won't work, since Process() will pickle the argument.
    # But with the fork start method used here the argument isn't pickled:
    # the child inherits the parent's memory, so it works OK.
    a[:] = a + 1
print "starting value:"
print big_data
p = multiprocessing.Process(target=add_one_direct)
p.start()
p.join()
print "after add_one_direct():"
print big_data
p = multiprocessing.Process(target=add_one, args=(big_data,))
p.start()
p.join()
print "after add_one():"
print big_data
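As a side note, recent numpy releases deprecate assigning to an array's .data attribute, so if the snippet above complains you can get the same effect by allocating the shared buffer first and wrapping it with np.frombuffer. This is a sketch of that alternative (not from the original answer), again assuming the fork start method on Linux/Mac:

import multiprocessing
import numpy as np

# Allocate the shared buffer first, then view it as a numpy array.
shape = (3, 3)
buf = multiprocessing.Array('d', int(np.prod(shape)), lock=False)
big_data = np.frombuffer(buf, dtype=np.float64).reshape(shape)

def add_one():
    big_data[:] += 1  # writes land in the shared buffer

if __name__ == '__main__':
    p = multiprocessing.Process(target=add_one)
    p.start()
    p.join()
    print(big_data)  # the parent sees the child's update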