2016-10-05 63 views
1

Context

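I need to run a multiprocessing.Process inside a multiprocessing.pool.ThreadPool. It seems odd at first, but it is the only way I have found to handle a segfault that may occur because I am using a C++ shared library. If a segfault occurs, the process is killed, and I can check process.exitcode and deal with it.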
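To make the exitcode check concrete, here is a minimal sketch, assuming a Unix platform where the "fork" start method is available. The helper names crash, run_isolated and describe are hypothetical, and the crash is simulated by sending SIGSEGV to the child itself rather than by calling a real shared library. It relies on the documented behaviour that exitcode is -N when the child was terminated by signal N.

```python
import multiprocessing
import os
import signal


def crash():
    # Stand-in for native code that segfaults: deliver SIGSEGV to ourselves.
    os.kill(os.getpid(), signal.SIGSEGV)


def run_isolated(target):
    # Assumption: a Unix platform, so the "fork" start method exists.
    ctx = multiprocessing.get_context("fork")
    process = ctx.Process(target=target)
    process.start()
    process.join()
    return process.exitcode


def describe(exitcode):
    # A negative exit code -N means the child was terminated by signal N.
    if exitcode == 0:
        return "ok"
    if exitcode < 0:
        return "killed by " + signal.Signals(-exitcode).name
    return "failed with status %d" % exitcode
```

The parent survives the crash and can log or retry based on the string returned by describe(run_isolated(crash)).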

Problem

After a while, a deadlock occurs when I try to join the process.

Here is a simplified version of my code:

import sys, time, multiprocessing
from multiprocessing.pool import ThreadPool

def main():
    # Launch 8 worker threads
    pool = ThreadPool(8)
    it = pool.imap(run, range(500))
    while True:
        try:
            it.next()
        except StopIteration:
            break

def run(value):
    # Each worker thread launches its own Process
    process = multiprocessing.Process(target=run_and_might_segfault, args=(value,))
    process.start()

    while process.is_alive():
        sys.stdout.write('.')
        sys.stdout.flush()
        time.sleep(0.1)

    # After a while this never returns, because of a mysterious deadlock
    process.join()

    # Deal with process.exitcode here to log errors

def run_and_might_segfault(value):
    # Load a shared library and do stuff (could throw a C++ exception, segfault, ...)
    print(value)

if __name__ == '__main__':
    main()

Here is a possible output:

➜ ~ python m.py 
..0 
1 
........8 
.9 
.......10 
......11 
........12 
13 
........14 
........16 
........................................................................................ 

As you can see, after a few iterations process.is_alive() stays true forever, and the process never joins.

If I CTRL-C the script, I get this stack trace:

Traceback (most recent call last): 
    File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py", line 680, in next 
    item = self._items.popleft() 
IndexError: pop from an empty deque 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 
    File "m.py", line 30, in <module> 
    main() 
    File "m.py", line 9, in main 
    it.next() 
    File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py", line 684, in next 
    self._cond.wait(timeout) 
    File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 293, in wait 
    waiter.acquire() 
KeyboardInterrupt 

Error in atexit._run_exitfuncs: 
Traceback (most recent call last): 
    File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/popen_fork.py", line 29, in poll 
    pid, sts = os.waitpid(self.pid, flag) 
KeyboardInterrupt 

PS: I am using Python 3.5.2 on macOS.

Any kind of help is appreciated, thanks.

Edit

I tried with Python 2.7 and it works well. Could this be a Python 3.5-only issue?

Answer

4

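The problem also reproduces on the latest CPython build, Python 3.7.0a0 (default:4e2cce65e522, Oct 13 2016, 21:55:44).

If you attach to one of the stuck processes with gdb, you will see that it is trying to acquire a lock in the sys.stdout.flush() call: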

(gdb) py-list 
263    import traceback 
264    sys.stderr.write('Process %s:\n' % self.name) 
265    traceback.print_exc() 
266   finally: 
267    util.info('process exiting with exitcode %d' % exitcode) 
>268    sys.stdout.flush() 
269    sys.stderr.flush() 
270 
271   return exitcode 

The Python-level backtrace looks like this:

(gdb) py-bt 
Traceback (most recent call first): 
    File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/process.py", line 268, in _bootstrap 
    sys.stdout.flush() 
    File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/popen_fork.py", line 74, in _launch 
    code = process_obj._bootstrap() 
    File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/popen_fork.py", line 20, in __init__ 
    self._launch(process_obj) 
    File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/context.py", line 277, in _Popen 
    return Popen(process_obj) 
    File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/context.py", line 223, in _Popen 
    return _default_context.get_context().Process._Popen(process_obj) 
    File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/process.py", line 105, in start 
    self._popen = self._Popen(self) 
    File "deadlock.py", line 17, in run 
    process.start() 
    File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/pool.py", line 119, in worker 
    result = (True, func(*args, **kwds)) 
    File "/home/rpodolyaka/src/cpython/Lib/threading.py", line 864, in run 
    self._target(*self._args, **self._kwargs) 
    File "/home/rpodolyaka/src/cpython/Lib/threading.py", line 916, in _bootstrap_inner 
    self.run() 
    File "/home/rpodolyaka/src/cpython/Lib/threading.py", line 884, in _bootstrap 
    self._bootstrap_inner() 

At the interpreter level, it looks like this:

(gdb) frame 6 

(gdb) list 
287  return 0; 
288 } 
289 relax_locking = (_Py_Finalizing != NULL); 
290 Py_BEGIN_ALLOW_THREADS 
291 if (!relax_locking) 
292  st = PyThread_acquire_lock(self->lock, 1); 
293 else { 
294  /* When finalizing, we don't want a deadlock to happen with daemon 
295   * threads abruptly shut down while they owned the lock. 
296   * Therefore, only wait for a grace period (1 s.). ... */ 

(gdb) p /x self->lock 
$1 = 0xd25ce0 

(gdb) p /x self->owner 
$2 = 0x7f9bb2128700 

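Note that, from the point of view of this particular child process, the lock is still owned by one of the threads of the parent process (LWP 1105):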

(gdb) info threads 
    Id Target Id   Frame 
* 1 Thread 0x7f9bb5559440 (LWP 1102) "python" 0x00007f9bb5157577 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, 
    futex_word=0xe4d340) at ../sysdeps/unix/sysv/linux/futex-internal.h:205 
    2 Thread 0x7f9bb312a700 (LWP 1103) "python" 0x00007f9bb4780253 in select() at ../sysdeps/unix/syscall-template.S:84 
    3 Thread 0x7f9bb2929700 (LWP 1104) "python" 0x00007f9bb4780253 in select() at ../sysdeps/unix/syscall-template.S:84 
    4 Thread 0x7f9bb2128700 (LWP 1105) "python" 0x00007f9bb4780253 in select() at ../sysdeps/unix/syscall-template.S:84 
    5 Thread 0x7f9bb1927700 (LWP 1106) "python" 0x00007f9bb4780253 in select() at ../sysdeps/unix/syscall-template.S:84 
    6 Thread 0x7f9bb1126700 (LWP 1107) "python" 0x00007f9bb4780253 in select() at ../sysdeps/unix/syscall-template.S:84 
    7 Thread 0x7f9bb0925700 (LWP 1108) "python" 0x00007f9bb4780253 in select() at ../sysdeps/unix/syscall-template.S:84 
    8 Thread 0x7f9b9bfff700 (LWP 1109) "python" 0x00007f9bb4780253 in select() at ../sysdeps/unix/syscall-template.S:84 
    9 Thread 0x7f9b9b7fe700 (LWP 1110) "python" 0x00007f9bb4780253 in select() at ../sysdeps/unix/syscall-template.S:84 
    10 Thread 0x7f9b9affd700 (LWP 1111) "python" 0x00007f9bb4780253 in select() at ../sysdeps/unix/syscall-template.S:84 
    11 Thread 0x7f9b9a7fc700 (LWP 1112) "python" 0x00007f9bb5157577 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, 
    futex_word=0x7f9b80001ed0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205 
    12 Thread 0x7f9b99ffb700 (LWP 1113) "python" 0x00007f9bb5157577 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, 
    futex_word=0x7f9b84001bb0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205 

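So this is indeed a deadlock. It happens because the original process writes to and flushes sys.stdout from multiple threads concurrently while also creating child processes. By the nature of the fork(2) system call, the child inherits the parent's memory, including any locks that are currently held. A fork() call must have been executed while one thread held the lock, and even though the parent eventually releases it, the children never see that, because each of them has its own copy-on-write memory space. The thread that held the lock does not exist in the child, so nothing there will ever release it.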
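The mechanism can be reproduced in isolation. This is only a sketch under stated assumptions: it uses a plain threading.Lock in place of the internal lock of sys.stdout, it calls os.fork() directly (so it is Unix-only), and the function name child_sees_lock_held is made up for illustration. A timeout is used in the child so that the demonstration terminates instead of hanging.

```python
import os
import threading
import time


def child_sees_lock_held(hold_seconds=0.3):
    lock = threading.Lock()
    acquired = threading.Event()

    def holder():
        # Plays the role of a thread that is in the middle of a stdout write.
        with lock:
            acquired.set()
            time.sleep(hold_seconds)

    thread = threading.Thread(target=holder)
    thread.start()
    acquired.wait()  # make sure the lock is held before forking

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread was copied. The holder thread does
        # not exist here, so the copied lock stays locked.
        got_it = lock.acquire(timeout=1.0)
        os._exit(0 if got_it else 1)

    # Parent: the holder thread releases the parent's copy of the lock.
    thread.join()
    parent_can_acquire = lock.acquire(timeout=1.0)
    _, status = os.waitpid(pid, 0)
    child_timed_out = os.WEXITSTATUS(status) == 1
    return parent_can_acquire, child_timed_out
```

The parent can take the lock again once its thread is done, while the child gives up after the timeout. Without the timeout, the child would block forever, which is what the stuck processes above are doing.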

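Therefore, you need to be very careful when mixing multithreading with multiprocessing, and make sure that all locks are properly released before fork() if they are going to be used in the child processes.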
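One way to apply that advice to the snippet from the question is to serialize stdout access and process creation with a single lock, so that no fork can happen while another thread is inside sys.stdout. This is a hedged sketch rather than a definitive fix: the names _io_fork_lock, safe_write, run_guarded and run_all are hypothetical, the "fork" start method is requested explicitly (so it assumes Unix), and it only protects the stdout lock, not locks held inside third-party libraries.

```python
import multiprocessing
import sys
import threading
from multiprocessing.pool import ThreadPool

# Hypothetical guard: every stdout write and every fork takes this lock,
# so a child is never created while another thread is writing to stdout.
_io_fork_lock = threading.Lock()
_fork_ctx = multiprocessing.get_context("fork")  # assumption: Unix platform


def _child(value):
    # Stand-in for the code that loads the shared library and might segfault.
    pass


def safe_write(text):
    with _io_fork_lock:
        sys.stdout.write(text)
        sys.stdout.flush()  # leave the buffer empty before releasing the lock


def run_guarded(value):
    process = _fork_ctx.Process(target=_child, args=(value,))
    with _io_fork_lock:
        process.start()  # the fork happens while no thread is inside stdout
    while process.is_alive():
        safe_write(".")
        process.join(0.05)
    process.join()
    return process.exitcode


def run_all(count=20, workers=4):
    with ThreadPool(workers) as pool:
        return list(pool.imap(run_guarded, range(count)))
```

Calling run_all() returns one exit code per task, which can then be inspected as in the question. Another option, not shown here, is to avoid fork altogether by using a different start method.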

It is very similar to what is described in http://bugs.python.org/issue6721.

Also, if you remove the interaction with sys.stdout from your snippet, it works correctly.