Multiprocessing Pool with Manager Namespace: EOFError

Problem description

When I share a pandas DataFrame through a multiprocessing Manager Namespace and every target function calls .sample(5000) on that DataFrame, I get an EOFError.

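import multiprocessing as mp

import pandas as pd
from psutil import cpu_count  # assumption: cpu_count(logical=...) matches psutil's signature

# sharedData is the Manager Namespace shared with the workers. Its creation,
# load_data, paths and my_error are not shown in the question.

def get_sample(i):
    print("start round {}".format(i))
    # Every access to sharedData.data is a round trip to the manager process.
    sample = sharedData.data.sample(5000, random_state=i)
    # (the rest of the function is not shown)

if __name__ == '__main__':
    with mp.Pool(cpu_count(logical=False)) as pool0:
        results = pool0.map(load_data, paths)
        sharedData.data = pd.concat(results, axis=0, copy=False)
        genes = sharedData.data.columns
        pool0.close()
        pool0.join()
        del results

    # sampling
    with mp.Pool(cpu_count(logical=True)) as pool:
        print("start sampling, total round = {}".format(1000))
        r = pool.map_async(get_sample, range(1000), error_callback=my_error)
        results2 = r.get()
        pool.close()
        pool.join()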

This is the output and traceback:

start round 145
round35 returns output
round18 returns output
rount161 returns output
start round 704
start round 720
start round 736
start round 752
start round 768
start round 784
start round 800
start round 816
start round 832
start round 848
start round 864
start round 880
start round 896
start round 912
start round 928
start round 944
start round 960
start round 976
start round 992
from error_callback: 

multiprocessing.pool.RemoteTraceback: 
multiprocessing.pool.RemoteTraceback: 
"""

Traceback (most recent call last):
  File "/usr/usc/python/3.6.0/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/usc/python/3.6.0/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "sampling2temp.py", line 38, in get_sample_ys
    sample = sharedData.data.sample(5000, random_state=i)
  File "/usr/usc/python/3.6.0/lib/python3.6/multiprocessing/managers.py", line 1060, in __getattr__
    return callmethod('__getattribute__', (key,))
  File "/usr/usc/python/3.6.0/lib/python3.6/multiprocessing/managers.py", line 757, in _callmethod
    kind, result = conn.recv()
  File "/usr/usc/python/3.6.0/lib/python3.6/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/usc/python/3.6.0/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/usc/python/3.6.0/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "sampling2temp.py", line 105, in <module>
    results2 = r.get()
  File "/usr/usc/python/3.6.0/lib/python3.6/multiprocessing/pool.py", line 608, in get
    raise self._value
EOFError

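It looks as if tasks 704 through 992 never returned any output, and then the Manager process shut down. So when one of the running tasks reads from manager.namespace.data, it receives EOF.

By the way, if I change sample(5000) to sample(2500) and reduce the size of Manager.Namespace.data from 2127096024 bytes to 1738281624 bytes, the EOF problem goes away. Is that because each worker uses too much memory?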

Tags: python, pandas, multiprocessing

Solution


The receiving end of a multiprocessing.Connection raises EOFError when all associated sending ends have been closed and there is nothing left to receive.
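A minimal sketch of that behaviour, using multiprocessing.Pipe in a single process rather than a Manager. The function name recv_after_close is made up for this illustration. It shows that data already sent is still delivered, and that the next recv() raises EOFError once the only sending end is closed.

```python
import multiprocessing as mp


def recv_after_close():
    """Return the received message and the name of the error raised afterwards."""
    # With duplex=False the first Connection can only receive,
    # the second can only send.
    receiver, sender = mp.Pipe(duplex=False)
    sender.send("hello")
    sender.close()  # the only sending end goes away

    first = receiver.recv()  # data sent before the close is still delivered
    try:
        receiver.recv()  # nothing left to read and the peer is closed
    except EOFError as exc:
        return first, type(exc).__name__
    return first, None


if __name__ == "__main__":
    print(recv_after_close())
```

In the question the sending end lives in the manager process, so the same error in a worker suggests that the manager process went away while the worker was waiting for a reply.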

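Judging from the stack trace, multiprocessing.Manager uses multiprocessing.Connection under the hood. Since your code does not appear to shut down the manager process prematurely, I think the manager process must be hitting an exception and terminating before you are finished with it. Because reducing the sample size seems to fix the problem, the manager process may be getting killed by the OOM killer for using too much memory. You can check whether that is the case with the command suggested in the linked article: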

dmesg | egrep -i "killed process"

You would expect to see something like this:

host kernel: Out of Memory: Killed process 1234 (python).
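One reason memory could climb this way: the traceback shows sharedData.data going through the proxy's __getattr__ and then conn.recv(), which means every attribute access asks the manager process for the value and receives a pickled copy of it. The sketch below illustrates this with a small list standing in for the roughly 2 GB DataFrame. The function name namespace_returns_copies and the stand-in data are assumptions made for the example, not part of the original code.

```python
import multiprocessing as mp
import pickle


def namespace_returns_copies():
    """Show that each attribute read on a Namespace proxy yields a new copy."""
    with mp.Manager() as manager:
        ns = manager.Namespace()
        ns.data = list(range(1000))  # stand-in for the large DataFrame

        a = ns.data  # round trip to the manager process
        b = ns.data  # second round trip, second copy
        # Size of the pickled value, as a rough guide to what each
        # access has to transfer over the connection.
        payload = len(pickle.dumps(a))
        return a == b, a is b, payload


if __name__ == "__main__":
    equal, same_object, payload = namespace_returns_copies()
    print("equal:", equal, "same object:", same_object, "pickled bytes:", payload)
```

If the same holds for the DataFrame, each of the 1000 tasks pulls its own full copy, and with many workers doing so at once the manager has to serialize several copies concurrently, which would fit the observation that a smaller frame avoids the error.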
