python - Python multiprocessing child process cannot access global variable
Problem description
I created a pandas DataFrame as a global variable. I expected the child processes to be able to access the global DataFrame, but it seems that they cannot.
import multiprocessing as mp

import numpy as np
import pandas as pd
from psutil import cpu_count  # assumption: the import was not shown; cpu_count(logical=True) matches psutil

# Global DataFrame that every worker is expected to read
data = pd.DataFrame(data=np.array([[i for i in range(1000)] for j in range(500)]))

def get_sample(i):
    print("start round {}".format(i))
    sample = data.sample(500, random_state=i)
    xs = sample.sum(axis=0)
    if i < 10:
        print(data.shape())  # as posted; note that DataFrame.shape is an attribute, not a method
        print(sample.iloc[:3, :3])
    print("round {} returns output".format(i))
    return xs

samples = []

def collect(result):
    # Callback: receives the full list of results in the parent when every task succeeds
    print("collect called with {}".format(result[0][0].shape))
    global samples
    samples.extend(result)

ntasks = 1000

if __name__ == '__main__':
    samples = []
    xs = pd.DataFrame()
    # sampling
    pool = mp.Pool(cpu_count(logical=True))
    print("start sampling, total round = {}".format(ntasks))
    r = pool.map_async(get_sample, [j for j in range(ntasks)], callback=collect)
    r.wait()
    pool.close()
    pool.join()
    xs = pd.concat([sample for sample in samples], axis=1, ignore_index=True)
    xs = xs.transpose()
    print("xs: ")
    print(xs.shape)
    print(xs.iloc[:10, :10])
The global DataFrame is data. I expected that in each child process, get_sample could access data and read values from it. To confirm this, I print the shape of data in each child process. The problem is that the child processes apparently cannot get data: when I run the script, neither the shape of data nor the partial sample is printed.
Furthermore, I received this error:
Traceback (most recent call last):
  File "sampling2c.py", line 51, in <module>
    xs = pd.concat([sample for sample in samples], axis = 1, ignore_index=True)
  File "/usr/usc/python/3.6.0/lib/python3.6/site-packages/pandas/tools/merge.py", line 1451, in concat
    copy=copy)
  File "/usr/usc/python/3.6.0/lib/python3.6/site-packages/pandas/tools/merge.py", line 1484, in __init__
    raise ValueError('No objects to concatenate')
It seems the get_sample function didn't return anything, so the sampling failed.
However, when I ran an experiment to test whether child processes can access a global variable, it worked.
from multiprocessing import Manager, Pool, current_process

import pandas as pd

df = pd.DataFrame(data={'a': [1, 2, 3], 'b': [2, 4, 6]})
df['c1'] = [1, 2, 1]
df['c2'] = [2, 1, 2]
df['c3'] = [3, 4, 4]
# Global DataFrame read by the workers
df2 = pd.DataFrame(data={'a': [i for i in range(100)], 'b': [i for i in range(100, 200)]})
l = [1, 2, 3]
Mgr = Manager()
results = []

def collect(result):
    global results
    #print("collect called with {}".format(result))
    results.extend(result)

counter = 12

def sample(i):
    print(current_process())
    return df2.sample(5, random_state=i)

if __name__ == '__main__':
    pool = Pool(3)
    r = pool.map_async(sample, [i for i in range(3)], callback=collect)
    r.wait()
    for res in results:
        print(res)
Each child process can access the global variable df2. I'm not sure why the child processes cannot access data in the first block of code.
Solution
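When you start a process with multiprocessing, the new process gets its own copy of the parent's state as it was when the child was started. Changes made to ordinary variables afterwards in either process are not visible to the other.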
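The copy semantics can be illustrated with a minimal sketch. It assumes a POSIX system where the fork start method is available, and the names counter, bump and run are hypothetical, not taken from the question. The child sees the value the parent set before starting it, while the child's own increment never reaches the parent.

```python
import multiprocessing as mp

counter = 0  # module-level global, playing the role of `data` in the question


def bump(queue):
    """Runs in the child: modifies the child's private copy of `counter`."""
    global counter
    counter += 1
    queue.put(counter)  # report what the child sees back to the parent


def run():
    global counter
    counter = 41  # set in the parent before the child is started
    ctx = mp.get_context("fork")  # assumption: POSIX; fork copies the parent's memory
    queue = ctx.Queue()
    child = ctx.Process(target=bump, args=(queue,))
    child.start()
    seen_by_child = queue.get()  # read before join() so the child can flush its queue
    child.join()
    return seen_by_child, counter


if __name__ == "__main__":
    print(run())
```

Under these assumptions run() returns (42, 41): the child inherited 41 and incremented its own copy, and the parent's counter is unchanged.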
If you want to exchange data between the parent process and its children, or between sibling processes, you can use shared variables or a server process that manages shared objects. For details, see the "Sharing state between processes" section of the multiprocessing documentation.
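As a rough sketch of the shared-variable option, the following uses a multiprocessing Value that several children update. The function names add_to_total and run are hypothetical, and the fork start method is again an assumption (POSIX only).

```python
import multiprocessing as mp


def add_to_total(total, amount):
    """Runs in a child: updates the shared integer while holding its lock."""
    with total.get_lock():
        total.value += amount


def run():
    ctx = mp.get_context("fork")  # assumption: POSIX
    total = ctx.Value("i", 0)  # a C int allocated in shared memory
    children = [
        ctx.Process(target=add_to_total, args=(total, n)) for n in (1, 2, 3)
    ]
    for child in children:
        child.start()
    for child in children:
        child.join()
    return total.value  # the parent sees the children's updates


if __name__ == "__main__":
    print(run())
```

Unlike an ordinary global, the updates made in the children are visible in the parent, so run() returns 6. The server-process variant mentioned above corresponds to multiprocessing.Manager, which hands out proxies to objects such as lists and dicts.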
If you use threading instead, all threads run in the same process and share its global variables. You can therefore access every global variable from any thread and from the main thread without doing anything special.
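For comparison, here is a small threading sketch in which every thread appends to the same global list. The names results, results_lock, record and run are hypothetical; the lock is there only to keep the concurrent updates orderly.

```python
import threading

results = []  # one global list, shared by every thread in the process
results_lock = threading.Lock()


def record(i):
    """Runs in a worker thread: writes directly into the shared global."""
    with results_lock:
        results.append(i * i)


def run():
    results.clear()
    threads = [threading.Thread(target=record, args=(i,)) for i in range(4)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(results)


if __name__ == "__main__":
    print(run())
```

No queue or shared-memory wrapper is needed here: run() returns [0, 1, 4, 9] because all threads wrote to the one list in the parent's memory.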
Both threading and multiprocessing have their advantages and disadvantages, but this is not the place to discuss them.
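Separately, for the first script in the question, it may be worth checking whether the workers raised an exception rather than returning nothing. With map_async, the callback runs only if every task succeeds; a failure goes to error_callback if one is given, and otherwise stays hidden until r.get() is called. The sketch below is an assumption-laden stand-in, not the asker's code: a plain tuple SHAPE replaces data.shape so that only the standard library is needed, and run is a hypothetical helper. Calling the tuple mirrors data.shape() and raises TypeError in the worker.

```python
import multiprocessing as mp

SHAPE = (500, 1000)  # stands in for data.shape, which is a tuple attribute


def get_sample(i):
    if i < 10:
        SHAPE()  # same mistake as data.shape(): calling a tuple raises TypeError
    return i


def run(ntasks=20):
    collected, errors = [], []
    ctx = mp.get_context("fork")  # assumption: POSIX
    with ctx.Pool(2) as pool:
        r = pool.map_async(
            get_sample,
            range(ntasks),
            callback=collected.extend,     # called only if all tasks succeed
            error_callback=errors.append,  # called with the first exception otherwise
        )
        r.wait()  # wait() itself never raises; r.get() would re-raise the error
    return collected, errors


if __name__ == "__main__":
    collected, errors = run()
    print(collected, errors)
```

In this sketch collected stays empty and errors holds a single TypeError, which would be consistent with the empty samples list and the "No objects to concatenate" error reported in the question.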