Python multiprocessing: child process cannot access global variable

Problem description

I created a global variable holding a pandas DataFrame. I expected the child processes to be able to access the global DataFrame, but it seems that they cannot.

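import multiprocessing as mp

import numpy as np
import pandas as pd
from psutil import cpu_count  # assumed source of cpu_count(logical=True)

data = pd.DataFrame(data = np.array([[i for i in range(1000)] for j in range(500)]))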

def get_sample(i):
    print("start round {}".format(i))
    sample = data.sample(500, random_state=i)
    xs = sample.sum(axis=0)
    if i < 10:
        print(data.shape())
        print(sample.iloc[:3, :3])
    print("round {} returns output".format(i))
    return xs

samples = []
def collect(result):
    print("collect called with {}".format(result[0][0].shape))
    global samples
    samples.extend(result)

ntasks = 1000
if __name__=='__main__':
    samples = []
    xs = pd.DataFrame()
    """sampling"""
    pool = mp.Pool(cpu_count(logical=True))
    print("start sampling, total round = {}".format(ntasks))
    r = pool.map_async(get_sample, [j for j in range(ntasks)], callback=collect)
    r.wait()
    pool.close()
    pool.join()

    xs = pd.concat([sample for sample in samples], axis = 1, ignore_index=True)
    xs = xs.transpose()

    print("xs: ")
    print(xs.shape)
    print(xs.iloc[:10, :10])

The global DataFrame is data. I expected that in each child process the function get_sample could access data and retrieve some values from it. To make sure the child processes can see data, I print its shape in each child process. The problem is that the child processes apparently cannot get data: when I run the script, neither the shape of data nor the partial sample is printed.

Furthermore, I received this error:

Traceback (most recent call last):
  File "sampling2c.py", line 51, in <module>
    xs = pd.concat([sample for sample in samples], axis = 1, ignore_index=True)
  File "/usr/usc/python/3.6.0/lib/python3.6/site-packages/pandas/tools/merge.py", line 1451, in concat
    copy=copy)
  File "/usr/usc/python/3.6.0/lib/python3.6/site-packages/pandas/tools/merge.py", line 1484, in __init__
    raise ValueError('No objects to concatenate')

It seems the get_sample function didn't return anything, so the sampling failed.

However, when I ran an experiment to test whether child processes can access a global variable, it worked.

from multiprocessing import Manager, Pool, current_process

import pandas as pd

df = pd.DataFrame(data = {'a':[1,2,3], 'b':[2,4,6]})
df['c1'] = [1,2,1]
df['c2'] = [2,1,2]
df['c3'] = [3,4,4]

df2 = pd.DataFrame(data = {'a':[i for i in range(100)], 'b':[i for i in range(100, 200)]})
l = [1, 2, 3]
Mgr = Manager()
results = []
def collect(result):
    global results
    #print("collect called with {}".format(result))
    results.extend(result)

counter = 12
def sample(i):
    print(current_process())
    return df2.sample(5, random_state = i)

if __name__=='__main__':
    pool = Pool(3)
    r = pool.map_async(sample, [i for i in range(3)], callback = collect) #callback = collect
    r.wait()
    # Print inside the guard so that only the parent process reports the results.
    for res in results:
        print(res)

Here each child process can access the global variable df2. I'm not sure why the child processes cannot access data in the first block of code.

Tags: python, multiprocessing

Solution


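When you start a process using multiprocessing, the new process gets its own copy of the program state as it was at the time the process was started. Changes made to a global variable afterwards, in either the parent or the child, are not visible to the other process.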
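The copy semantics described above can be illustrated with a minimal sketch. The names counter, increment_in_child and run_demo are hypothetical and exist only for this example, and the code assumes a POSIX system where the fork start method is available.

```python
import multiprocessing as mp

counter = 0  # module-level global (hypothetical name, for illustration only)


def increment_in_child(queue):
    """Runs in the child: modify the child's copy of the global and report it."""
    global counter
    counter += 100
    queue.put(counter)


def run_demo():
    """Return (value seen by the child, value in the parent afterwards)."""
    global counter
    counter = 1  # set before the child starts, so the child's copy starts at 1
    ctx = mp.get_context("fork")  # assumption: fork is available (Linux/macOS)
    queue = ctx.Queue()
    child = ctx.Process(target=increment_in_child, args=(queue,))
    child.start()
    child_value = queue.get()  # read the result before joining the child
    child.join()
    return child_value, counter


if __name__ == "__main__":
    print(run_demo())  # (101, 1): the child changed only its own copy
```

The child can read the value the parent set before starting it, but its increment never reaches the parent's counter.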

If you want to exchange data with the parent process or with sibling processes, you can use shared variables or a server process that manages shared objects. For details, see the "Sharing state between processes" section of the multiprocessing documentation.
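As a rough sketch of the shared-variable approach, the example below uses multiprocessing's Value type so that several child processes update one integer that the parent can read afterwards. The function names add_to_total and run_shared_demo are hypothetical, and the fork start method is again an assumption made for this illustration.

```python
import multiprocessing as mp


def add_to_total(total, amount):
    """Runs in a child: add amount to the shared integer under its lock."""
    with total.get_lock():
        total.value += amount


def run_shared_demo():
    """Start four children that each add to one shared integer."""
    ctx = mp.get_context("fork")  # assumption: fork is available (Linux/macOS)
    total = ctx.Value("i", 0)  # shared C int, initially 0, with a built-in lock
    workers = [
        ctx.Process(target=add_to_total, args=(total, n)) for n in range(1, 5)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return total.value


if __name__ == "__main__":
    print(run_shared_demo())  # 10, i.e. 1 + 2 + 3 + 4
```

The shared object is passed to each child explicitly as an argument, and the lock keeps the read-modify-write of total.value from interleaving between processes.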

If you use threading instead, all threads run in the same process and share all global variables. You can therefore access every global variable from any thread, including the main thread, without having to do anything special.
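For comparison, a small sketch with threads: every thread appends to the same global list, and the main thread sees all of the results. The names results, record and run_thread_demo are hypothetical, and the lock is a precaution added for this example rather than something the text above requires.

```python
import threading

results = []  # global list shared by all threads (hypothetical name)
results_lock = threading.Lock()


def record(i):
    """Runs in a worker thread: append to the shared global list."""
    with results_lock:
        results.append(i * i)


def run_thread_demo():
    """Start four threads and return what they wrote to the global list."""
    results.clear()
    threads = [threading.Thread(target=record, args=(i,)) for i in range(4)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(results)


if __name__ == "__main__":
    print(run_thread_demo())  # [0, 1, 4, 9]
```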

Both threading and multiprocessing have their advantages and disadvantages, but this is not the place to discuss them.
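Returning to the first code block in the question, one likely reason that nothing is collected is unrelated to global variables: DataFrame.shape is a tuple attribute, so data.shape() raises a TypeError inside the worker for i < 10. With map_async, the callback is only called when every task succeeds, and r.wait() does not re-raise worker exceptions, so samples stays empty. The sketch below reproduces this with a plain tuple standing in for data.shape; the names data_shape and run_error_demo are hypothetical, and the fork start method is an assumption.

```python
import multiprocessing as mp

data_shape = (500, 1000)  # stand-in for data.shape, which is a plain tuple


def get_sample(i):
    """Runs in a worker: repeats the mistake of calling a tuple."""
    if i < 10:
        print(data_shape())  # raises TypeError: 'tuple' object is not callable
    return i


def run_error_demo():
    """Return (results passed to callback, names of errors passed to error_callback)."""
    collected, errors = [], []
    ctx = mp.get_context("fork")  # assumption: fork is available (Linux/macOS)
    with ctx.Pool(2) as pool:
        r = pool.map_async(
            get_sample,
            range(20),
            callback=collected.extend,     # called only if all tasks succeed
            error_callback=errors.append,  # called with the first exception
        )
        r.wait()  # returns normally even though a task failed
    return collected, [type(error).__name__ for error in errors]


if __name__ == "__main__":
    print(run_error_demo())  # ([], ['TypeError'])
```

Passing an error_callback, or calling r.get() instead of r.wait(), makes such failures visible in the parent. In the question's code, printing data.shape without the parentheses would avoid the exception in the first place.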

