Calling multiple H2O data frames as input in Python multiprocessing

Problem description

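I am trying to use multiprocessing in Python (version 3.6.8), where two tables are passed as input to a function. Inside the function, I fit a model with h2o. Here is my code: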

import pandas as pd
from h2o.estimators.gbm import H2OGradientBoostingEstimator

def innerFold(params, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold):

    # One row per parameter combination; converted to a DataFrame at the end
    rows = []

    for param in params:

        # Copy so the caller's dict is not mutated, then split off the index
        param = dict(param)
        counter = param.pop('counter')

        print('parameter combination: ', param)
        print('COUNTER: ', counter)

        # define model and fit (random_state is a seed defined at module level)
        gbm = H2OGradientBoostingEstimator(stopping_rounds = 5,
                                           stopping_metric = 'rmse',
                                           stopping_tolerance = 1e-4,
                                           seed = random_state,
                                           **param)

        print('TRAINING STARTS....')
        gbm.train(x = feats,
                  y = target,
                  training_frame = h2o_train_inner)

        # Score once on the inner test frame and record the result
        score = gbm.model_performance(h2o_test_inner).r2()
        rows.append({'outer_fold': int(outer_fold),
                     'inner_fold': int(inner_fold),
                     'score': score,
                     'param_idx': int(counter)})

    return pd.DataFrame(rows)

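I split the 'params' argument into 5 chunks so that 5 different cores can process them; the remaining arguments should be the same for every call. To do this, I split the parameter combinations as follows: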

import numpy as np

# Split once and unpack the five chunks
df0, df1, df2, df3, df4 = np.array_split(param_combs, 5)

I then pass the chunks to the function as follows:

args = [(df0, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold), 
        (df1, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold), 
        (df2, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold),
        (df3, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold),
        (df4, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold)]

Here feats, target, h2o_train_inner, h2o_test_inner, outer_fold, and inner_fold are, respectively, a list (containing dictionaries), a string, an H2O data frame, an H2O data frame, an int, and an int.
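The five chunks and their argument tuples can also be built in a loop rather than written out by hand. The following is only a sketch: split_evenly is a hypothetical pure-Python helper that produces the same chunk sizes as np.array_split, and the parameter grid and the shared values are placeholders, not the real data.

```python
def split_evenly(items, n_chunks):
    """Split items into n_chunks lists whose lengths differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Hypothetical parameter grid; each dict carries its own 'counter' index.
param_combs = [{"counter": i, "max_depth": 3 + i % 3} for i in range(12)]

# Placeholder values standing in for feats, target, the two frames,
# outer_fold and inner_fold.
shared = ("feats", "target", "train_frame", "test_frame", 0, 0)

# One argument tuple per worker: its own chunk plus the shared arguments.
args = [(chunk, *shared) for chunk in split_evenly(param_combs, 5)]

print([len(a[0]) for a in args])
```

Building args this way keeps the number of workers in one place, so changing 5 to another value does not require editing every tuple.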

Finally, I start the processes as follows:

import multiprocessing as mp

p = mp.Pool(processes=5)
pool_results = p.starmap(innerFold, args)

I get:

TypeError: () missing 1 required positional argument: 'keyvals'

The number of arguments seems fine. What am I missing here?

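Edit: Apparently the problem stems from the H2O data frames. If I convert them to pandas DataFrames, it works. Any idea how to use the H2O data frames directly?

EDIT2: As far as I understand, the arguments sent to the function (e.g. innerFold above) are pickled. Since h2o objects cannot be pickled, the function only works after the frames are converted to pandas DataFrames.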
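The pickling point can be checked without h2o at all. The sketch below assumes nothing about H2O's internals: FrameHandle is a hypothetical stand-in for any object that cannot be pickled, and is_picklable is an illustrative helper. It shows why a pool argument has to survive pickle.dumps, and why the plain data behind such an object can be sent to a worker while the object itself cannot.

```python
import pickle


class FrameHandle:
    """Hypothetical stand-in for an object that cannot be pickled."""

    def __init__(self, rows):
        self.rows = rows
        # A lambda stored on the instance makes default pickling fail.
        self.transform = lambda value: value


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, as a pool argument must."""
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True


rows = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}]
handle = FrameHandle(rows)

print(is_picklable(handle))       # the handle itself cannot be sent to a worker
print(is_picklable(handle.rows))  # the plain data it wraps can
```

Running the same check on each element of an args tuple before calling starmap would show which argument is the one that fails.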

Tags: python-3.x, multiprocessing, h2o
