How to port a model (GLM) from h2o to scikit-learn?

Problem description

I am trying to train an ML algorithm to predict some data (real numbers).

Using h2o AutoML, I found a model that predicts my variable almost perfectly (maximum error < 0.15% over 16k+ test observations). The leader model is a GLM, and I can inspect its internals via the h2o Python API.

Now I would like to reproduce that model with scikit-learn and pandas, since those are the libraries I use heavily in the rest of the project.

Can anyone here help me with this port, so that I won't need h2o in the future?

Here is what the h2o model looks like:

Model Details
=============
H2OGeneralizedLinearEstimator :  Generalized Linear Modeling
Model Key:  GLM_1_AutoML_20200611_172640


GLM Model: summary

        family  link    regularization  lambda_search   number_of_predictors_total  number_of_active_predictors     number_of_iterations    training_frame
0       gaussian    identity    Ridge ( lambda = 2.52E-5 )  nlambda = 30, lambda.max = 15.647, lambda.min = 7.073E-4, lambda.1...   32  32  30  automl_training_py_2_sid_b7bf



ModelMetricsRegressionGLM: glm
** Reported on train data. **

MSE: 3.339037335446298e-07
RMSE: 0.0005778440391183678
MAE: 0.00034964760501650187
RMSLE: 0.000428181614223859
R^2: 0.9999869058722919
Mean Residual Deviance: 3.339037335446298e-07
Null degrees of freedom: 5470919
Residual degrees of freedom: 5470887
Null deviance: 139509.91273673013
Residual deviance: 1.826760613923986
AIC: -66058752.72303456

ModelMetricsRegressionGLM: glm
** Reported on cross-validation data. **

MSE: 3.8803522694657527e-07
RMSE: 0.0006229247361813266
MAE: 0.0003746132971400872
RMSLE: 0.00046256895714739585
R^2: 0.9999847830907341
Mean Residual Deviance: 3.8803522694657527e-07
Null degrees of freedom: 5470919
Residual degrees of freedom: 5470887
Null deviance: 139510.0902560055
Residual deviance: 2.1229096838065575
AIC: -65236783.11156034

Cross-Validation Metrics Summary: 

        mean    sd  cv_1_valid  cv_2_valid  cv_3_valid  cv_4_valid  cv_5_valid
0   mae     3.6770012E-4    4.1263074E-6    3.6505258E-4    3.706906E-4     3.6763187E-4    3.7266721E-4    3.6245832E-4
1   mean_residual_deviance  3.7148237E-7    6.667412E-9     3.680999E-7     3.7594532E-7    3.6979083E-7    3.802558E-7     3.6332003E-7
2   mse     3.7148237E-7    6.667412E-9     3.680999E-7     3.7594532E-7    3.6979083E-7    3.802558E-7     3.6332003E-7
3   null_deviance   27902.018   24.152164   27880.328   27876.885   27900.943   27932.406   27919.527
4   r2  0.99998546  2.5938294E-7    0.9999856   0.9999852   0.9999855   0.9999851   0.99998575
5   residual_deviance   0.4064701   0.007295375     0.40276903  0.41135335  0.40461922  0.41606984  0.397539
6   rmse    6.094739E-4     5.466305E-6     6.0671236E-4    6.131438E-4     6.081043E-4     6.166489E-4     6.0276035E-4
7   rmsle   4.5219096E-4    3.5735677E-6    4.508027E-4     4.5445774E-4    4.5059595E-4    4.5708878E-4    4.4800964E-4


Scoring History: 

        timestamp   duration    iteration   lambda  predictors  deviance_train  deviance_test   deviance_xval   deviance_se
0       2020-06-11 17:29:06     0.000 sec   1   .16E2   33  0.014121    NaN     0.015569    7.637899e-06
1       2020-06-11 17:29:07     0.596 sec   2   .97E1   33  0.010911    NaN     0.012416    7.202663e-06
2       2020-06-11 17:29:08     1.141 sec   3   .6E1    33  0.007864    NaN     0.009245    6.969671e-06
3       2020-06-11 17:29:08     1.748 sec   4   .37E1   33  0.005319    NaN     0.006438    6.863783e-06
4       2020-06-11 17:29:09     2.294 sec   5   .23E1   33  0.003427    NaN     0.004231    6.587463e-06
5       2020-06-11 17:29:09     2.851 sec   6   .14E1   33  0.002145    NaN     0.002679    5.981752e-06
6       2020-06-11 17:29:10     3.437 sec   7   .9E0    33  0.001332    NaN     0.001665    5.137335e-06
7       2020-06-11 17:29:10     3.983 sec   8   .56E0   33  0.000835    NaN     0.001036    4.157958e-06
8       2020-06-11 17:29:11     4.563 sec   9   .35E0   33  0.000531    NaN     0.000654    3.186066e-06
9       2020-06-11 17:29:12     5.165 sec   10  .21E0   33  0.000343    NaN     0.000417    2.279007e-06
10      2020-06-11 17:29:12     5.690 sec   11  .13E0   33  0.000220    NaN     0.000269    1.508395e-06
11      2020-06-11 17:29:13     6.277 sec   12  .83E-1  33  0.000141    NaN     0.000174    9.034596e-07
12      2020-06-11 17:29:13     6.853 sec   13  .51E-1  33  0.000090    NaN     0.000111    4.893595e-07
13      2020-06-11 17:29:14     7.396 sec   14  .32E-1  33  0.000057    NaN     0.000071    2.314916e-07
14      2020-06-11 17:29:14     7.990 sec   15  .2E-1   33  0.000035    NaN     0.000044    1.386863e-07
15      2020-06-11 17:29:15     8.560 sec   16  .12E-1  33  0.000021    NaN     0.000027    3.953892e-08
16      2020-06-11 17:29:16     9.146 sec   17  .77E-2  33  0.000012    NaN     0.000016    3.271502e-08
17      2020-06-11 17:29:16     9.753 sec   18  .48E-2  33  0.000007    NaN     0.000009    3.171957e-08
18      2020-06-11 17:29:17     10.309 sec  19  .3E-2   33  0.000004    NaN     0.000005    1.575243e-08
19      2020-06-11 17:29:17     10.875 sec  20  .18E-2  33  0.000002    NaN     0.000003    1.565016e-08


See the whole table with table.as_data_frame()

It looks like I should be able to get the same results with Ridge or Lasso (or maybe Tweedie?), but I have tried several parameter settings and got poor results.
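Since the summary above reports a Gaussian family with identity link and a Ridge penalty (lambda ≈ 2.52E-5), the closest scikit-learn analogue would be `sklearn.linear_model.Ridge`. Two differences likely explain the poor results: h2o standardizes predictors by default, and h2o's objective divides the residual sum of squares by the number of observations, so its lambda must be rescaled (alpha ≈ N × lambda) before being passed to sklearn. A minimal sketch, with synthetic placeholder data standing in for the real training frame:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real training frame (32 predictors, as in
# the h2o summary); replace with your own data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))
y = X @ rng.normal(size=32) + 0.01 * rng.normal(size=1000)

# h2o minimizes RSS/(2N) + lambda * ||beta||^2 / 2, while sklearn's Ridge
# minimizes ||y - Xw||^2 + alpha * ||w||^2; multiplying h2o's objective by
# 2N shows alpha ~= N * lambda.
h2o_lambda = 2.52e-5            # lambda from the h2o model summary
alpha = X.shape[0] * h2o_lambda

# h2o standardizes predictors by default (standardize=True), so mirror that.
model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
model.fit(X, y)
print(model.score(X, y))        # R^2 on the training data
```

This only approximates the h2o fit (h2o's lambda search and solver differ), but it should land in the same neighborhood rather than giving "poor results".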

Can anyone help? I have read the h2o docs and the scikit-learn docs, but I don't know how to proceed.

Tags: machine-learning, scikit-learn, linear-regression, h2o

Solution


This is related and has already been answered here: https://stackoverflow.com/a/68370871/17441922

Sklearn is based on Python/Cython/C, while H2O uses Java, and the underlying algorithms may also differ. However, you can try to match/translate your hyperparameters between the two, since they will be similar.
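Alternatively, since a Gaussian GLM with identity link is just a linear function, you can skip refitting entirely: export the fitted coefficients once from the h2o model (e.g. via `leader_model.coef()`, which returns a dict including an `'Intercept'` entry) and evaluate them with pandas/numpy at prediction time. A sketch, with hypothetical coefficient values standing in for the exported ones:

```python
import numpy as np
import pandas as pd

# Exported once from the trained h2o model:
#   coefs = leader_model.coef()   # {'Intercept': ..., 'x1': ..., ...}
# Hypothetical values shown here for illustration.
coefs = {"Intercept": 0.5, "x1": 1.2, "x2": -0.7}

def predict(df: pd.DataFrame, coefs: dict) -> np.ndarray:
    """Apply a Gaussian identity-link GLM: y_hat = intercept + X @ beta."""
    beta = pd.Series({k: v for k, v in coefs.items() if k != "Intercept"})
    return coefs["Intercept"] + df[beta.index].to_numpy() @ beta.to_numpy()

df = pd.DataFrame({"x1": [1.0, 2.0], "x2": [0.0, 1.0]})
print(predict(df, coefs))   # intercept + x1*1.2 + x2*(-0.7) per row
```

This reproduces the h2o model's predictions exactly (up to floating-point rounding) with no h2o dependency, provided the same feature preprocessing is applied.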

