首页 > 解决方案 > 在 pymc3 逻辑回归中使用 NUTS 调试收敛和慢速采样问题

问题描述

我在同一主题的GitHub 问题上交叉发布。

我正在尝试确定 pymc 中默认 NUTS 算法在 150 万个观测值和 500 个协变量(1500000 x 500 设计矩阵)的数据集上拟合标准逻辑模型的慢采样率和收敛问题的原因。

下面是一些创建这种大小的玩具数据集的代码:

import numpy as np
covariates = np.random.randn(1500000, 499)
covariates = np.hstack((np.ones((1500000, 1)), covariates))

true_coefficients = 5 * np.random.rand(500)
true_logits = np.dot(covariates, true_coefficients) 
true_probs = 1.0 / (1.0 + np.exp(-true_logits))
observed_labels = (np.random.rand(1500000) < true_probs).astype(np.int32)

从这里很容易创建一个statsmodelssklearn.linear_model.LogisticRegression模型拟合,例如使用以下代码:

from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression(max_iter=1000, fit_intercept=False, verbose=1)
logistic.fit(covariates, observed_labels)
# above takes 3-5 minutes on my machine, appears to converge well.

import matplotlib.pyplot as plt
plt.scatter(true_coefficients, logistic.coef_[0, :])
plt.show()
# above scatter plot shows good accuracy in the point estimate.

即使对于完整的数据集,sklearn 模型也能相当快地收敛,并产生一个与真实系数相比具有良好精度的解参数向量。

但是使用下面的 pymc 代码,无论是使用 NUTS 还是简单的 Metropolis 采样器,我都经常看到无法接受的缓慢采样时间。我尝试更改两者的许多默认设置,对数据进行二次采样以及各种其他技巧,但没有运气。

import pymc3 as pm
import theano.tensor as tt

with pm.Model() as logistic_model:
    beta = pm.Normal('beta', 0.0, sd=3.0, shape=500)
    p = 1.0 / (1.0 + tt.exp(-tt.dot(covariates, beta)))
    likelihood = pm.Bernoulli('likelihood', p, observed=observed_labels)

with logistic_model:
    tr = pm.sample(1000, njobs=2, nchains=2)

# In the sample step above I have also tried all types of variations on using
# step = pm.NUTS(), step = pm.Metropolis(), start = pm.find_MAP(),
# start = {'beta': np.zeros(500)}, and adjustment of all sorts of options with
# kwargs to these steppers.

# I have also tried wrapping `covariates` and `observed_labels` with pm.Minibatch
# with batch sizes ranging from 1000 to 200000 observations per sample. With NUTS,
# the minibatches made it run drastically slower, over 40 seconds per sample.

对于 NUTS 算法,我发现如果我不手动将起始值设置为零向量或使用pm.find_MAP(这在 pymc 文档中是不鼓励的),否则我会在初始化过程中得到一个“坏能量”错误:

RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/espears/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 73, in run
    self._start_loop()
  File "/Users/espears/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 113, in _start_loop
    point, stats = self._compute_point()
  File "/Users/espears/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 139, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/Users/espears/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/step_methods/arraystep.py", line 247, in step
    apoint, stats = self.astep(array)
  File "/Users/espears/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 117, in astep
    'might be misspecified.' % start.energy)
ValueError: Bad initial energy: inf. The model might be misspecified.
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
ValueError: Bad initial energy: inf. The model might be misspecified.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
~/programming/pymc-testing/logistic_example.py in <module>()
     45 with logistic_model:
     46     #step, start = pm.Metropolis(), {'beta': np.zeros(num_coefficients)}
---> 47     tr = pm.sample(1000, njobs=2, nchains=2)
     48 print(f"{time.time() - st}: finished_sampling...")
     49

~/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/sampling.py in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, nuts_kwargs, step_kwargs, progressbar, model, random_seed, live_plot, discard_tuned_samples, live_plot_kwargs, compute_convergence_checks, use_mmap, **kwargs)
    449             _print_step_hierarchy(step)
    450             try:
--> 451                 trace = _mp_sample(**sample_args)
    452             except pickle.PickleError:
    453                 _log.warning("Could not pickle model, sampling singlethreaded.")

~/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/sampling.py in _mp_sample(draws, tune, step, chains, cores, chain, random_seed, start, progressbar, trace, model, use_mmap, **kwargs)
    999         try:
   1000             with sampler:
-> 1001                 for draw in sampler:
   1002                     trace = traces[draw.chain - chain]
   1003                     if trace.supports_sampler_stats and draw.stats is not None:

~/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/parallel_sampling.py in __iter__(self)
    303
    304         while self._active:
--> 305             draw = ProcessAdapter.recv_draw(self._active)
    306             proc, is_last, draw, tuning, stats, warns = draw
    307             if self._progress is not None:

~/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/parallel_sampling.py in recv_draw(processes, timeout)
    221         if msg[0] == 'error':
    222             old = msg[1]
--> 223             six.raise_from(RuntimeError('Chain %s failed.' % proc.chain), old)
    224         elif msg[0] == 'writing_done':
    225             proc._readable = True

~/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/six.py in raise_from(value, from_value)

RuntimeError: Chain 0 failed.

这对于 NUTS 或简单的 Metropolis 算法来说都是非常出乎意料的,因为对数似然函数的梯度对于这个模型来说非常简单,我使用了一个合理且信息丰富的先验(设置一个小的标准差为 3),并且数据1500000 x 500 的尺寸非常小,可以舒适地放入内存中,并且不会导致标准统计包(如 statsmodels 或 scikit-learn )出现问题(它们使用二阶导数等,与 NUTS 非常相似)。

我正在寻找有关需要哪些采样器设置来获得完整数据的快速采样率的指导,以便我可以采样更长的时间以进行收敛。

标签: pythonoptimizationstatisticspymc3mcmc

解决方案


推荐阅读