python - 在 pymc3 逻辑回归中使用 NUTS 调试收敛和慢速采样问题
问题描述
我在同一主题的GitHub 问题上交叉发布。
我正在尝试确定 pymc 中默认 NUTS 算法在 150 万个观测值和 500 个协变量(1500000 x 500 设计矩阵)的数据集上拟合标准逻辑模型的慢采样率和收敛问题的原因。
下面是一些创建这种大小的玩具数据集的代码:
import numpy as np
covariates = np.random.randn(1500000, 499)
covariates = np.hstack((np.ones((1500000, 1)), covariates))
true_coefficients = 5 * np.random.rand(500)
true_logits = np.dot(covariates, true_coefficients)
true_probs = 1.0 / (1.0 + np.exp(-true_logits))
observed_labels = (np.random.rand(1500000) < true_probs).astype(np.int32)
从这里很容易创建一个statsmodels
或sklearn.linear_model.LogisticRegression
模型拟合,例如使用以下代码:
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression(max_iter=1000, fit_intercept=False, verbose=1)
logistic.fit(covariates, observed_labels)
# above takes 3-5 minutes on my machine, appears to converge well.
import matplotlib.pyplot as plt
plt.scatter(true_coefficients, logistic.coef_[0, :])
plt.show()
# above scatter plot shows good accuracy in the point estimate.
即使对于完整的数据集,sklearn 模型也能相当快地收敛,并产生一个与真实系数相比具有良好精度的解参数向量。
但是使用下面的 pymc 代码,无论是使用 NUTS 还是简单的 Metropolis 采样器,我都经常看到无法接受的缓慢采样时间。我尝试更改两者的许多默认设置,对数据进行二次采样以及各种其他技巧,但没有运气。
import pymc3 as pm
import theano.tensor as tt
with pm.Model() as logistic_model:
beta = pm.Normal('beta', 0.0, sd=3.0, shape=500)
p = 1.0 / (1.0 + tt.exp(-tt.dot(covariates, beta)))
likelihood = pm.Bernoulli('likelihood', p, observed=observed_labels)
with logistic_model:
tr = pm.sample(1000, njobs=2, nchains=2)
# In the sample step above I have also tried all types of variations on using
# step = pm.NUTS(), step = pm.Metropolis(), start = pm.find_MAP(),
# start = {'beta': np.zeros(500)}, and adjustment of all sorts of options with
# kwargs to these steppers.
# I have also tried wrapping `covariates` and `observed_labels` with pm.Minibatch
# with batch sizes ranging from 1000 to 200000 observations per sample. With NUTS,
# the minibatches made it run drastically slower, over 40 seconds per sample.
对于 NUTS 算法,我发现如果我不手动将起始值设置为零向量或使用pm.find_MAP
(这在 pymc 文档中是不鼓励的),否则我会在初始化过程中得到一个“坏能量”错误:
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/espears/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 73, in run
self._start_loop()
File "/Users/espears/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 113, in _start_loop
point, stats = self._compute_point()
File "/Users/espears/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 139, in _compute_point
point, stats = self._step_method.step(self._point)
File "/Users/espears/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/step_methods/arraystep.py", line 247, in step
apoint, stats = self.astep(array)
File "/Users/espears/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 117, in astep
'might be misspecified.' % start.energy)
ValueError: Bad initial energy: inf. The model might be misspecified.
"""
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
ValueError: Bad initial energy: inf. The model might be misspecified.
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
~/programming/pymc-testing/logistic_example.py in <module>()
45 with logistic_model:
46 #step, start = pm.Metropolis(), {'beta': np.zeros(num_coefficients)}
---> 47 tr = pm.sample(1000, njobs=2, nchains=2)
48 print(f"{time.time() - st}: finished_sampling...")
49
~/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/sampling.py in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, nuts_kwargs, step_kwargs, progressbar, model, random_seed, live_plot, discard_tuned_samples, live_plot_kwargs, compute_convergence_checks, use_mmap, **kwargs)
449 _print_step_hierarchy(step)
450 try:
--> 451 trace = _mp_sample(**sample_args)
452 except pickle.PickleError:
453 _log.warning("Could not pickle model, sampling singlethreaded.")
~/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/sampling.py in _mp_sample(draws, tune, step, chains, cores, chain, random_seed, start, progressbar, trace, model, use_mmap, **kwargs)
999 try:
1000 with sampler:
-> 1001 for draw in sampler:
1002 trace = traces[draw.chain - chain]
1003 if trace.supports_sampler_stats and draw.stats is not None:
~/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/parallel_sampling.py in __iter__(self)
303
304 while self._active:
--> 305 draw = ProcessAdapter.recv_draw(self._active)
306 proc, is_last, draw, tuning, stats, warns = draw
307 if self._progress is not None:
~/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/pymc3/parallel_sampling.py in recv_draw(processes, timeout)
221 if msg[0] == 'error':
222 old = msg[1]
--> 223 six.raise_from(RuntimeError('Chain %s failed.' % proc.chain), old)
224 elif msg[0] == 'writing_done':
225 proc._readable = True
~/anaconda3/envs/py36-pymc/lib/python3.6/site-packages/six.py in raise_from(value, from_value)
RuntimeError: Chain 0 failed.
这对于 NUTS 或简单的 Metropolis 算法来说都是非常出乎意料的,因为对数似然函数的梯度对于这个模型来说非常简单,我使用了一个合理且信息丰富的先验(设置一个小的标准差为 3),并且数据1500000 x 500 的尺寸非常小,可以舒适地放入内存中,并且不会导致标准统计包(如 statsmodels 或 scikit-learn )出现问题(它们使用二阶导数等,与 NUTS 非常相似)。
我正在寻找有关需要哪些采样器设置来获得完整数据的快速采样率的指导,以便我可以采样更长的时间以进行收敛。
解决方案
推荐阅读
- zoom-sdk - 为什么 Zoom websdk 在 FF 上不起作用?有解决方法吗?
- python - 来自 S3 存储桶的 Fastai 输入图像
- java - 当我将一些 java 代码从另一个在线 IDE 复制到 Visual Studio 代码时,它显示“未解决的编译错误”
- python - 错误 ValueError: y 应该是一维数组,取而代之的是一个形状为 (1, 4) 的数组。使用 RASA 解释器解析文本时
- android - Android / IOS 单个应用程序可以同时具有订阅和完整购买方式吗?
- json - NoSuchMethodError:在 null 上调用了方法 '[]' - Flutter
- php - Woocommerce 按可用性和 sku 订购产品
- java - 为什么java并行编程显示出不可预知的结果?
- powershell - 类型分配的添加成员怪异
- web - 将网站 url 粘贴到 MS Teams 时使用图像加载网站特定内容