python - 在 Python 中使用 statsmodels 和 PyMC3(MCMC 模拟)估计两个比例之间差异的 p 值
问题描述
在 Probabilistic-Programming-and-Bayesian-Methods-for-Hackers 中,提出了一种计算两个比例不同的 p 值的方法。
(您可以在此处找到包含整章 http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC2.ipynb的 jupyter notebook )
代码如下:
import pymc3 as pm
figsize(12, 4)
#these two quantities are unknown to us.
true_p_A = 0.05
true_p_B = 0.04
N_A = 1700
N_B = 1700
#generate some observations
observations_A = bernoulli.rvs(true_p_A, size=N_A)
observations_B = bernoulli.rvs(true_p_B, size=N_B)
print(np.mean(observations_A))
print(np.mean(observations_B))
0.04058823529411765
0.03411764705882353
# Set up the pymc3 model. Again assume Uniform priors for p_A and p_B.
with pm.Model() as model:
p_A = pm.Uniform("p_A", 0, 1)
p_B = pm.Uniform("p_B", 0, 1)
# Define the deterministic delta function. This is our unknown of interest.
delta = pm.Deterministic("delta", p_A - p_B)
# Set of observations, in this case we have two observation datasets.
obs_A = pm.Bernoulli("obs_A", p_A, observed=observations_A)
obs_B = pm.Bernoulli("obs_B", p_B, observed=observations_B)
# To be explained in chapter 3.
step = pm.Metropolis()
trace = pm.sample(20000, step=step)
burned_trace=trace[1000:]
p_A_samples = burned_trace["p_A"]
p_B_samples = burned_trace["p_B"]
delta_samples = burned_trace["delta"]
# Count the number of samples less than 0, i.e. the area under the curve
# before 0, represent the probability that site A is worse than site B.
print("Probability site A is WORSE than site B: %.3f" % \
np.mean(delta_samples < 0))
print("Probability site A is BETTER than site B: %.3f" % \
np.mean(delta_samples > 0))
Probability site A is WORSE than site B: 0.167
Probability site A is BETTER than site B: 0.833
但是,如果我们使用 计算 p 值statsmodels
,我们会得到一个非常不同的结果:
from scipy.stats import norm, chi2_contingency
import statsmodels.api as sm
s1 = int(1700 * 0.04058823529411765)
n1 = 1700
s2 = int(1700 * 0.03411764705882353)
n2 = 1700
p1 = s1/n1
p2 = s2/n2
p = (s1 + s2)/(n1+n2)
z = (p2-p1)/ ((p*(1-p)*((1/n1)+(1/n2)))**0.5)
z1, p_value1 = sm.stats.proportions_ztest([s1, s2], [n1, n2])
print('z1 is {0} and p is {1}'.format(z1, p))
z1 is 0.9948492584166934 and p is 0.03735294117647059
使用MCMC
,p 值似乎是 0.167,但使用statsmodels
,我们得到 0.037 的 p 值。
我怎么能理解这个?
解决方案
看起来您打印了错误的值。试试这个:
print('z1 is {0} and p is {1}'.format(z1, p_value1))
此外,如果您想测试假设 p_A > p_B 那么您应该将alternative
函数调用中的参数设置为larger
:
z1, p_value1 = sm.stats.proportions_ztest([s1, s2], [n1, n2], alternative='larger')
文档有更多关于如何使用它的示例。
推荐阅读
- javascript - 将 json 中的 json 显示为 2 个并排的立方体,而不是列表
- php - Magento 产品出口问题 1.9.3.6
- node.js - 如何将 socket.io 连接导出到另一个控制器
- python - 如何使用 Python 将 Amazon Ion 文件转换为 JSON 格式?
- powershell - Powershell:设置环境变量加倍
- julia - Juno IDE 中的 Julia 程序没有输出
- swift - Firebase 数据库无法识别儿童
- linux - 从特定日期获取 1 个月前
- git - 使用过滤器分支删除 Git LFS?(无法将未更改的文件重新添加到索引。)
- javascript - 在 Jinja 模板中的 JavaScript 中转义字符