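python - Making predictions with Bayesian logistic regression in pymc3
Question
I am trying to run a Bayesian logistic regression with pymc3, but I run into a problem when I use the model to make predictions.
Data:
My dataset is home loan default data. A sample looks like this: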
BAD  LOAN  MORTDUE  VALUE   REASON   JOB     YOJ  DEROG  DELINQ  CLAGE       NINQ  CLNO  DEBTINC
1    1700  30548    40320   HomeImp  Other   9    0      0       101.466002  1     8     37.113614
1    1800  28502    43034   HomeImp  Other   11   0      0       88.766030   0     8     36.884894
0    2300  102370   120953  HomeImp  Office  2    0      0       90.992533   0     13    31.588503
Problem:
I want to make predictions on the test dataset. One way to do this is with a shared variable:
X_shared = theano.shared(X_train)
with pm.Model() as logistic_model_pred:
    pm.glm.GLM(x=X_shared,
               y=y_train,
               labels=labels,
               family=pm.glm.families.Binomial())
X_shared.set_value(X_test)
ppc = pm.sample_ppc(pred_trace,
                    model=logistic_model_pred,
                    samples=100)
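However, the code above (with the theano shared variable) fails with the following error:
Error message:
AsTensorError: ('Variable type field must be a TensorType.', <Generic>, <theano.gof.type.Generic object at 0x00000216ABB16730>)
Possible solution:
The following code does avoid the error, but I do not know how to use the same model on the test data.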
with pm.Model() as logistic_model_pred:
    pm.glm.GLM.from_formula('BAD ~ DELINQ + DEROG + DEBTINC + NINQ + CLNO + VALUE + MORTDUE + YOJ + LOAN + CLAGE + JOB',
                            # reset both indexes so the rows stay aligned after the split
                            data=pd.concat([y_train.reset_index(drop=True),
                                            X_train.reset_index(drop=True)], axis=1),
                            family=pm.glm.families.Binomial())
    pred_trace = pm.sample(tune=1500,
                           draws=1000,
                           chains=4,
                           cores=1,
                           init='adapt_diag')
Full code:
%matplotlib inline
from pathlib import Path
import pickle
from collections import OrderedDict
import pandas as pd
import numpy as np
from scipy import stats
import multiprocessing
import arviz as az
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import (roc_curve, roc_auc_score, confusion_matrix, accuracy_score, f1_score,
                             precision_recall_curve, balanced_accuracy_score)
from mlxtend.plotting import plot_confusion_matrix
import theano
import pymc3 as pm
from pymc3.variational.callbacks import CheckParametersConvergence
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
from IPython.display import HTML
import sys
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
# initialise the data as a dict of lists.
data = {'BAD': [1,1,0,1,0,0,0,1,1,0,0,1,0,0,1,0,1],
        'LOAN': [1700,1800,2300,2400,2400,2900,2900,2900,2900,
                 3000,3200,3300,3600,3600,3700,3800,3900],
        'MORTDUE': [30548,28502,102370,34863,98449,103949,104373,7750,61962,104570,
                    74864,130518,100693,52337,17857,51180,29896],
        'VALUE': [40320,43034,120953,47471,117195,112505,120702,67996,70915,121729,
                  87266,164317,114743,63989,21144,63459,45960],
        'REASON': ['HomeImp','HomeImp','HomeImp','HomeImp','HomeImp',
                   'HomeImp','HomeImp','HomeImp',
                   'DebtCon','HomeImp','HomeImp','DebtCon','HomeImp','HomeImp',
                   'HomeImp','HomeImp','HomeImp'],
        'JOB': ['Other','Other','Office','Mgr','Office','Office','Office',
                'Other','Mgr','Office','ProfExe',
                'Other','Office','Office','Other','Office','Other'],
        'YOJ': [9,11,2,12,4,1,2,16,2,2,7,9,6,20,5,20,11],
        'DEROG': [0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0],
        'DELINQ': [0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0],
        'CLAGE': [101.4660019,88.76602988,90.99253347,70.49108003,
                  93.81177486,96.10232967,101.5402975,
                  122.2046628,282.8016592,85.8843719,250.6312692,
                  192.289149,88.47045214,204.2724988,
                  129.7173231,203.7515336,146.1232422],
        'NINQ': [1,0,0,1,0,0,0,2,3,0,0,0,0,0,1,0,0],
        'CLNO': [8,8,13,21,13,13,13,8,37,14,12,33,14,20,9,20,14],
        'DEBTINC': [37.11361356,36.88489409,31.58850318,38.26360073,
                    29.68182705,30.05113629,29.91585903,
                    36.211348,49.20639579,32.05978327,42.90999735,
                    35.73055919,29.39354338,20.47091551,
                    26.63434752,20.06704205,24.47888119]
        }
# Create DataFrame
data = pd.DataFrame(data)
# define data types
data[['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ',
      'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']] = data[['BAD', 'LOAN', 'MORTDUE',
      'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']].apply(pd.to_numeric)
data[['REASON', 'JOB']] = data[['REASON', 'JOB']].apply(lambda x: x.astype('category'))
print(data.dtypes)
data.dropna(axis=0, how='any',inplace=True)
# test train split
X = data.drop('BAD', axis=1)
y = data.BAD
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12345)
labels = X_train.columns
# model training (error cause)
X_shared = theano.shared(X_train)
with pm.Model() as logistic_model_pred:
    pm.glm.GLM(x=X_shared,
               y=y_train,
               labels=labels,
               family=pm.glm.families.Binomial())
# Prediction on test data
X_shared = theano.shared(X_test)
ppc = pm.sample_ppc(pred_trace,
                    model=logistic_model_pred,
                    samples=100)
# AUC
np.mean(ppc['y'], axis=0).shape
y_score = np.mean(ppc['y'], axis=0)
roc_auc_score(y_score=np.mean(ppc['y'], axis=0),
              y_true=y_test)
pred_scores = dict(y_true=y_test, y_score=y_score)
cols = ['False Positive Rate', 'True Positive Rate', 'threshold']
roc = pd.DataFrame(dict(zip(cols, roc_curve(**pred_scores))))
precision, recall, ts = precision_recall_curve(y_true=y_test, probas_pred=y_score)
pr_curve = pd.DataFrame({'Precision': precision, 'Recall': recall})
f1 = pd.Series({t: f1_score(y_true=y_test, y_pred=y_score>t) for t in ts})
best_threshold = f1.idxmax()
# Alternative solution
with pm.Model() as logistic_model_pred:
    pm.glm.GLM.from_formula('BAD ~ DELINQ + DEROG + DEBTINC + NINQ + '
                            'CLNO + VALUE + MORTDUE + YOJ + LOAN + CLAGE + JOB',
                            # reset both indexes so the rows stay aligned after the split
                            data=pd.concat([y_train.reset_index(drop=True),
                                            X_train.reset_index(drop=True)], axis=1),
                            family=pm.glm.families.Binomial())
    pred_trace = pm.sample(tune=1500,
                           draws=1000,
                           chains=4,
                           cores=1,
                           init='adapt_diag')
Answer
If you replace the code between your "# model training (error cause)" and "# AUC" comments with the code below, you should be able to run it and start getting some results:
# model training (error cause)
X_train2 = X_train[X_train.columns[0:3]].values
scaler = preprocessing.StandardScaler()
scaler.fit(X_train2)
X_train2 = scaler.transform(X_train2)
X_shared = theano.shared(X_train2) #theano.shared(X_train)
with pm.Model() as logistic_model_pred:
    pm.glm.GLM(x=X_shared,
               y=y_train.values,
               labels=labels[0:3],
               family=pm.glm.families.Binomial())
    trace = pm.sample()
# Prediction on test data
X_test2 = scaler.transform(X_test[X_train.columns[0:3]].values)
#X_shared = theano.shared(X_test2)
X_shared.set_value(X_test2)
ppc = pm.sample_ppc(trace,
                    model=logistic_model_pred,
                    samples=100)
# AUC
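I made the following changes:
- I changed the variable passed to theano.shared into a numpy array.
- This means the string columns of X_train need to be converted with one-hot encoding or something similar. I did not do that, so I only used the first 3 columns, which happen to be numeric.
- I also rescaled the inputs with a standard scaler before running pymc3.
- Finally, for the posterior prediction, I changed the value of the shared variable with .set_value.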
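The one-hot encoding step mentioned in the list could look roughly like the sketch below. It uses pandas.get_dummies on the training frame and then reindexes the test frame to the same columns, so the array later passed to X_shared.set_value has the same width as the training array. The helper name encode_like_train and the two small example frames are hypothetical and only for illustration; this step was not part of the code I ran.

```python
import pandas as pd

def encode_like_train(train_df, test_df, categorical_cols):
    """Hypothetical helper: one-hot encode both frames using the training columns."""
    train_enc = pd.get_dummies(train_df, columns=categorical_cols)
    test_enc = pd.get_dummies(test_df, columns=categorical_cols)
    # Categories unseen in training are dropped; categories missing from the
    # test split are added as all-zero columns.
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
    # theano.shared needs a plain numeric numpy array, not a DataFrame.
    return train_enc.astype(float).to_numpy(), test_enc.astype(float).to_numpy()

# Tiny made-up frames that mimic the LOAN and JOB columns (illustration only).
train_example = pd.DataFrame({'LOAN': [1700, 1800, 2300, 2400],
                              'JOB': ['Other', 'Other', 'Office', 'Mgr']})
test_example = pd.DataFrame({'LOAN': [2900, 3000],
                             'JOB': ['Office', 'ProfExe']})
X_train_arr, X_test_arr = encode_like_train(train_example, test_example, ['JOB'])
print(X_train_arr.shape, X_test_arr.shape)
```

The scaler from the code above would then be fitted on the encoded training array (or only on its numeric columns) before calling theano.shared and set_value.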
Sampling produced many divergences, but I think the above gives you the basic setup.
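As a cross-check on the shared-variable route, test-set probabilities can also be computed directly from posterior draws of the coefficients: apply the logistic function to the linear predictor for each draw, then average over the draws. The sketch below uses only numpy and made-up draws. The function name posterior_mean_proba is hypothetical, and pulling the intercept and coefficient arrays out of a real pymc3 trace (for example trace['Intercept'] and one entry per label) is an assumption about how the GLM names its variables, so check the trace's variable names first.

```python
import numpy as np

def posterior_mean_proba(intercept_draws, coef_draws, X):
    """Average logistic-regression probabilities over posterior draws.

    intercept_draws: shape (n_draws,)
    coef_draws:      shape (n_draws, n_features)
    X:               shape (n_obs, n_features), scaled like the training data
    """
    intercept_draws = np.asarray(intercept_draws, dtype=float)
    coef_draws = np.asarray(coef_draws, dtype=float)
    X = np.asarray(X, dtype=float)
    # Linear predictor for every draw and observation: shape (n_draws, n_obs).
    logits = intercept_draws[:, None] + coef_draws @ X.T
    probs = 1.0 / (1.0 + np.exp(-logits))
    # Posterior mean probability per observation: shape (n_obs,).
    return probs.mean(axis=0)

# Made-up draws standing in for a real trace (illustration only).
rng = np.random.default_rng(0)
fake_intercept = rng.normal(0.0, 0.1, size=200)
fake_coefs = rng.normal([1.0, -0.5, 0.0], 0.1, size=(200, 3))
X_new = np.array([[0.0, 0.0, 0.0],
                  [2.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0]])
y_score = posterior_mean_proba(fake_intercept, fake_coefs, X_new)
print(y_score)
```

The resulting y_score can be passed to roc_auc_score in the same way as the mean of ppc['y'].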