python - How do I use SHAP and PDP? And the "Number of features of the model must match the input" error
Problem description
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/kaglee/pokemon_alopez247.csv")
I preprocessed the data.
df.columns
Index(['dark', 'dragon', 'electric', 'fairy', 'fighting', 'fire', 'flying',
'ghost', 'grass', 'ground', 'ice', 'normal', 'poison', 'psychic',
'rock', 'steel', 'Ditto', 'bug', 'dragon', 'fairy', 'field', 'flying',
'grass', 'human-Like', 'mineral', 'monster', 'water_1', 'water_2',
'water_3', 'legend', 'Color_Blue', 'Color_Brown', 'Color_Green',
'Color_Grey', 'Color_Pink', 'Color_Purple', 'Color_Red', 'Color_White',
'Color_Yellow', 'body_bipedal_tailless', 'body_four_wings',
'body_head_arms', 'body_head_base', 'body_head_legs', 'body_head_only',
'body_insectoid', 'body_multiple_bodies', 'body_quadruped',
'body_serpentine_body', 'body_several_limbs', 'body_two_wings',
'body_with_fins'],
dtype='object')
df.shape
(721, 52)
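The column listing above shows a few names more than once: `dragon`, `fairy`, `flying` and `grass` each appear twice, presumably once as a type and once as an egg group. A quick way to surface that is sketched below on a small hypothetical frame. `demo` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature frame that mimics the repeated names in df.columns
demo = pd.DataFrame(
    [[1, 0, 0, 1], [0, 1, 1, 0]],
    columns=["dragon", "bug", "dragon", "legend"],
)

# keep=False marks every occurrence of a repeated label, not only the later ones
dup_mask = demo.columns.duplicated(keep=False)
duplicated_names = sorted(set(demo.columns[dup_mask]))

print(duplicated_names)
print(demo.columns.is_unique)
```

Running the same two lines on the real `df` would list whichever labels are repeated there.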
X = df.drop('legend',axis=1)
y = df['legend']
X.shape,y.shape # ((721, 51), (721,))
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, test_size=0.20,
stratify=y)
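# Carve the validation set out of the training portion so it does not overlap the test set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, stratify=y_train, test_size=0.20)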
# Logistic model
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_train_scaled
X_train_scaled.T[0].mean(), X_train_scaled.T[0].std()
model = LogisticRegression(random_state=1)
model.fit(X_train_scaled, y_train)
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_val_scaled)
accuracy_score(y_val, y_pred) # 0.9310344827586207
X_test_scaled = scaler.transform(X_test)
y_pred_test = model.predict(X_test_scaled)
accuracy_score(y_test, y_pred_test) # 0.9448275862068966
# RandomForest model
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
rfc = make_pipeline(RandomForestClassifier(criterion = 'entropy',n_jobs=-1, random_state=2,n_estimators=200))
rfc.fit(X_train, y_train)
print('test accuracy', rfc.score(X_test, y_test)) # 0.9379
# PDP , SHAP
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 80
from pdpbox.pdp import pdp_isolate, pdp_plot
import shap
import numpy as np
feature = 'bug'
isolated = pdp_isolate(
model=model,
dataset=X_train,
model_features=X_train.columns,
feature=feature
);
ValueError: Unexpected input dimension 59, expected 51
ValueError: Number of features of the model must match the input. Model n_features is 51 and input n_features is 59
Both of these errors appear.
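The numbers in the error can be reproduced without any model. `X` has 51 columns, but four of the labels occur twice. Selecting columns by label from such a frame returns every match for every label, so the eight duplicated entries contribute 16 columns and the total becomes 43 + 16 = 59. The sketch below uses a stand-in frame (`X_demo` and the `f0` to `f42` names are hypothetical). It also assumes the plotting library selects `dataset[model_features]` internally, which the question does not confirm.

```python
import numpy as np
import pandas as pd

# 43 unique labels plus 4 labels that each appear twice: 51 columns in total
unique_names = [f"f{i}" for i in range(43)]          # hypothetical placeholders
dup_names = ["dragon", "fairy", "flying", "grass"]   # repeated in the question's df
columns = unique_names + dup_names + dup_names

X_demo = pd.DataFrame(np.zeros((3, len(columns))), columns=columns)

# Label-based selection returns both copies for each of the 8 duplicated entries
selected = X_demo[X_demo.columns]

print(X_demo.shape[1], selected.shape[1])
```

If that assumption holds, the model trained on 51 features is being handed a 59-column matrix, which matches both error messages.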
I am posting because, even after searching Stack Overflow, I cannot work out what I did wrong.
The error appears no matter which column name I assign to feature.
Solution
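One way to get past the error, sketched under assumptions rather than offered as the definitive fix: give every column a unique label before splitting, and hand the explainer the same model and the same (scaled or unscaled) matrix that model was trained on. In the question, `model` is the logistic regression fitted on `X_train_scaled`, while `X_train` is unscaled, so that pairing is also inconsistent. The helper `make_unique` and the demo data are hypothetical. scikit-learn's `partial_dependence` is swapped in for `pdp_isolate` so the example does not depend on a particular pdpbox version.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence


def make_unique(names):
    """Append _2, _3, ... to repeated labels (hypothetical helper).

    Assumes the suffixed label does not already exist among the names.
    """
    seen = {}
    out = []
    for name in names:
        seen[name] = seen.get(name, 0) + 1
        out.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return out


# Made-up binary features standing in for the one-hot columns in the question
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.integers(0, 2, size=(200, 4)).astype(float),
    columns=make_unique(["dragon", "bug", "dragon", "fairy"]),
)
y_demo = ((X_demo["bug"] == 1) & (X_demo["dragon"] == 1)).astype(int)

# Fit on the unscaled frame and explain with that same frame
forest = RandomForestClassifier(n_estimators=50, random_state=2)
forest.fit(X_demo, y_demo)

# Average partial dependence of the prediction on the 'bug' column
pd_result = partial_dependence(forest, X_demo, features=["bug"], kind="average")
print(list(X_demo.columns))
print(pd_result["average"])
```

With unique labels, `X[X.columns]` returns exactly the columns the model saw during fitting, so the feature count no longer changes between training and explanation.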