python - Precision drops significantly when a classifier trained on undersampled data is tested on the entire dataset
Problem description
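I am working on the Kaggle credit card fraud detection problem. There is a severe imbalance between Class = 1 (fraudulent transactions) and Class = 0 (non-fraudulent transactions). To compensate, I undersampled the data so that the ratio of fraudulent to non-fraudulent transactions is 1:1 (492 each). When I train my logistic regression classifier on the undersampled, balanced data, it performs well. However, when I take the same classifier and test it on the entire dataset, recall is still good but precision drops sharply.
I know that high recall matters more for this kind of problem, but I would still like to understand why the precision tanks and whether that is acceptable.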
Code:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
def model_report(y_test, pred):
    # Print the four headline metrics for the positive class (Class = 1)
    print("Accuracy:\t", accuracy_score(y_test, pred))
    print("Precision:\t", precision_score(y_test, pred))
    print("RECALL:\t\t", recall_score(y_test, pred))
    print("F1 Score:\t", f1_score(y_test, pred))
df = pd.read_csv("data/creditcard.csv")
target = 'Class'
# Separate features from the target; y is a 1-D Series, the shape fit() expects
X = df.drop(columns=target)
y = df[target]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print("WITHOUT UNDERSAMPLING:")
clf = LogisticRegression().fit(x_train, y_train)
pred = clf.predict(x_test)
model_report(y_test, pred)
# Creates the undersampled DataFrame with 492 fraud and 492 clean
minority_class_len = len(df[df[target] == 1])
minority_class_indices = df[df[target] == 1].index
majority_class_indices = df[df[target] == 0].index
random_majority_indices = np.random.choice(majority_class_indices, minority_class_len, replace=False)
undersample_indices = np.concatenate([minority_class_indices, random_majority_indices])
undersample = df.loc[undersample_indices]
# Separate features from the target in the balanced subset (y as a 1-D Series)
X_undersample = undersample.drop(columns=target)
y_undersample = undersample[target]
x_train, x_test, y_train, y_test = train_test_split(X_undersample, y_undersample, test_size=0.33, random_state=42)
print("\nWITH UNDERSAMPLING:")
clf = LogisticRegression().fit(x_train, y_train)
pred = clf.predict(x_test)
model_report(y_test, pred)
print("\nWITH UNDERSAMPLING & TESTING ON ENTIRE DATASET:")
# Note: X contains every row, including the undersampled rows used for training
pred = clf.predict(X)
model_report(y, pred)
Output:
WITHOUT UNDERSAMPLING:
Accuracy: 0.9989679423750093
Precision: 0.7241379310344828
RECALL: 0.5637583892617449
F1 Score: 0.6339622641509434
WITH UNDERSAMPLING:
Accuracy: 0.9353846153846154
Precision: 0.9673202614379085
RECALL: 0.9024390243902439
F1 Score: 0.9337539432176657
WITH UNDERSAMPLING & TESTING ON ENTIRE DATASET:
Accuracy: 0.9595936897618387
Precision: 0.03760913364674278
RECALL: 0.9105691056910569
F1 Score: 0.07223476297968398
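A note on the last evaluation: it scores the classifier on all rows of X, which includes the undersampled rows it was fitted on. One common alternative, sketched here with only the standard library, is to hold out a test split at the original class ratio first and undersample only the training part. The helper name `split_then_undersample` and the toy labels are hypothetical, made up for illustration; they are not part of the original script.

```python
import random


def split_then_undersample(labels, test_fraction=0.33, seed=42):
    """Return (balanced_train_idx, test_idx) for a list of 0/1 labels.

    Hypothetical helper: the test part keeps the original class ratio,
    and only the training part is undersampled to 1:1.
    """
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)

    # Hold out the test rows first so they never influence training.
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Undersample the majority class within the training rows only.
    minority = [i for i in train_idx if labels[i] == 1]
    majority = [i for i in train_idx if labels[i] == 0]
    # rng.sample raises ValueError if class 1 outnumbers class 0 here.
    sampled_majority = rng.sample(majority, len(minority))

    return minority + sampled_majority, test_idx


# Toy labels, made up for illustration: 50 positives among 10,000 rows.
labels = [1] * 50 + [0] * 9950
balanced_train_idx, test_idx = split_then_undersample(labels)

print("training rows:", len(balanced_train_idx))
print("test rows:", len(test_idx))
print("test positives:", sum(labels[i] for i in test_idx))
```

With a DataFrame, the returned positions could be passed to something like `df.iloc[balanced_train_idx]` for fitting and `df.iloc[test_idx]` for reporting, so the reported precision reflects the real class ratio without reusing training rows.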
Solution
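One way to see why precision collapses is to work backwards from the metrics in the output above. The sketch below assumes the public dataset's size of 284,807 rows with 492 fraud cases (the row count is not stated in the question). The confusion-matrix counts are not printed by the script; they are the integers that reproduce the reported precision, recall, and accuracy. The helper name `precision_from_rates` is hypothetical.

```python
def precision_from_rates(recall, false_positive_rate, prevalence):
    """Precision implied by recall, false-positive rate, and positive-class share."""
    true_pos = recall * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Counts as (TP, FP, FN, TN), reconstructed from the reported metrics.
# They are an assumption, not output of the original script.
cases = {
    # 325 rows = 33% of the 984 undersampled rows
    "balanced test split": (148, 5, 16, 156),
    # assumed 284,807 rows in total, 492 of them fraud
    "entire dataset": (448, 11464, 44, 272851),
}

for name, (tp, fp, fn, tn) in cases.items():
    recall = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    prevalence = (tp + fn) / (tp + fp + fn + tn)
    precision = precision_from_rates(recall, false_positive_rate, prevalence)
    print(name)
    print("  recall:             ", recall)
    print("  false-positive rate:", false_positive_rate)
    print("  fraud prevalence:   ", prevalence)
    print("  precision:          ", precision)
```

Under these assumptions the classifier behaves about the same in both evaluations: recall stays near 0.9 and the false-positive rate is roughly 3% on the balanced split and 4% on the full data. What changes is the share of fraud, from about half of the balanced split to about 0.17% of the full dataset. A false-positive rate of roughly 4% applied to some 284,000 legitimate transactions produces about 11,464 false alarms against 448 true detections, which is where the precision of about 0.038 comes from. Whether that is acceptable depends on the cost of reviewing a false alarm relative to the cost of missing a fraud.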