Why does my logistic regression only produce one class?

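Question

I tried my first machine learning project using a fictional dataset from Kaggle that contains 1470 records. 84% of the records belong to class "0" and 16% to class "1". I used 1200 records for training and testing, and held back 270 as new data to see what would happen. I ended up with a training score of 87% and a testing score of 83%, but all 270 new records were classified as 0.

Could it be that the fictional data does not contain enough of a pattern to teach the machine how to classify? Or am I doing something wrong?

I have read several other posts that seem to cover similar problems, but I did not find a relevant answer. Any help would be appreciated.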

import pandas as pd

df = pd.read_csv('Resources/train_data.csv')

# Drop the columns that are not used as features, then remove duplicate rows.
df_skinny = df.drop(['EducationField', 'EmployeeCount', 'EmployeeNumber', 'index',
                     'StandardHours',
                     'JobRole', 'MaritalStatus', 'DailyRate', 'MonthlyRate', 'HourlyRate', 'Over18', 'OverTime'],
                    axis=1).drop_duplicates()
df_skinny.rename(columns={"Attrition": "EmploymentStatus"}, inplace=True)
df_skinny['EmploymentStatus'] = df_skinny['EmploymentStatus'].replace(['Yes', 'No'], [1, 0])

# Encode the remaining categorical columns as integers.
df_skinny['Gender'] = df_skinny['Gender'].replace(['Female', 'Male'], [0, 1])
df_skinny['BusinessTravel'] = df_skinny['BusinessTravel'].replace([' Travel_Rarely', 'Travel_Frequently', 'Non-Travel'], [1, 2, 0])
df_skinny['Department'] = df_skinny['Department'].replace(['Human Resources', 'Sales', 'R&D '], [0, 1, 2])

df_train=df_skinny[:1200]
df_new=df_skinny[1201:]

X =df_train.drop("EmploymentStatus", axis=1)
y = df_train["EmploymentStatus"]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler().fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()

classifier.fit(X_train_scaled, y_train)

print(f"Training Data Score: {classifier.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test_scaled, y_test)}")

predictions = classifier.predict(X_test_scaled)
print(f"First 30 Predictions:   {predictions[:30]}")
print(f"First 30 Actual Employment Status: {y_test[:30].tolist()}")

new_X = df_new.drop("EmploymentStatus", axis=1)
new_predictions=classifier.predict(new_X)
print(new_predictions)

ynew = classifier.predict_proba(new_X)
print(ynew)

OUTPUT:
Training Data Score: 0.8655555555555555
Testing Data Score: 0.8333333333333334

First 30 Predictions:   [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0]

First 30 Actual Employment Status: [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0] 

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0]

[[1.00000000e+000 0.00000000e+000]
 [1.00000000e+000 0.00000000e+000]
 [1.00000000e+000 0.00000000e+000]
 [1.00000000e+000 0.00000000e+000]
 [1.00000000e+000 5.24119991e-298]
 [1.00000000e+000 7.88999798e-158]
 [1.00000000e+000 2.73485216e-286]
 [1.00000000e+000 0.00000000e+000]
 [1.00000000e+000 0.00000000e+000]

Tags: python, logistic-regression

Answer


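As you mentioned, 84% of the data belongs to class 0 and 16% to class 1. That is a heavily imbalanced dataset, and a model trained on it tends to be biased toward the majority class. That is why most of your predictions are 0.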
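To see why the accuracy scores alone can hide this, here is a minimal sketch in plain Python. It assumes a hypothetical set of 300 labels with the same 84/16 split as in the question (not the actual data) and scores a "model" that always predicts 0. The helper names `accuracy` and `recall_for_class` are illustrative only.

```python
# Hypothetical labels mirroring the 84% / 16% split described in the question.
y_true = [0] * 252 + [1] * 48          # 300 records: 84% class 0, 16% class 1
y_pred = [0] * len(y_true)             # a "model" that always predicts class 0


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall_for_class(y_true, y_pred, label):
    """Of the records whose true class is `label`, the fraction predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


print(f"Accuracy:       {accuracy(y_true, y_pred):.2f}")
print(f"Recall class 0: {recall_for_class(y_true, y_pred, 0):.2f}")
print(f"Recall class 1: {recall_for_class(y_true, y_pred, 1):.2f}")
```

The always-0 predictor reaches roughly the same accuracy as the 83% test score in the question while never finding a single class 1 record, so per-class recall or a confusion matrix (for example `sklearn.metrics.classification_report` and `confusion_matrix`) is usually more informative here than the plain score.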

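A good dataset has roughly balanced data across all classes, so you need a technique such as random sampling to balance it. There are two kinds of sampling: oversampling and undersampling.

I suggest you first apply a sampling technique to balance your data.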
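As a rough illustration of random oversampling, the sketch below duplicates randomly chosen minority-class rows until every class is as large as the biggest one. It uses only the standard library and hypothetical toy data; the function name `random_oversample` and its `seed` parameter are assumptions made for this example, not part of any library.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    # Note: the result is grouped by class; shuffle it if order matters.
    return out_rows, out_labels


# Hypothetical toy data: 84 rows of class 0 and 16 rows of class 1.
rows = [[i] for i in range(100)]
labels = [0] * 84 + [1] * 16

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

Resampling like this is normally applied to the training split only, after `train_test_split`, so that the test set keeps the original class distribution and the reported scores stay honest.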

You can learn more from the following article: https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/

You can also refer to this notebook: https://www.kaggle.com/shweta2407/oversampling-vs-undersampling-techniques
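Separately from the class balance, one detail in the question's code may be worth checking: the model is fitted on `X_train_scaled`, but `new_X` is passed to `predict` and `predict_proba` without going through `X_scaler.transform`. Probabilities of exactly 1.0 next to values around 1e-298 would be consistent with a model receiving unscaled inputs, although that is an inference from the posted output and has not been verified on the data. The sketch below illustrates the effect in plain Python; the feature values, coefficient, and intercept are all hypothetical and do not come from the question's model.

```python
import math
import statistics


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical training values for one feature (say, a monthly income column).
train_income = [2000.0, 3500.0, 5000.0, 6500.0, 8000.0]
mean = statistics.mean(train_income)
std = statistics.pstdev(train_income)   # population std, as StandardScaler uses

# Hypothetical coefficient and intercept learned on the *scaled* feature.
coef, intercept = -0.8, -1.5

new_income = 6000.0
scaled = (new_income - mean) / std

# Input prepared the same way as the training data.
p_scaled = sigmoid(coef * scaled + intercept)
# Raw value fed to a model that was trained on scaled data.
p_raw = sigmoid(coef * new_income + intercept)

print(f"P(class 1) with scaling:    {p_scaled:.3f}")
print(f"P(class 1) without scaling: {p_raw:.3e}")
```

With the question's own variable names, the corresponding change would be `classifier.predict(X_scaler.transform(new_X))`, and likewise for `predict_proba`.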

