python - 训练损失和验证损失 GradientBoostingClassifier
问题描述
我正在学习对 7 个类的 Cover Type 数据进行分类。我使用 scikit-learn 的 GradientBoostingClassifier 训练我的模型。当我尝试绘制我的损失函数时,如下所示:
这种情节是否表明我的模型存在高方差?如果是,我该怎么办?而且我不知道为什么在迭代 200 到 500 的中间,绘图的形状像一个矩形。
(编辑) 要编辑这篇文章,我不确定我的代码有什么问题,因为我只是使用常规代码来拟合训练数据。我正在使用 jupyter 笔记本。所以我只提供代码
Y = train["Cover_Type"]
X = train.drop({"Cover_Type"}, axis=1)
#split training data dan cross validation
from sklearn.model_selection import train_test_split
X_train, X_val, Y_train, Y_val = train_test_split(X,Y,test_size=0.3,random_state=42)
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingClassifier
params = {'n_estimators': 1000,'learning_rate': 0.3, 'max_features' : 'sqrt'}
dtree=GradientBoostingClassifier(**params)
dtree.fit(X_train,Y_train)
#mau lihat F1-Score
from sklearn.metrics import f1_score
Y_pred = dtree.predict(X_val) #prediksi data cross validation menggunakan model tadi
print Y_pred
score = f1_score(Y_val, Y_pred, average="micro")
print("Gradient Boosting Tree F1-score: "+str(score)) # I got 0.86 F1-Score
import matplotlib.pyplot as plt
# Plot training deviance
# compute test set deviance
val_score = np.zeros((params['n_estimators'],), dtype=np.float64)
for i, Y_pred in enumerate(dtree.staged_predict(X_val)):
val_score[i] = dtree.loss_(Y_val, Y_pred.reshape(-1, 1))
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title('Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, dtree.train_score_, 'b-',
label='Training Set Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, val_score, 'r-',
label='Validation Set Deviance')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Deviance')
解决方案
There are several issues that I will explain them one by one, also I have added the correct code for your example.
staged_predict(X)
method shall NOT be used- As
staged_predict(X)
outputs the predicted class instead of predicted probabilities, it is not correct to use that. - One can (where the context accept) use
staged_decision_function(X)
method and pass the computed decisions at each stage to themodel.loss_
attribute. But in this example, it does not work (the loss based on staged decision increases while the loss decreases).
- As
You should use
staged_predict_proba(X)
with cross entropy loss- you should use
staged_predict_proba(X)
- you also need to define a function that calculate the cross entropy loss at each stage.
- I have provided the code below. Note that I set the verbosity to 2, and then you can see that the sklearn training loss at each stage is the same as our loss (as a sanity check that our approach works correctly).
- you should use
Why you have big jumps
- I think the reason is that GBC becomes very confident and then predict a label is 1 (as an example) with probability one, while it is not correct (for example the label is 2). This creates big jumps (as cross entropy goes to infinity). In such a scenario you should change your GBC parameters.
The code and the Plot are given below
- The code is:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
def _cross_entropy_like_loss(model, input_data, targets, num_estimators):
loss = np.zeros((num_estimators, 1))
for index, predict in enumerate(model.staged_predict_proba(input_data)):
loss[index, :] = -np.sum(np.log([predict[sample_num, class_num-1]
for sample_num, class_num in enumerate(targets)]))
print(f'ce loss {index}:{loss[index, :]}')
return loss
covtype = fetch_covtype()
X = covtype.data
Y = covtype.target
n_estimators = 10
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.3, random_state=42)
clf = GradientBoostingClassifier(n_estimators=n_estimators, learning_rate=0.3, verbose=2 )
clf.fit(X_train, Y_train)
tr_loss_ce = _cross_entropy_like_loss(clf, X_train, Y_train, n_estimators)
test_loss_ce = _cross_entropy_like_loss(clf, X_val, Y_val, n_estimators)
plt.figure()
plt.plot(np.arange(n_estimators) + 1, tr_loss_ce, '-r', label='training_loss_ce')
plt.plot(np.arange(n_estimators) + 1, test_loss_ce, '-b', label='val_loss_ce')
plt.ylabel('Error')
plt.xlabel('num_components')
plt.legend(loc='upper right')
- The output of console is like below, from which you can easily verify the approach is correct.
Iter Train Loss Remaining Time
1 482434.6631 1.04m
2 398501.7223 55.56s
3 351391.6893 48.51s
4 322290.3230 41.60s
5 301887.1735 34.65s
6 287438.7801 27.72s
7 276109.2008 20.82s
8 268089.2418 13.84s
9 261372.6689 6.93s
10 256096.1205 0.00s
ce loss 0:[ 482434.6630936]
ce loss 1:[ 398501.72228276]
ce loss 2:[ 351391.68933547]
ce loss 3:[ 322290.32300604]
ce loss 4:[ 301887.17346783]
ce loss 5:[ 287438.7801033]
ce loss 6:[ 276109.20077844]
ce loss 7:[ 268089.2418214]
ce loss 8:[ 261372.66892149]
ce loss 9:[ 256096.1205235]
推荐阅读
- reactjs - react-codemirror2 提示功能
- for-loop - 循环的vuejs计算属性不会增加
- proxy - OpenSIPS 2.4 调用被禁止
- android-fragments - 如何防止 onViewCreated() 中两次调用 observe() 方法?
- python - 解析 600k 行数据并为每一行发出 POST 请求
- winforms - 从另一个列表框中的 selected.item 填充列表框
- python - Osmnx:从 shapefile/XML 创建图形
- rule-engine - 迭代决策表 _ODM 中的列
- react-native - 如何检测react-native(react-navigation)中的向后滑动动作?
- python - 无法通过python连接到tor网站