python - How can I use an ensemble model to improve the classification report for one class?

Problem description

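I have a dataset with the following class distribution:

{0: 6624, 1: 75}, where 0 denotes non-observation sentences and 1 denotes observation sentences. (Basically, I annotate my sentences using named entity recognition: if a sentence contains specific entities such as DATA, TIME, or LONG (coordinates), I label it 1.)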

Now I want to build a model to classify them. The best model I have made (CV = 3 for all models) is:

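from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
clf = SGDClassifier()
trial_05 = Pipeline([("vect", vec), ("clf", clf)])  # vec: the vectorizer, defined elsewhere (not shown)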

which gives:

                  precision    recall  f1-score   support

           0       1.00      1.00      1.00      6624
           1       0.73      0.57      0.64        75

   micro avg       0.99      0.99      0.99      6699
   macro avg       0.86      0.79      0.82      6699
weighted avg       0.99      0.99      0.99      6699

[[6611   37]
 [  13   38]]
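For readers who want to produce a report like the one above, here is a minimal sketch of a 3-fold cross-validated evaluation. The toy sentences, the `TfidfVectorizer` standing in for `vec`, and `random_state=0` are all assumptions for illustration; the original vectorizer and data are not shown in the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = observation sentence, 0 = non-observation sentence.
sentences = [
    "measured at 10:30 on 2019-03-02 near 52.1 N",
    "recorded at 08:15 on 2019-04-11 near 13.4 E",
    "observed at 17:45 on 2019-05-20 near 48.8 N",
    "the committee discussed the budget",
    "we thank the reviewers for their comments",
    "this section describes related work",
    "the method is explained in the appendix",
    "future work will extend the approach",
    "the results are summarised below",
]
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0]

vec = TfidfVectorizer()  # assumption: stands in for the undisclosed `vec`
trial_05 = Pipeline([("vect", vec), ("clf", SGDClassifier(random_state=0))])

# Out-of-fold predictions: each sentence is predicted by a model that never saw it.
y_pred = cross_val_predict(trial_05, sentences, labels, cv=3)

print(classification_report(labels, y_pred, zero_division=0))
# scikit-learn convention: rows are true classes, columns are predicted classes.
print(confusion_matrix(labels, y_pred))
```

With only 75 positive sentences, each of the three folds holds about 25 of them, so the class-1 scores can move noticeably between runs.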

This is the result of the SGD model trained on resampled data:

                  precision    recall  f1-score   support

           0       1.00      0.92      0.96      6624
           1       0.13      1.00      0.22        75

   micro avg       0.92      0.92      0.92      6699
   macro avg       0.56      0.96      0.59      6699
weighted avg       0.99      0.92      0.95      6699

[[6104    0]
 [ 520   75]]
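The question does not say how the data was resampled. One common option is random oversampling of the minority class, sketched below purely as an assumption about what "resampled" might mean here; the helper name `oversample_minority` and the toy data are hypothetical.

```python
import random
from collections import Counter


def oversample_minority(texts, labels, minority=1, seed=0):
    """Randomly duplicate minority-class samples until both classes have equal size.

    Assumes a binary problem in which `minority` is the rarer label.
    """
    rng = random.Random(seed)
    minority_texts = [t for t, y in zip(texts, labels) if y == minority]
    majority = [(t, y) for t, y in zip(texts, labels) if y != minority]
    # Keep every original minority sample and draw the shortfall with replacement.
    shortfall = max(0, len(majority) - len(minority_texts))
    extra = rng.choices(minority_texts, k=shortfall)
    new_texts = [t for t, _ in majority] + minority_texts + extra
    new_labels = [y for _, y in majority] + [minority] * (len(minority_texts) + shortfall)
    return new_texts, new_labels


# Hypothetical toy data: five non-observation sentences, one observation sentence.
texts = ["a", "b", "c", "d", "e", "obs"]
labels = [0, 0, 0, 0, 0, 1]
bal_texts, bal_labels = oversample_minority(texts, labels)
print(Counter(bal_labels))
```

If resampling is used together with cross-validation, it should be applied to the training portion of each fold only; otherwise duplicated positives leak into the evaluation data and inflate recall.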

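As you can see, the problem in both cases is class 1: in the first case we have fairly good precision and f1-score, while in the second case we have very good recall.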

So I decided to combine both models in an ensemble, like this:

from sklearn.ensemble import VotingClassifier
# create a list of (name, model) pairs for our models
estimators = [("trail_05", trial_05), ("resampled", SGD_RESAMPLED_Model)]
# create our voting classifier, inputting our models
ensemble = VotingClassifier(estimators, voting='hard')

Now I get this result:

                precision    recall  f1-score   support

           0       0.99      1.00      1.00      6624
           1       0.75      0.48      0.59        75

   micro avg       0.99      0.99      0.99      6699
   macro avg       0.87      0.74      0.79      6699
weighted avg       0.99      0.99      0.99      6699

[[6612   39]
 [  12   36]]
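The class-1 figures in this report can be checked by hand from the printed matrix. The sketch below assumes that rows are predicted classes and columns are true classes, since the columns sum to the supports 6624 and 75; that is the transpose of the default layout returned by scikit-learn's `confusion_matrix`.

```python
# Confusion matrix as printed for the ensemble.
# Assumption: rows = predicted class, columns = true class.
matrix = [[6612, 39],
          [12, 36]]

tp = matrix[1][1]  # predicted 1, truly 1
fp = matrix[1][0]  # predicted 1, truly 0
fn = matrix[0][1]  # predicted 0, truly 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.75 recall=0.48 f1=0.59
```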

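As you can see, the ensemble model has better precision for class 1, but worse recall and f1-score, which leads to a worse confusion matrix for class 1 (36 TP versus 38 TP for class 1).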
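One possible explanation for the lower recall, offered as a sketch rather than a diagnosis: with only two estimators and `voting='hard'`, every disagreement is a tie, and scikit-learn documents that ties are resolved by ascending class label order, so class 0 wins. Under that rule the ensemble predicts 1 only when both models predict 1. The function `hard_vote` and the prediction lists below are hypothetical and only mimic that behaviour in plain Python.

```python
def hard_vote(votes):
    """Majority vote over class labels; ties go to the smallest label."""
    counts = {}
    for v in votes:
        counts[v] = counts.get(v, 0) + 1
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)


# Hypothetical predictions of the two models for four sentences.
precise_model = [1, 0, 0, 1]
high_recall_model = [1, 1, 0, 0]

ensemble_pred = [hard_vote(pair) for pair in zip(precise_model, high_recall_model)]
print(ensemble_pred)  # [1, 0, 0, 0]
```

In this setting the ensemble cannot find more true positives than the stricter of the two models, which matches the direction of the change reported here (precision slightly up, recall down).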

My goal is to improve the TP count for class 1 (that is, the f1-score and recall of class 1).

What do you suggest to improve the TP count for class 1 (f1-score and recall of class 1)? More generally, do you have any thoughts on my workflow?

I have already tried parameter tuning, but it did not improve the SGD model.

Tags: python, nlp, imbalanced-data
