Named entity recognition model always predicts the same class, yet accuracy reaches 99%

Problem description

I want to build a Keras NER model that tags profanity/swear words in text. I have a dataset of over 50k rows/sentences, but only 2,000 of those 50k rows contain profanity. I have trained my model both on the full dataset and on only the rows that contain profanity, and I get the same result: the loss is below 0.1 and the accuracy is above 99%, yet at prediction time every word gets the same tag (as if none of the words were profanity).

I enumerate all the words and labels in each row:

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

max_len = 50

# each s is a list of (word, label) tuples; index 0 is the unknown-word index
X = [[word2idx.get(w[0], 0) for w in s] for s in list_of_sentances]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=vocab_len-1)

y = [[label2idx[w[1]] for w in s] for s in list_of_sentances]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=label2idx["O"])
y = [to_categorical(i, num_classes=num_labels) for i in y]

Here is my model:

from tensorflow.keras.layers import (Input, Embedding, SpatialDropout1D,
                                     Bidirectional, LSTM, TimeDistributed, Dense)
from tensorflow.keras.models import Model

input_word = Input(shape=(max_len, ))

model = Embedding(input_dim = vocab_len+1, output_dim = 75, input_length = max_len)(input_word)
model = SpatialDropout1D(0.25)(model)
model = Bidirectional(LSTM(units = 50, return_sequences=True, recurrent_dropout = 0.2))(model)
out = TimeDistributed(Dense(num_labels, activation = "softmax"))(model)

model = Model(input_word, out)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_3 (InputLayer)         [(None, 50)]              0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 50, 75)            1506000   
_________________________________________________________________
spatial_dropout1d_2 (Spatial (None, 50, 75)            0         
_________________________________________________________________
bidirectional_2 (Bidirection (None, 50, 100)           50400     
_________________________________________________________________
time_distributed_2 (TimeDist (None, 50, 3)             303       
=================================================================
Total params: 1,556,703
Trainable params: 1,556,703
Non-trainable params: 0

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

opt = Adam(lr = 0.000075)
model.compile(optimizer = opt, loss="categorical_crossentropy", metrics=["accuracy"])

es = EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=2, verbose=0, mode='auto')
history = model.fit(x_train, 
                    y_train, 
                    validation_data=(x_test, y_test),
                    epochs=100, 
                    batch_size=64,
                    callbacks = [es], 
                    verbose=2)

score = model.evaluate(x_test, y_test, batch_size=64)
print("\nSCORE:", score)

Model training results:

...
...
Epoch 55/100
4846/4846 - 8s - loss: 0.0193 - acc: 0.9940 - val_loss: 0.0307 - val_acc: 0.9933
1212/1212 [==============================] - 0s 254us/sample - loss: 0.0307 - acc: 0.9933

Prediction (sorry for the bad words):

max_len = 50
list_of_sentances = ["Fucking fuck fuck you asshole bullshit fuck you bitch"]
word_num = list_of_sentances[0].split(" ")
word_num = len(word_num)

test = [[word2idx.get(w[0], 0) for w in s] for s in list_of_sentances]
test = pad_sequences(maxlen=max_len, sequences=test, padding="post", value=vocab_len-1)

pred = model.predict(test)
pred = pred.argmax(axis=-1)[0][:word_num]

labels = {v: k for k, v in label2idx.items()}

prediction = [labels[word] for word in pred]

print(labels)
print(prediction)

{0: 'O', 1: 'profanity'}
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

Can you tell me what I am doing wrong? I used the same approach for an NER model that finds organization names, person names, etc., and got good results (this is the tutorial I followed: https://djajafer.medium.com/named-entity-recognition-and-classification-with-keras-4db04e22503d).

I can't use class_weights because I have sequences. Here is what my "classes" look like:

No shit .           O profanity O
Ya bitch !          profanity profanity O
Shut the fuck up!   profanity profanity profanity profanity
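(For what it's worth, per-timestep weighting is possible in Keras even for sequences: older versions accept sample_weight_mode="temporal" in compile() together with a (num_samples, max_len) weight matrix passed as sample_weight to fit(), and newer tf.keras versions accept a 2-D sample_weight directly. A sketch, where the helper name make_token_weights and the weight of 25.0 are purely illustrative:)

```python
import numpy as np

# Build a (num_samples, max_len) weight matrix that up-weights profanity
# tokens. y_int is the padded integer label matrix from above (before
# to_categorical), with label2idx = {"O": 0, "profanity": 1} assumed.
def make_token_weights(y_int, profanity_idx=1, profanity_weight=25.0):
    weights = np.ones_like(y_int, dtype="float32")
    weights[y_int == profanity_idx] = profanity_weight
    return weights

# Usage sketch (older Keras API):
# model.compile(optimizer=opt, loss="categorical_crossentropy",
#               sample_weight_mode="temporal", metrics=["accuracy"])
# model.fit(x_train, y_train, sample_weight=make_token_weights(y_train_int), ...)
```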

Tags: python, tensorflow, machine-learning, keras, named-entity-recognition

Solution


As other members have said, the metric reports 99% accuracy because it counts both profane and non-profane words: since the dataset is imbalanced, labelling every word as non-profanity already yields a very high accuracy.
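The effect can be reproduced in a few lines; the token-level label distribution below is hypothetical but mirrors the question's imbalance (roughly 99% "O" tokens):

```python
import numpy as np

# Hypothetical token-level labels: 0 = "O", 1 = "profanity", ~1% profane.
rng = np.random.default_rng(0)
y_true = rng.choice([0, 1], size=100_000, p=[0.99, 0.01])

# A degenerate model that tags every token as "O" (never predicts profanity).
y_pred = np.zeros_like(y_true)

# Plain accuracy still comes out around 0.99, despite zero profanity recall.
accuracy = (y_true == y_pred).mean()
print(f"accuracy of the all-'O' predictor: {accuracy:.4f}")
```

So the 99% reported during training says almost nothing about how well the profanity class itself is being learned.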

You should probably use the F-score metric (a combination of precision and recall), widely used in ML/NLP, because it focuses on a specific class. In short, it ignores true negatives (non-profanity correctly identified as such) and instead looks at true positives (profanity identified as profanity), taking into account false positives (non-profanity flagged as profanity) for precision and false negatives (profanity that goes unrecognized) for recall.
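One way to get these per-class numbers is to flatten the predicted and gold label sequences, drop the padded positions, and score with scikit-learn; a sketch, where the short arrays stand in for flattened, padding-masked token labels (in the question's setup they would come from model.predict(x_test).argmax(-1) and y_test.argmax(-1)):

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical flattened token labels: 0 = "O", 1 = "profanity".
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 0, 1])
y_pred = np.array([0, 0, 1, 0, 0, 1, 0, 1, 0, 1])

# Unlike plain accuracy, the report separates the classes: a model that
# never predicts "profanity" would show recall 0.0 on that row.
print(classification_report(y_true, y_pred,
                            target_names=["O", "profanity"], digits=3))
```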
