Training a Token Classification Transformer Model in TensorFlow

Problem Description

I am using TFXLMRobertaForTokenClassification from the transformers library, and I want to train an NER model on CoNLL-2003. The problem is that the model simply does not learn from the data. There is not much room for error elsewhere, which is why I suspect the data preprocessing is at fault.

So, I downloaded conll2003:

from datasets import load_dataset

datasets = load_dataset('conll2003')

After some transformations, I ended up with the following dataset:

print(type(tfdataset_test))

<class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>
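The post does not show the transformations themselves, but a `TensorSliceDataset` with these features is typically built by feeding a dict of equally shaped arrays to `tf.data.Dataset.from_tensor_slices`. A minimal sketch (the array contents here are dummy placeholders, not the real tokenizer output):

```python
import numpy as np
import tensorflow as tf

# Hypothetical tokenizer output for four sentences, already padded
# to a fixed length of 8 tokens.
encodings = {
    "input_ids": np.zeros((4, 8), dtype=np.int32),
    "attention_mask": np.ones((4, 8), dtype=np.int32),
    "labels": np.full((4, 8), -100, dtype=np.int32),
}

# Each element of the resulting dataset is one dict of per-sentence rows.
tfdataset = tf.data.Dataset.from_tensor_slices(encodings)
print(sorted(tfdataset.element_spec))  # ['attention_mask', 'input_ids', 'labels']
```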

This is what a single batch looks like:

next(tfdataset_test.batch(1).as_numpy_iterator())

{'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32),
 'input_ids': array([[     0,    159,  29065, 109003,     20, 226403,      6,  79794,
          30684,    441,  32272,      6, 108568,      6,      4,  14045,
          13933,   5881, 111166, 112583,   9127,      6, 202001, 145688,
              6,      5,      2,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0]], dtype=int32),
 'labels': array([[-100,    0,    0,    0,    0,    5,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    1,    1,    0,    0,    0,    0,    0,
            0,    0,    0,    0, -100,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0]], dtype=int32)}

As far as I can tell, this is exactly the format described in the docs.
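The -100 values in the labels row are the usual ignore-index convention for positions the loss should skip (special tokens, and often continuation subwords and padding as well). A pure-Python sketch of the word-to-subword label alignment that typically produces such a row (the helper name is mine, not from the transformers API):

```python
def align_labels_with_tokens(word_ids, word_labels):
    """Map word-level NER labels onto subword tokens.

    word_ids: one entry per subword token, as returned by a fast
    tokenizer's .word_ids() -- None for special tokens and padding.
    word_labels: one label id per original word.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)              # special tokens: ignored by the loss
        elif wid != previous:
            aligned.append(word_labels[wid])  # first subtoken keeps the word label
        else:
            aligned.append(-100)              # continuation subtokens are masked out
        previous = wid
    return aligned

# Toy example: 3 words, the second of which splits into two subtokens,
# wrapped in <s> ... </s> special tokens.
print(align_labels_with_tokens([None, 0, 1, 1, 2, None], [3, 0, 7]))
# [-100, 3, 0, -100, 7, -100]
```

Note that in the batch shown above, the padded positions carry label 0 rather than -100; if those positions are not excluded from the loss and metrics, they dominate both.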

Then, I train a model:

model = TFXLMRobertaForTokenClassification.from_pretrained('jplu/tf-xlm-roberta-large',num_labels=len(label_list))

model.fit(x=tfdataset_train,batch_size=1024, epochs=N_EPOCHS)

Considering that it is a pretrained model, I basically get an accuracy of only about 0.0024.
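For what it's worth, the snippet above shows no `compile()` call, and Keras does not apply the `batch_size` argument to a `tf.data.Dataset` (recent versions reject it); the dataset has to be batched itself. A minimal sketch of the compile-and-fit pattern, with a toy embedding model standing in for the transformer (all shapes, sizes, and data below are made up for illustration):

```python
import numpy as np
import tensorflow as tf

NUM_LABELS = 9   # CoNLL-2003 NER tag count
SEQ_LEN = 16
VOCAB = 100

# Dummy stand-ins for the tokenized dataset.
input_ids = np.random.randint(0, VOCAB, size=(32, SEQ_LEN)).astype(np.int32)
labels = np.random.randint(0, NUM_LABELS, size=(32, SEQ_LEN)).astype(np.int32)

dataset = tf.data.Dataset.from_tensor_slices((input_ids, labels))
# Batch the dataset explicitly instead of passing batch_size to fit():
dataset = dataset.batch(8)

# A toy per-token classifier in place of TFXLMRobertaForTokenClassification.
inputs = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32)
x = tf.keras.layers.Embedding(VOCAB, 32)(inputs)
logits = tf.keras.layers.Dense(NUM_LABELS)(x)   # (batch, SEQ_LEN, NUM_LABELS)
model = tf.keras.Model(inputs, logits)

# Without compile() there is no optimizer or loss attached to the model.
model.compile(
    optimizer=tf.keras.optimizers.Adam(3e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
history = model.fit(dataset, epochs=1, verbose=0)
```

Note also that labels of -100 would break `SparseCategoricalCrossentropy` directly; the transformers models handle that masking internally, which is one reason to lean on their built-in loss.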

I would appreciate it if someone could check the format of my dataset above, and perhaps point me to a token classification example in TensorFlow.

Tags: tensorflow, named-entity-recognition, transformer

Solution

