Training TFBertForSequenceClassification with custom X and Y data

Problem description

I am working on a text classification problem, for which I am trying to train my model on TFBertForSequenceClassification from the Huggingface-transformers library.

I followed the example given on their GitHub page, and I am able to run the sample code with the provided sample data using tensorflow_datasets.load('glue/mrpc'). However, I cannot find an example of how to load my own custom data and pass it to model.fit(train_dataset, epochs=2, steps_per_epoch=115, validation_data=valid_dataset, validation_steps=7).

How do I define my own X, tokenize my X, and prepare a train_dataset with my X and Y, where X is my input text and Y is the classification category of a given X?

Sample training dataframe:

    text    category_index
0   Assorted Print Joggers - Pack of 2 ,/ Gray Pri...   0
1   "Buckle" ( Matt ) for 35 mm Width Belt  0
2   (Gagam 07) Barcelona Football Jersey Home 17 1...   2
3   (Pack of 3 Pair) Flocklined Reusable Rubber Ha...   1
4   (Summer special Offer)Firststep new born baby ...   0

Tags: nlp, pytorch, tensorflow2.0, huggingface-transformers, bert-language-model

Solution


There really aren't many good examples of HuggingFace transformers with custom dataset files.

Let's first import the required libraries:

import numpy as np
import pandas as pd

import sklearn.model_selection as ms
import sklearn.preprocessing as p

import tensorflow as tf
import transformers as trfs

and define the required constants:

# Max length of the encoded sequence (including special tokens such as [CLS] and [SEP]):
MAX_SEQUENCE_LENGTH = 64 

# Standard BERT model with lowercase chars only:
PRETRAINED_MODEL_NAME = 'bert-base-uncased' 

# Batch size for fitting:
BATCH_SIZE = 16 

# Number of epochs:
EPOCHS = 5

Now it's time to read the dataset:

df = pd.read_csv('data.csv')

Then define the model for sequence classification on top of the pretrained BERT:

def create_model(max_sequence, model_name, num_labels):
    bert_model = trfs.TFBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
    
    # This is the input for the tokens themselves (words from the dataset after encoding):
    input_ids = tf.keras.layers.Input(shape=(max_sequence,), dtype=tf.int32, name='input_ids')

    # attention_mask is a binary mask that tells BERT which tokens to attend to and which to ignore.
    # The encoder pads sequences shorter than MAX_SEQUENCE_LENGTH with 0 tokens, and the
    # attention_mask marks which positions hold real tokens and which hold the 0 padding:
    attention_mask = tf.keras.layers.Input((max_sequence,), dtype=tf.int32, name='attention_mask')
    
    # Use previous inputs as BERT inputs:
    output = bert_model([input_ids, attention_mask])[0]

    # We can also add dropout as regularization technique:
    #output = tf.keras.layers.Dropout(rate=0.15)(output)

    # Provide number of classes to the final layer:
    output = tf.keras.layers.Dense(num_labels, activation='softmax')(output)

    # Final model:
    model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output)
    return model

Now we need to instantiate the model using the function defined above, and compile it:

model = create_model(MAX_SEQUENCE_LENGTH, PRETRAINED_MODEL_NAME, df.category_index.nunique())

opt = tf.keras.optimizers.Adam(learning_rate=3e-5)
# category_index holds integer class IDs, so use the sparse variant of the loss:
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
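
Since category_index holds plain integer class IDs, the sparse loss above fits them directly. If you prefer to keep 'categorical_crossentropy', one possible sketch (my assumption, not part of the original answer: one-hot encode the labels first with Keras' to_categorical) is:

# Alternative sketch: one-hot encode the integer labels so that the plain
# 'categorical_crossentropy' loss can be used instead of the sparse variant.
y_onehot = tf.keras.utils.to_categorical(df.category_index.values)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

In that case the one-hot array (y_onehot) is also what you would split and pass to model.fit later.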

Create a function for tokenization (converting text into tokens):

def batch_encode(X, tokenizer):
    return tokenizer.batch_encode_plus(
        list(X),  # make sure we pass a plain Python list of strings (X may be a numpy array)
        max_length=MAX_SEQUENCE_LENGTH, # set the length of the sequences
        add_special_tokens=True, # add [CLS] and [SEP] tokens
        return_attention_mask=True,
        return_token_type_ids=False, # not needed for this type of ML task
        pad_to_max_length=True, # add 0 pad tokens to the sequences shorter than max_length
        return_tensors='tf'
    )
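
A side note: in newer transformers releases pad_to_max_length is deprecated in favour of the padding and truncation arguments. A sketch of the equivalent call under that assumption (the helper name batch_encode_new is just illustrative):

def batch_encode_new(X, tokenizer):
    return tokenizer.batch_encode_plus(
        list(X),
        max_length=MAX_SEQUENCE_LENGTH,
        add_special_tokens=True,
        return_attention_mask=True,
        return_token_type_ids=False,
        padding='max_length',  # pad shorter sequences with 0 tokens up to max_length
        truncation=True,       # cut longer sequences down to max_length
        return_tensors='tf'
    )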

Load the tokenizer:

tokenizer = trfs.BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)
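
To get a feel for what the encoder produces, here is a quick check on a single string (the example sentence is just an illustration, any text works):

# Illustration only: encode one example string and inspect the output shapes.
sample = batch_encode(['Barcelona Football Jersey Home'], tokenizer)
print(sample['input_ids'].shape)       # (1, MAX_SEQUENCE_LENGTH)
print(sample['attention_mask'].shape)  # (1, MAX_SEQUENCE_LENGTH)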

Split the data into training and validation sets:

X_train, X_val, y_train, y_val = ms.train_test_split(df.text.values, df.category_index.values, test_size=0.2)
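
If your categories are imbalanced, you may want to keep the class proportions equal in both parts; a sketch of the same call with stratification (an optional assumption about your data, not required by the original answer):

X_train, X_val, y_train, y_val = ms.train_test_split(
    df.text.values,
    df.category_index.values,
    test_size=0.2,
    stratify=df.category_index.values  # keep the label distribution the same in both splits
)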

Encode our sets:

X_train = batch_encode(X_train, tokenizer)
X_val = batch_encode(X_val, tokenizer)

Finally, we can fit the model on the training set and validate it after each epoch with the validation set:

model.fit(
    x=[X_train['input_ids'], X_train['attention_mask']],  # inputs in the same order as the model's Input layers
    y=y_train,
    validation_data=([X_val['input_ids'], X_val['attention_mask']], y_val),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE
)
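
The question originally passes a train_dataset and valid_dataset to model.fit. If you prefer that form, here is a minimal sketch (assuming the same encoded X_train / X_val and integer labels from above) that wraps everything into tf.data.Dataset objects:

# The dict keys match the names of the model's Input layers ('input_ids' and
# 'attention_mask'), so Keras can route the tensors to the right inputs.
train_dataset = tf.data.Dataset.from_tensor_slices((
    {'input_ids': X_train['input_ids'], 'attention_mask': X_train['attention_mask']},
    y_train
)).shuffle(1000).batch(BATCH_SIZE)

valid_dataset = tf.data.Dataset.from_tensor_slices((
    {'input_ids': X_val['input_ids'], 'attention_mask': X_val['attention_mask']},
    y_val
)).batch(BATCH_SIZE)

model.fit(train_dataset, validation_data=valid_dataset, epochs=EPOCHS)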
