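nlp - Training TFBertForSequenceClassification with custom X and Y data
Question
I am working on a text classification problem, and I am trying to train my model on TFBertForSequenceClassification from the Huggingface transformers library.
I followed the example on their GitHub page and can run the sample code with the provided sample data, loaded via tensorflow_datasets.load('glue/mrpc'). However, I cannot find an example of how to load my own custom data and pass it to
model.fit(train_dataset, epochs=2, steps_per_epoch=115, validation_data=valid_dataset, validation_steps=7)
How do I define my own X, tokenize it, and prepare train_dataset from my X and Y? Here X is my input text and Y is the classification category of the given X.
Sample training dataframe: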
text category_index
0 Assorted Print Joggers - Pack of 2 ,/ Gray Pri... 0
1 "Buckle" ( Matt ) for 35 mm Width Belt 0
2 (Gagam 07) Barcelona Football Jersey Home 17 1... 2
3 (Pack of 3 Pair) Flocklined Reusable Rubber Ha... 1
4 (Summer special Offer)Firststep new born baby ... 0
Answer
There really are not many good examples of using HuggingFace transformers with custom dataset files.
Let's first import the required libraries:
import numpy as np
import pandas as pd
import sklearn.model_selection as ms
import sklearn.preprocessing as p
import tensorflow as tf
import transformers as trfs
And define the required constants:
# Max length of encoded string(including special tokens such as [CLS] and [SEP]):
MAX_SEQUENCE_LENGTH = 64
# Standard BERT model with lowercase chars only:
PRETRAINED_MODEL_NAME = 'bert-base-uncased'
# Batch size for fitting:
BATCH_SIZE = 16
# Number of epochs:
EPOCHS=5
Now it is time to read the dataset:
df = pd.read_csv('data.csv')
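The model below takes its number of classes from the count of distinct labels, so the label column is expected to hold integers in the range 0..n_classes-1, as category_index does in the question. If your own Y is still a column of category names, it has to be mapped to such indices first. A minimal sketch of that mapping in plain Python follows; the category names and the helper build_label_index are made up for illustration and are not part of the original answer:

```python
def build_label_index(labels):
    """Map each distinct label to an integer in 0..n_classes-1.

    Sorting the distinct labels keeps the mapping deterministic between runs.
    """
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


# Hypothetical raw labels; in practice these would come from a dataframe column.
raw_labels = ["shoes", "belts", "jerseys", "belts", "shoes"]

label_to_index = build_label_index(raw_labels)
category_index = [label_to_index[label] for label in raw_labels]
num_labels = len(label_to_index)

print(label_to_index)
print(category_index)
```

Keep label_to_index around so that predicted class indices can be translated back to category names later.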
Then define the model for sequence classification on top of the pretrained BERT:
def create_model(max_sequence, model_name, num_labels):
    bert_model = trfs.TFBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

    # Input for the token ids (the words of the dataset after encoding):
    input_ids = tf.keras.layers.Input(shape=(max_sequence,), dtype=tf.int32, name='input_ids')

    # attention_mask is a binary mask that tells BERT which tokens to attend to and which to ignore.
    # The encoder pads every sequence shorter than MAX_SEQUENCE_LENGTH with 0 tokens,
    # and attention_mask marks which positions hold real tokens and which hold 0 pad tokens:
    attention_mask = tf.keras.layers.Input((max_sequence,), dtype=tf.int32, name='attention_mask')

    # Use the previous inputs as BERT inputs; element [0] of the result holds the classification logits:
    output = bert_model([input_ids, attention_mask])[0]

    # We can also add dropout as a regularization technique:
    #output = tf.keras.layers.Dropout(rate=0.15)(output)

    # Provide the number of classes to the final layer:
    output = tf.keras.layers.Dense(num_labels, activation='softmax')(output)

    # Final model:
    model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output)
    return model
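To make the relation between input_ids and attention_mask concrete, here is a small plain-Python sketch of what padding produces for one sequence. The function pad_and_mask and the token ids are illustrative assumptions, not the tokenizer's actual implementation; the sketch only assumes that the pad id is 0, as the comments above state:

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Pad one encoded sequence to max_length and build its attention mask.

    The mask holds 1 for real tokens and 0 for pad positions.
    """
    ids = list(token_ids)
    if len(ids) > max_length:
        raise ValueError("sequence is longer than max_length; truncate it first")
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad
    return input_ids, attention_mask


# Illustrative ids for a short, already encoded sequence.
input_ids, attention_mask = pad_and_mask([101, 7592, 2088, 102], max_length=6)
print(input_ids)
print(attention_mask)
```

Both lists always have length max_length, which is why the two Input layers share the same shape.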
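Now we need to instantiate the model with the function defined above and compile it:
# The number of classes is the number of distinct values in the label column:
model = create_model(MAX_SEQUENCE_LENGTH, PRETRAINED_MODEL_NAME, df.category_index.nunique())
opt = tf.keras.optimizers.Adam(learning_rate=3e-5)
# category_index holds integer class ids, so use the sparse variant of the loss:
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])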
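A note on the loss: the labels in category_index are integer class ids, not one-hot vectors. In Keras, integer labels pair with 'sparse_categorical_crossentropy', while 'categorical_crossentropy' expects one-hot labels. Both compute the same quantity for a single example, as this plain-Python sketch shows; the probabilities are made-up values and the two functions are simplified stand-ins, not the Keras implementations:

```python
import math


def sparse_categorical_crossentropy(class_index, probs):
    """Loss for one example whose label is an integer class id."""
    return -math.log(probs[class_index])


def categorical_crossentropy(one_hot, probs):
    """Loss for one example whose label is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


probs = [0.7, 0.2, 0.1]  # hypothetical softmax output for three classes
sparse_loss = sparse_categorical_crossentropy(0, probs)
dense_loss = categorical_crossentropy([1.0, 0.0, 0.0], probs)
print(sparse_loss, dense_loss)
```

The sparse form avoids building an (n_samples, n_classes) one-hot array just to feed the loss.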
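Create a function for tokenization (converting text to tokens):
def batch_encode(X, tokenizer):
    return tokenizer.batch_encode_plus(
        list(X),  # pass a plain list of strings rather than a NumPy array
        max_length=MAX_SEQUENCE_LENGTH,  # set the length of the sequences
        add_special_tokens=True,  # add [CLS] and [SEP] tokens
        return_attention_mask=True,
        return_token_type_ids=False,  # not needed for this type of ML task
        padding='max_length',  # add 0 pad tokens to sequences shorter than max_length (replaces the deprecated pad_to_max_length=True)
        truncation=True,  # cut sequences longer than max_length
        return_tensors='tf'
    )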
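Because max_length counts the special tokens, only max_length - 2 positions are left for the text itself. The following plain-Python sketch illustrates that budget on word-level tokens; add_special_tokens is a hypothetical helper written for this illustration and only mimics the idea, not the tokenizer's real WordPiece logic:

```python
CLS, SEP = "[CLS]", "[SEP]"


def add_special_tokens(tokens, max_length):
    """Truncate tokens so that [CLS] + tokens + [SEP] fits into max_length."""
    budget = max_length - 2  # two positions are reserved for [CLS] and [SEP]
    if budget < 0:
        raise ValueError("max_length must be at least 2")
    return [CLS] + list(tokens[:budget]) + [SEP]


tokens = ["assorted", "print", "joggers", "pack", "of", "2"]
encoded = add_special_tokens(tokens, max_length=6)
print(encoded)
```

If many of your texts lose content this way, raising MAX_SEQUENCE_LENGTH trades memory and speed for less truncation.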
Load the tokenizer:
tokenizer = trfs.BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)
Split the data into training and validation parts:
X_train, X_val, y_train, y_val = ms.train_test_split(df.text.values, df.category_index.values, test_size=0.2)
Encode both sets:
X_train = batch_encode(X_train, tokenizer)
X_val = batch_encode(X_val, tokenizer)
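Finally, we can fit the model on the training set and validate it on the validation set after each epoch:
model.fit(
    x=dict(X_train),  # keys 'input_ids' and 'attention_mask' match the names of the Input layers
    y=y_train,
    validation_data=(dict(X_val), y_val),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE
)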
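The call in the question passed steps_per_epoch and validation_steps explicitly. When fit receives in-memory tensors together with batch_size, as it does here, Keras works out the number of batches itself, so those arguments are not needed. The count is simply the number of examples divided by the batch size, rounded up. A quick sketch of that arithmetic, assuming a hypothetical dataset of 1000 rows split 80/20:

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


BATCH_SIZE = 16
num_train, num_val = 800, 200  # hypothetical sizes after an 80/20 split of 1000 rows

train_steps = steps_per_epoch(num_train, BATCH_SIZE)
val_steps = steps_per_epoch(num_val, BATCH_SIZE)
print(train_steps, val_steps)
```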