python - Why is my tensorflow Roberta Model unable to train/finetune?
Problem description
We are trying to fine-tune a RoBERTa model on our own training data. The project is the same as SemEval-2020 task B: given a sentence that goes against common sense, choose the right reason out of 3 options. For the past two days we have been struggling with errors, mainly when trying to train/fine-tune the model. The code we use comes from https://huggingface.co/transformers/model_doc/roberta.html#robertamodel . Although we have altered this code in multiple ways, we cannot get training to start. Our main problem is the data we pass to the model. We first tried passing a numpy array or a pandas dataframe directly, but to no avail. Finally we tried a tf.data dataset, using the code below, which produces the error shown at the end.
I install the following packages:
# Modules
# pip install transformers   (run in a shell, or prefix with "!" in a notebook cell)
from transformers import RobertaConfig, RobertaModel
from transformers import RobertaTokenizer, RobertaForMultipleChoice
from transformers import AutoModel, AutoTokenizer
import tensorflow as tf
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
I import my train and test sets as csv files, after which the data is cleaned and concatenated.
# Read csv files
sample_data = pd.read_csv("sample.csv")
train_data = pd.read_csv("train_data.csv")
train_answers = pd.read_csv("train_answers.csv")
test_data = pd.read_csv("test_data.csv")
train_data['A_concat'] = train_data['FalseSent'] + '. ' + train_data['OptionA']
train_data['B_concat'] = train_data['FalseSent'] + '. ' + train_data['OptionB']
train_data['C_concat'] = train_data['FalseSent'] + '. ' + train_data['OptionC']
train_data.drop(['OptionA','OptionB','OptionC'],axis=1,inplace=True)
train_data.columns = ['id', 'FalseSent', 'OptionA', 'OptionB', 'OptionC']
I create tf.data datasets from the train and test csv files:
tf_train_data = tf.data.experimental.CsvDataset("train_data.csv", record_defaults='Tensor', header=True)
tf_test_data = tf.data.experimental.CsvDataset("test_data.csv", record_defaults='Tensor', header=True)
These data sets are then used to train the model through the following code:
from transformers import TFRobertaForMultipleChoice, TFTrainer, TFTrainingArguments
model = TFRobertaForMultipleChoice.from_pretrained("roberta-base")
training_args = TFTrainingArguments(
output_dir='./results',
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
warmup_steps=500,
weight_decay=0.01,
logging_dir='./logs',
)
trainer = TFTrainer(
model=model,
args=training_args,
train_dataset=tf_train_data,
eval_dataset=tf_test_data
)
trainer.train()
This results in the following error:
ValueError: The training dataset must have an asserted cardinality
If anybody has any advice or can point us in the right direction on how to train/finetune our model we would be very grateful!
Solution
You should feed RoBERTa sequences of integers (e.g. tf.int32) rather than tf.string. Tokenize your data first.
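To make "tokenize your data" concrete, here is a minimal sketch of the shape a multiple-choice model expects: for every example, one integer sequence per option, padded to a common length, i.e. input_ids of shape (batch, num_choices, seq_len) plus a matching attention mask. The helper names (encode_choices, toy_encode_pair), the example sentences, and the padding id of 1 are assumptions made for illustration; toy_encode_pair is only a stand-in so the snippet runs without downloading anything, and it does not produce real RoBERTa ids.

```python
from typing import Callable, List, Sequence, Tuple

PAD_ID = 1  # assumed <pad> id for roberta-base; prefer tokenizer.pad_token_id


def encode_choices(
    false_sents: Sequence[str],
    options: Sequence[Sequence[str]],
    encode_pair: Callable[[str, str], List[int]],
    max_len: int,
    pad_id: int = PAD_ID,
) -> Tuple[List[List[List[int]]], List[List[List[int]]]]:
    """Return (input_ids, attention_mask), each shaped [batch][choice][max_len]."""
    input_ids, attention_mask = [], []
    for sent, opts in zip(false_sents, options):
        ids_row, mask_row = [], []
        for opt in opts:
            # Naive truncation: a real tokenizer's truncation keeps the final </s>.
            ids = list(encode_pair(sent, opt))[:max_len]
            n_pad = max_len - len(ids)
            ids_row.append(ids + [pad_id] * n_pad)
            mask_row.append([1] * len(ids) + [0] * n_pad)
        input_ids.append(ids_row)
        attention_mask.append(mask_row)
    return input_ids, attention_mask


def toy_encode_pair(first: str, second: str) -> List[int]:
    """Hypothetical stand-in tokenizer: NOT the RoBERTa vocabulary."""
    words = first.split() + second.split()
    return [0] + [4 + sum(map(ord, w)) % 1000 for w in words] + [2]


# Illustrative data only, not taken from the actual csv files.
false_sents = ["He put an elephant into the fridge"]
options = [[
    "An elephant is much bigger than a fridge",
    "Elephants are usually gray",
    "A fridge is used to keep food cold",
]]

input_ids, attention_mask = encode_choices(
    false_sents, options, toy_encode_pair, max_len=20
)
print(len(input_ids), len(input_ids[0]), len(input_ids[0][0]))
```

With the real library you would presumably replace toy_encode_pair with something like lambda a, b: tokenizer.encode(a, b, truncation=True, max_length=max_len), where tokenizer = RobertaTokenizer.from_pretrained("roberta-base"), and then wrap the resulting arrays together with the labels in tf.data.Dataset.from_tensor_slices. A dataset built that way has a known size, which is what the "asserted cardinality" error is complaining about; a CsvDataset does not, unless you apply tf.data.experimental.assert_cardinality(n) to it.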