prediction - BERT for next sentence prediction
Problem description
I am trying to fine-tune a BERT model for next sentence prediction on my own dataset, but it does not work. Can anyone tell me what the structure of my dataset should be and how to fine-tune it with the Hugging Face Trainer()?
def train(bert_model,bert_tokenizer,path,eval_path=None):
    out_dir = "/content/drive/My Drive/next_sentence/"
    training_args = TrainingArguments(
        output_dir=out_dir,
        overwrite_output_dir=True,
        num_train_epochs=1,
        per_device_train_batch_size=30,
        save_steps=100,
        save_total_limit=5,
    )
    data_collator = DataCollatorForLanguageModeling(tokenizer=bert_tokenizer)
    trainer = Trainer(
        model=bert_model,
        args=training_args,
        data_collator=data_collator,
        train_dataset="c:/data.txt",
        tokenizer=BertTokenizer)
    trainer.train()
    trainer.save_model(out_dir)
import transformers
from torch.nn.functional import softmax
from transformers import BertTokenizer, BertTokenizerFast, BertForNextSentencePrediction,TextDatasetForNextSentencePrediction
import torch
from transformers import Trainer, TrainingArguments
from transformers.data.data_collator import DataCollatorForLanguageModeling
def main():
    bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    bert_model = BertForNextSentencePrediction.from_pretrained("bert-base-cased")
    train_data_set_path = "c:/data.txt"
    train(bert_model,BertTokenizer,train_data_set_path)
    #prepare_data_set(bert_tokenizer)

main()
Solution
You should create a TextDatasetForNextSentencePrediction dataset and pass it to the trainer, instead of passing the dataset path. So you should build the dataset with TextDatasetForNextSentencePrediction inside your train function, as shown below.
from transformers import TextDatasetForNextSentencePrediction
def train(bert_model, bert_tokenizer, path, eval_path=None):
    out_dir = "/content/drive/My Drive/next_sentence/"
    training_args = TrainingArguments(
        output_dir=out_dir,
        overwrite_output_dir=True,
        num_train_epochs=1,
        per_device_train_batch_size=30,
        save_steps=100,
        save_total_limit=5,
    )
    data_collator = DataCollatorForLanguageModeling(tokenizer=bert_tokenizer)
    # Build the dataset from the text file instead of passing a raw path
    train_dataset = TextDatasetForNextSentencePrediction(
        tokenizer=bert_tokenizer,
        file_path=path,
        block_size=256,
    )
    trainer = Trainer(
        model=bert_model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
        tokenizer=bert_tokenizer)  # pass the tokenizer instance, not the class
    trainer.train()
    trainer.save_model(out_dir)
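Regarding the dataset-structure part of the question: TextDatasetForNextSentencePrediction reads a plain text file with one sentence per line and a blank line between documents (the same format as the original BERT pretraining data). A minimal sketch of what data.txt could contain, using placeholder sentences:

The cat sat quietly on the mat.
After a while it fell asleep in the sun.

The quarterly report was due on Friday.
Nobody on the team had started writing it.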
Also, you should pass bert_tokenizer rather than BertTokenizer: the trainer and the dataset need a pretrained tokenizer instance, not the class itself. So your main function should look like this:
def main():
    bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    bert_model = BertForNextSentencePrediction.from_pretrained("bert-base-cased")
    train_data_set_path = "c:/data.txt"
    train(bert_model, bert_tokenizer, train_data_set_path)

main()
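After training finishes, you can sanity-check the saved model with a quick forward pass. The snippet below is only a sketch, not part of the original answer: it assumes a recent transformers version (model outputs expose .logits), reuses the out_dir from train(), and the sentence pair is a placeholder.

import torch
from torch.nn.functional import softmax
from transformers import BertTokenizer, BertForNextSentencePrediction

out_dir = "/content/drive/My Drive/next_sentence/"  # directory used by trainer.save_model
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForNextSentencePrediction.from_pretrained(out_dir)
model.eval()

sentence_a = "The weather was terrible."        # placeholder sentence pair
sentence_b = "So we decided to stay indoors."

# Encode the pair as one sequence: [CLS] A [SEP] B [SEP]
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Index 0 = "B follows A", index 1 = "B is a random sentence"
probs = softmax(logits, dim=1)
print(probs)

If probs[0][0] is clearly larger than probs[0][1], the fine-tuned model considers sentence_b a plausible continuation of sentence_a.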