How to train BERT from scratch on a new domain for both MLM and NSP?

Problem Description

Tags: deep-learning, nlp, bert-language-model, huggingface-transformers, transformer

Solution


You can easily train BERT from scratch on both the MLM and NSP tasks using a combination of BertForPreTraining, TextDatasetForNextSentencePrediction, DataCollatorForLanguageModeling, and Trainer.

I would not recommend training your model on MLM first and then on NSP, because that can lead to catastrophic forgetting: the model essentially forgets what it learned during the earlier training.

  1. Load the pre-trained tokenizer.
from transformers import BertTokenizer
bert_cased_tokenizer = BertTokenizer.from_pretrained("/path/to/pre-trained/tokenizer/for/new/domain", do_lower_case=False)
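If you do not already have a tokenizer for the new domain, you can train a cased WordPiece tokenizer on your domain corpus first and point the path above at its output. A minimal sketch using the tokenizers library (the corpus path and vocabulary size here are placeholders, not part of the original answer):

from tokenizers import BertWordPieceTokenizer

# Train a cased WordPiece vocabulary on the new-domain corpus
domain_tokenizer = BertWordPieceTokenizer(lowercase=False)
domain_tokenizer.train(
    files=["/path/to/domain/corpus.txt"],  # placeholder path
    vocab_size=30_522,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
# Writes vocab.txt, which BertTokenizer.from_pretrained can load
domain_tokenizer.save_model("/path/to/pre-trained/tokenizer/for/new/domain")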
  2. Initialize your model with BertForPreTraining.
from transformers import BertConfig, BertForPreTraining
config = BertConfig()
model = BertForPreTraining(config)
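Because you are training from scratch, the default BertConfig() gives you a BERT-base-sized model. The one thing worth double-checking is that vocab_size matches the tokenizer you just loaded; the other hyperparameters below are only shown as examples, not values from the original answer:

from transformers import BertConfig, BertForPreTraining

config = BertConfig(
    vocab_size=len(bert_cased_tokenizer),  # keep the embedding matrix in sync with the tokenizer
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
)
model = BertForPreTraining(config)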
  3. Create the dataset for the NSP task. TextDatasetForNextSentencePrediction will tokenize the sentences and create the labels for them. Your dataset should be in the following format (or you can modify the existing code):

(1) One sentence per line. Ideally, these should be actual sentences.
(2) A blank line between documents.

Sentence-1 From Document-1
Sentence-2 From Document-1
Sentence-3 From Document-1
...

Sentence-1 From Document-2
Sentence-2 From Document-2
Sentence-3 From Document-2
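If your corpus currently lives in memory as a list of documents, writing it out in that format is straightforward. A small sketch (the documents variable and the output path are assumptions for illustration):

# documents: list of documents, each a list of sentence strings (assumed)
documents = [
    ["Sentence-1 From Document-1", "Sentence-2 From Document-1"],
    ["Sentence-1 From Document-2", "Sentence-2 From Document-2"],
]

with open("/path/to/your/dataset", "w", encoding="utf-8") as f:
    for doc in documents:
        for sentence in doc:
            f.write(sentence + "\n")
        f.write("\n")  # blank line separates documents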
from transformers import TextDatasetForNextSentencePrediction
dataset = TextDatasetForNextSentencePrediction(
    tokenizer=bert_cased_tokenizer,
    file_path="/path/to/your/dataset",
    block_size=256,  # maximum length (in tokens) of each sentence-pair example
)
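As a sanity check, you can inspect a single example. In recent versions of transformers each item should be a dict of tensors that already includes the NSP label (the exact keys can vary between versions):

example = dataset[0]
print(example.keys())  # typically input_ids, token_type_ids, next_sentence_label
print(bert_cased_tokenizer.decode(example["input_ids"]))
print(example["next_sentence_label"])  # 0 = actual next sentence, 1 = random sentence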
  4. Use DataCollatorForLanguageModeling to mask and pass on the labels created by TextDatasetForNextSentencePrediction. DataCollatorForNextSentencePrediction has been removed, since it was doing the same thing as DataCollatorForLanguageModeling.
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_cased_tokenizer,
    mlm=True,
    mlm_probability=0.15,  # mask 15% of the tokens for the MLM objective
)
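If you want to see what the collator actually produces, you can collate a couple of examples by hand; unmasked positions get a label of -100 so they are ignored by the MLM loss (this check is not part of the original answer and the exact keys may vary by version):

batch = data_collator([dataset[0], dataset[1]])
# input_ids now contain [MASK] tokens; labels hold the original ids at the
# masked positions and -100 everywhere else
print(batch["input_ids"].shape, batch["labels"].shape)
print(batch["next_sentence_label"])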
  5. Train and save.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="/path/to/output/dir/for/training/arguments",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=16,  # replaces the deprecated per_gpu_train_batch_size
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
trainer.save_model("path/to/your/model")
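It is also worth saving the tokenizer next to the model so that everything can be reloaded from one directory; after that, the checkpoint can be loaded into a task-specific head for fine-tuning. A sketch (the sequence-classification head and num_labels are just an illustration):

from transformers import BertForSequenceClassification

# Keep the tokenizer alongside the weights saved above
bert_cased_tokenizer.save_pretrained("path/to/your/model")

# The pre-training heads are dropped; the classifier layer is freshly initialized
finetune_model = BertForSequenceClassification.from_pretrained(
    "path/to/your/model", num_labels=2
)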
