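python - Hugging Face Transformers model: KeyError: 'input_ids' message at the start of BERT model training
Problem description
Using the Hugging Face transformers library, I hit an error at the final step when fine-tuning a BERT language model for a masked language modeling task. I want to fine-tune it on a domain-specific financial corpus that the model has not been trained on. However, when I call the model to train, I get the following error message: KeyError: 'input_ids'. The steps I took and my code are below. Any insight is appreciated!
First, I created a Dataset object from a pandas DataFrame, which was itself created from a CSV file with one column and many rows of text: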
unlabelled_dataset = Dataset.from_pandas(unlabelled)
Second, I tokenized the dataset with the following code:
tokenizerBERT = BertTokenizerFast.from_pretrained('bert-base-uncased') #BERT model tokenization & check
tokenizerBERT(unlabelled_dataset['paragraphs'], padding=True, truncation=True)
tokenizerBERT.save_pretrained('tokenizers/pytorch/labelled/BERT/')
Third, I created a data collator as instructed:
data_collator_BERT = DataCollatorForLanguageModeling(tokenizer=tokenizerBERT, mlm=True, mlm_probability=0.15)
Next, I load my model with from_pretrained to get the benefit of transfer learning:
model_BERT = BertForMaskedLM.from_pretrained("bert-base-uncased")
Next, I pass my training arguments to the transformers Trainer and initialize it:
training_args_BERT = TrainingArguments(
    output_dir="./BERT",
    num_train_epochs=10,
    evaluation_strategy='steps',
    warmup_steps=10000,
    weight_decay=0.01,
    per_gpu_train_batch_size=64,
)
trainer_BERT = Trainer(
    model=model_BERT,
    args=training_args_BERT,
    data_collator=data_collator_BERT,
    train_dataset=unlabelled_dataset,
)
Finally, I call train on the trainer and get the error KeyError: 'input_ids':
trainer_BERT.train()
Any insight into how to debug this approach to training the model?
Here is the exact error message I received:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-9-83b7063dea0b> in <module>
----> 1 trainer_BERT.train()
2 trainer.save_model("./models/royalBERT")
~/anaconda3/lib/python3.7/site-packages/transformers/trainer.py in train(self, model_path, trial)
755 self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
756
--> 757 for step, inputs in enumerate(epoch_iterator):
758
759 # Skip past any already trained steps if resuming training
~/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
361
362 def __next__(self):
--> 363 data = self._next_data()
364 self._num_yielded += 1
365 if self._dataset_kind == _DatasetKind.Iterable and \
~/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
401 def _next_data(self):
402 index = self._next_index() # may raise StopIteration
--> 403 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
404 if self._pin_memory:
405 data = _utils.pin_memory.pin_memory(data)
~/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
45 else:
46 data = self.dataset[possibly_batched_index]
---> 47 return self.collate_fn(data)
~/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py in __call__(self, examples)
193 ) -> Dict[str, torch.Tensor]:
194 if isinstance(examples[0], (dict, BatchEncoding)):
--> 195 examples = [e["input_ids"] for e in examples]
196 batch = self._tensorize_batch(examples)
197 if self.mlm:
~/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py in <listcomp>(.0)
193 ) -> Dict[str, torch.Tensor]:
194 if isinstance(examples[0], (dict, BatchEncoding)):
--> 195 examples = [e["input_ids"] for e in examples]
196 batch = self._tensorize_batch(examples)
197 if self.mlm:
KeyError: 'input_ids'
Solution
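Reading the traceback from the bottom up, the collator indexes every example with "input_ids", so the KeyError means the rows reaching it have no such key. The following is a minimal, library-free sketch of that failure. The `collate` function is a hypothetical stand-in that only mimics line 195 of data_collator.py shown in the traceback, and the sample rows and token ids are made up for illustration.

```python
def collate(examples):
    """Mimic the lookup from the traceback: pull "input_ids" out of each example."""
    return [e["input_ids"] for e in examples]


# Rows as an untokenized dataset would yield them: only the raw text column.
raw_rows = [{"paragraphs": "Revenue rose in Q3."}, {"paragraphs": "Margins fell."}]

try:
    collate(raw_rows)
    error = None
except KeyError as exc:
    # The missing key is reported in the exception's arguments.
    error = exc.args[0]

# Rows after tokenization carry an "input_ids" entry (the ids here are invented).
tokenized_rows = [{"input_ids": [101, 2023, 102]}, {"input_ids": [101, 2008, 102]}]
batch = collate(tokenized_rows)
```

In the question, the tokenizer is called on the text but its output is never stored back into the dataset, so the Trainer still receives rows shaped like `raw_rows`.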
Although the tokenizer is passed to the DataCollator, I think we still have to tokenize the data ourselves, as follows:
train_dataset = tokenizer.encode(unlabeled_data, add_special_tokens=True, return_tensors="pt")
trainer_BERT = Trainer(
    model=model_BERT,
    args=training_args_BERT,
    data_collator=data_collator_BERT,
    train_dataset=train_dataset,
)