"ValueError: You have to specify either input_ids or inputs_embeds" when using Trainer

Problem description

"ValueError: You have to specify either input_ids or inputs_embeds"从一个看似简单的培训示例中得到:

Iteration:   0%|                                                                                                                                                             | 0/6694 [00:00<?, ?it/s]
Epoch:   0%|                                                                                                                                                                    | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train_masked_lm.py", line 33, in <module>
    trainer.train()
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/transformers/trainer.py", line 503, in train
    tr_loss += self._training_step(model, inputs, optimizer)
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/transformers/trainer.py", line 629, in _training_step
    outputs = model(**inputs)
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/transformers/modeling_electra.py", line 639, in forward
    return_tuple,
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/transformers/modeling_electra.py", line 349, in forward
    raise ValueError("You have to specify either input_ids or inputs_embeds")
ValueError: You have to specify either input_ids or inputs_embeds

My goal is to take a pretrained model and train it further on additional data. I'm new to Transformers, so I must be doing something wrong. Please help!

I adapted https://huggingface.co/blog/how-to-train as follows:

from transformers import (
  ElectraForMaskedLM,
  ElectraTokenizer,
  Trainer,
  TrainingArguments,
  LineByLineTextDataset
)

model = ElectraForMaskedLM.from_pretrained('google/electra-base-generator')
tokenizer = ElectraTokenizer.from_pretrained('google/electra-base-generator')

def to_dataset(input_file):
  return LineByLineTextDataset(file_path=input_file, tokenizer=tokenizer, block_size=128)


training_args = TrainingArguments(
  output_dir='./output',
  overwrite_output_dir=True,
  num_train_epochs=3,
  per_device_train_batch_size=64,
  per_device_eval_batch_size=64,
  save_steps=10000,
  warmup_steps=500,
  logging_dir='./logs',
)

trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=to_dataset('...../lines.txt'), # \n-separated lines of text (sentences)
)
trainer.train()

The error above is raised a few seconds after the script starts and is the first output.

Tags: huggingface-transformers, huggingface-tokenizers

Solution
