huggingface-transformers - "ValueError: You have to specify either input_ids or inputs_embeds" when using Trainer
Question
I get "ValueError: You have to specify either input_ids or inputs_embeds" from a seemingly simple training example:
Iteration: 0%| | 0/6694 [00:00<?, ?it/s]
Epoch: 0%| | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train_masked_lm.py", line 33, in <module>
    trainer.train()
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/transformers/trainer.py", line 503, in train
    tr_loss += self._training_step(model, inputs, optimizer)
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/transformers/trainer.py", line 629, in _training_step
    outputs = model(**inputs)
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/transformers/modeling_electra.py", line 639, in forward
    return_tuple,
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/transformers/modeling_electra.py", line 349, in forward
    raise ValueError("You have to specify either input_ids or inputs_embeds")
ValueError: You have to specify either input_ids or inputs_embeds
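My goal is to take a pretrained model and train it further on additional data. I am new to transformers and must be doing something wrong. Please help!
I adapted https://huggingface.co/blog/how-to-train as follows: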
from transformers import (
    ElectraForMaskedLM,
    ElectraTokenizer,
    Trainer,
    TrainingArguments,
    LineByLineTextDataset
)

model = ElectraForMaskedLM.from_pretrained('google/electra-base-generator')
tokenizer = ElectraTokenizer.from_pretrained('google/electra-base-generator')

def to_dataset(input_file):
    return LineByLineTextDataset(file_path=input_file, tokenizer=tokenizer, block_size=128)

training_args = TrainingArguments(
    output_dir='./output',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    save_steps=10000,
    warmup_steps=500,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=to_dataset('...../lines.txt'),  # \n-separated lines of text (sentences)
)

trainer.train()
The error above is triggered a few seconds after the script starts, and it is the first output.
Solution
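Judging from the traceback, model(**inputs) was called with a batch that contained neither input_ids nor inputs_embeds. The blog post the script was adapted from passes a data_collator to Trainer, and the script in the question leaves it out, so that is the most likely cause. The sketch below adds it back. It assumes the same transformers version as in the question; build_trainer and its parameter names are hypothetical, and mlm_probability=0.15 is the value used in the blog post, not something stated in the question.

```python
def build_trainer(train_file, model_name="google/electra-base-generator",
                  mlm_probability=0.15):
    """Build a Trainer for masked-LM fine-tuning (hypothetical helper)."""
    # Imported inside the function so that merely defining it does not
    # load transformers or download any weights.
    from transformers import (
        DataCollatorForLanguageModeling,
        ElectraForMaskedLM,
        ElectraTokenizer,
        LineByLineTextDataset,
        Trainer,
        TrainingArguments,
    )

    model = ElectraForMaskedLM.from_pretrained(model_name)
    tokenizer = ElectraTokenizer.from_pretrained(model_name)

    # One training example per line of the input file, as in the question.
    train_dataset = LineByLineTextDataset(
        tokenizer=tokenizer, file_path=train_file, block_size=128
    )

    # The piece missing from the question: the collator pads each batch,
    # masks tokens at random, and returns a dict that includes "input_ids"
    # and "labels", which Trainer passes to the model as keyword arguments.
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=mlm_probability
    )

    training_args = TrainingArguments(
        output_dir="./output",
        overwrite_output_dir=True,
        num_train_epochs=3,
        per_device_train_batch_size=64,
        save_steps=10000,
        warmup_steps=500,
        logging_dir="./logs",
    )

    return Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
    )
```

Calling build_trainer(path).train(), where path points at the line-separated text file, should then hand the model a batch with input_ids set. If the error persists, printing one collated batch is a quick way to check which keys actually reach the model.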