python - TypeError: zeros_like(): argument 'input' when fine-tuning on MLM
Problem description
Basic overview
I am fine-tuning a Longformer masked language model (a vanilla one, to be precise) that was pretrained on a custom dataset. I am using Huggingface's transformers library, but when fine-tuning my MLM on a supervised task I get this error:
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-175-83ded7c85bb3> in <module>()
45 )
46
---> 47 train_results = trainer.train()
4 frames
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
1118 tr_loss += self.training_step(model, inputs)
1119 else:
-> 1120 tr_loss += self.training_step(model, inputs)
1121 self._total_flos += float(self.floating_point_ops(inputs))
1122
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in training_step(self, model, inputs)
1522 loss = self.compute_loss(model, inputs)
1523 else:
-> 1524 loss = self.compute_loss(model, inputs)
1525
1526 if self.args.n_gpu > 1:
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
1554 else:
1555 labels = None
-> 1556 outputs = model(**inputs)
1557 # Save past state if it exists
1558 # TODO: this needs to be fixed and made cleaner later.
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
887 result = self._slow_forward(*input, **kwargs)
888 else:
--> 889 result = self.forward(*input, **kwargs)
890 for hook in itertools.chain(
891 _global_forward_hooks.values(),
/usr/local/lib/python3.7/dist-packages/transformers/models/longformer/modeling_longformer.py in forward(self, input_ids, attention_mask, global_attention_mask, head_mask, token_type_ids, position_ids, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
1841 if global_attention_mask is None:
1842 logger.info("Initializing global attention on CLS token...")
-> 1843 global_attention_mask = torch.zeros_like(input_ids)
1844 # global attention on cls token
1845 global_attention_mask[:, 0] = 1
TypeError: zeros_like(): argument 'input' (position 1) must be Tensor, not NoneType
Now, this most likely stems from the model's inputs: Huggingface has moved to handling datasets objects directly and has dropped the earlier BPE classes.
For my data, I dumped both my train and validation NumPy arrays into two separate files:

src{tgt
....{17

where { is the delimiter and src and tgt are the columns. This structure applies to both the train and val files. Importantly, src is a string and tgt is a number (a numeric label). This is a sequence classification task.
Next, I construct the datasets objects from the files using the CSV loading script:
from datasets import load_dataset
train_dataset = load_dataset('csv', data_files="HF_train.txt", delimiter='{')
val_dataset = load_dataset('csv', data_files="HF_val.txt", delimiter='{')
Once this step is done, I import the tokenizer from my pretrained language model with tokenizer = AutoTokenizer.from_pretrained('......'), with truncation & padding = True.
The suspect part of the code
Now it's time for tokenization. I use the .map() method to apply the tokenization function to my entire dataset. This is what my tokenization function looks like:
def tok(example):
    encodings = tokenizer(example['src'], truncation=True)
    return encodings
The reason I apply it only to src is that my labels are numeric, so there is nothing to tokenize there; only the 'X' values are long strings. This is how I apply the function to my dataset (using .map()):
train_encoded_dataset = train_dataset.map(tok, batched=True)
val_encoded_dataset = val_dataset.map(tok, batched=True)
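As a sanity check on what `.map(batched=True)` does: it hands the function a dict of columns and merges whatever columns the function returns back into the dataset, next to the existing ones. A stdlib-only sketch with a fake stand-in for the tokenizer (everything here is hypothetical, purely to illustrate the merge):

```python
def fake_tokenizer(texts):
    # hypothetical stand-in for the real tokenizer: one fake id per character
    return {
        "input_ids": [[ord(c) for c in t] for t in texts],
        "attention_mask": [[1] * len(t) for t in texts],
    }

def tok(example):
    return fake_tokenizer(example["src"])

# One "batch" as .map(batched=True) would pass it: a dict of columns.
batch = {"src": ["abc", "de"], "tgt": [13, 7]}
merged = {**batch, **tok(batch)}

# The original columns survive next to the new ones, which matches the
# feature list the encoded dataset reports in the question.
print(sorted(merged))  # ['attention_mask', 'input_ids', 'src', 'tgt']
```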
Most likely, this is the part where I messed up, since I don't really know how to use the datasets object. This is what my datasets object looks like; hopefully you can make more sense of its structure than I can:
>>> train_dataset
>>> DatasetDict({
train: Dataset({
features: ['src', 'tgt'],
num_rows: 4572
})
})
>>> train_dataset['train']
>>> Dataset({
features: ['src', 'tgt'],
num_rows: 4572
})
>>> train_dataset['train']['src']
>>> [..Giant list of all sequences.present..in.dataset --> **(untokenized)**]
>>> train_dataset['train'][0]
>>> {'src': 'Kasam.....',
     'tgt': 13}
Now, I explore the supposedly tokenized dataset (train_encoded_dataset):
>>> train_encoded_dataset
>>> DatasetDict({
train: Dataset({
features: ['attention_mask', 'input_ids', 'src', 'tgt'],
num_rows: 4572
})
})
>>> train_encoded_dataset['train']
>>> Dataset({
features: ['attention_mask', 'input_ids', 'src', 'tgt'],
num_rows: 4572
})
>>> print(train_encoded_dataset['train'][0])
>>> {'attention_mask': [1, 1, 1, 1, 1, 1, 1, ...long list of numbers...], 'src': 'Kasa....long string which is a document', 'tgt': 13} #tgt being the label
After this, I pass the dataset to the Trainer:
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted', zero_division=1)  # 'none' gives a score for each class
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
training_args = TrainingArguments(
    output_dir='/content/results/',   # output directory
    overwrite_output_dir=True,
    num_train_epochs=16,              # total number of training epochs
    per_device_train_batch_size=32,   # batch size per device during training
    per_device_eval_batch_size=32,    # batch size for evaluation
    warmup_steps=600,                 # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                # strength of weight decay
    logging_dir='/content/logs',      # directory for storing logs
    logging_steps=10,
    evaluation_strategy='epoch',
    learning_rate=1e-6,
    #fp16 = True,
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',
    greater_is_better=False,
    seed=101,
    save_total_limit=5,
)
model = AutoModelForSequenceClassification.from_pretrained(".....", num_labels=20)

trainer = Trainer(
    model=model,                          # the instantiated Transformers model to be trained
    args=training_args,                   # training arguments, defined above
    train_dataset=train_dataset['train'], # training dataset
    eval_dataset=val_dataset['train'],    # evaluation dataset
    compute_metrics=compute_metrics
)

train_results = trainer.train()
This results in the error posted above.
Now I am not sure where the problem lies (other than the tokenization). Can anyone point it out?
Update 1: Constructing the dataset with the from_dict method (to convert the NumPy arrays to datasets objects natively) produces the same error.
Update 2: Apparently, after some changes I get a new error. This is the new tok function:
def tok(example):
    encodings = tokenizer(example['src'], truncation=True, padding=True)
    return encodings
Padding and truncation are added back. With the proper Trainer arguments (passing the tokenized dataset rather than the untokenized one):
trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_encoded_dataset, # training dataset
    eval_dataset=val_encoded_dataset,    # evaluation dataset
    compute_metrics=compute_metrics
)
it produces this:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-78-6068ea33d5d4> in <module>()
45 )
46
---> 47 train_results = trainer.train()
4 frames
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
1099 self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
1100
-> 1101 for step, inputs in enumerate(epoch_iterator):
1102
1103 # Skip past any already trained steps if resuming training
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
515 if self._sampler_iter is None:
516 self._reset()
--> 517 data = self._next_data()
518 self._num_yielded += 1
519 if self._dataset_kind == _DatasetKind.Iterable and \
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
555 def _next_data(self):
556 index = self._next_index() # may raise StopIteration
--> 557 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
558 if self._pin_memory:
559 data = _utils.pin_memory.pin_memory(data)
/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
45 else:
46 data = self.dataset[possibly_batched_index]
---> 47 return self.collate_fn(data)
/usr/local/lib/python3.7/dist-packages/transformers/data/data_collator.py in default_data_collator(features)
78 batch[k] = torch.stack([f[k] for f in features])
79 else:
---> 80 batch[k] = torch.tensor([f[k] for f in features])
81
82 return batch
ValueError: expected sequence of length 2033 at dim 1 (got 2036)
This was unexpected. I will look into it some more.
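The length mismatch in default_data_collator is consistent with how padding=True behaves inside .map(batched=True): each map batch is padded only to that batch's own longest sequence, so rows from different batches end up with different widths and can no longer be stacked into one tensor. A stdlib-only sketch of the effect (the numbers are made up):

```python
def pad_batch(batch, pad_id=0):
    # pad every sequence in this batch to the batch's own max length,
    # which is what tokenizer(..., padding=True) does per call
    width = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (width - len(seq)) for seq in batch]

# Two different .map() batches, padded independently:
batch1 = pad_batch([[1, 2, 3], [9, 9]])       # both rows -> width 3
batch2 = pad_batch([[5], [7, 7, 7, 7, 7]])    # both rows -> width 5

rows = batch1 + batch2          # what the DataLoader later samples from
widths = {len(r) for r in rows}
print(widths)  # {3, 5}: torch.tensor(rows) would raise the ValueError above
```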
Solution
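The original answer text is not preserved here, so what follows is a sketch of the likely fix rather than the accepted solution. The first traceback (zeros_like on None) is consistent with the Trainer being handed the untokenized train_dataset['train'] / val_dataset['train'], so no input_ids ever reach the model; the second fits tokenizer(..., padding=True) inside .map(batched=True) padding each map batch only to its own longest sequence. A common way to wire this up is to truncate in the map step, strip the raw text column, rename the label column to labels (the name the Trainer's default loss path expects), and let DataCollatorWithPadding pad each DataLoader batch. Column names 'src'/'tgt' come from the question; the elided model paths stay elided, so this is illustrative, not runnable as-is:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
)

train_dataset = load_dataset('csv', data_files="HF_train.txt", delimiter='{')
val_dataset = load_dataset('csv', data_files="HF_val.txt", delimiter='{')

tokenizer = AutoTokenizer.from_pretrained('......')  # path elided as in the question

def tok(example):
    # truncate only; leave padding to the collator, which pads each
    # DataLoader batch to that batch's max length at training time
    return tokenizer(example['src'], truncation=True)

train_encoded_dataset = train_dataset.map(tok, batched=True)
val_encoded_dataset = val_dataset.map(tok, batched=True)

# Drop the raw text column and rename the label column to 'labels'.
train_encoded_dataset = train_encoded_dataset.remove_columns(['src']).rename_column('tgt', 'labels')
val_encoded_dataset = val_encoded_dataset.remove_columns(['src']).rename_column('tgt', 'labels')

model = AutoModelForSequenceClassification.from_pretrained(".....", num_labels=20)

trainer = Trainer(
    model=model,
    args=training_args,                            # as defined in the question
    train_dataset=train_encoded_dataset['train'],  # the *encoded* split, not train_dataset['train']
    eval_dataset=val_encoded_dataset['train'],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,               # as defined in the question
)
train_results = trainer.train()
```

Passing the encoded split addresses the NoneType error, and collator-side padding addresses the "expected sequence of length 2033 ... (got 2036)" mismatch.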