How to use map or a loop to decode a transformers dataset?

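Problem description

I loaded a dataset with a text column, and I want to translate the texts in it.

To speed this up, I tried using the Hugging Face datasets library.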

import torch
from datasets import load_dataset
from tqdm import tqdm
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

model_size = "base"
model_name = f"persiannlp/mt5-{model_size}-parsinlu-translation_en_fa"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)
# dfname: path to the CSV file (not shown here)
dataset = load_dataset('csv', data_files=dfname, split='train')

# tokenize one example at a time
dataset = dataset.map(lambda e: tokenizer(e['input_text'], padding='longest'))
dataset.set_format(type='torch', columns=['input_ids'])

# map for generating translation
#dataset = dataset.map(lambda e: {"trans":model.generate(e['input_ids'])})

dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
for batch in tqdm(dataloader):
    input_ids = batch["input_ids"]
    res = model.generate(input_ids)
    target = tokenizer.batch_decode(res, skip_special_tokens=True)

First, I tried calling model.generate inside another map (commented out in the code above), which gives this error:

File "/home/pouramini/miniconda3/lib/python3.7/site-packages/transformers/generation_utils.py", line 378, in _prepare_encoder_decoder_kwargs_for_generation
    model_kwargs["encoder_outputs"]: ModelOutput = encoder(input_ids, return_dict=True, **encoder_kwargs)
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py", line 877, in forward
    batch_size, seq_length = input_shape
ValueError: not enough values to unpack (expected 2, got 1)

Then I tried calling it in a loop, but iterating over the DataLoader fails with the following error:

Traceback (most recent call last):
  File "prepare_natural.py", line 146, in <module>
    for batch in tqdm(dataloader):
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/tqdm/std.py", line 1129, in __iter__
    for obj in iterable:
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 73, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 73, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [17] at entry 0 and [15] at entry 1


Tags: python, pytorch, huggingface-transformers

Solution
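
The first error ("expected 2, got 1") most likely occurs because a non-batched map hands model.generate a single example, so input_ids is a 1-D tensor and the model cannot unpack it into (batch_size, seq_length). One possible fix, sketched below, is to tokenize, generate and decode inside a single batched map. The helper name translate_batch and the default text_column are illustrative choices, not part of any library. The function should be applied to the raw dataset, before set_format hides the text column.

```python
def translate_batch(batch, tokenizer, model, text_column="input_text"):
    """Translate one batch of texts; intended for dataset.map(..., batched=True)."""
    # With batched=True, batch[text_column] is a list of strings, so padding=True
    # pads every sequence to the longest one in this batch and yields 2-D tensors.
    enc = tokenizer(batch[text_column], padding=True, return_tensors="pt")
    # Unpacking passes input_ids together with attention_mask to generate.
    out = model.generate(**enc)
    return {"trans": tokenizer.batch_decode(out, skip_special_tokens=True)}


# Usage with the objects from the question (not executed here):
# dataset = dataset.map(
#     translate_batch,
#     batched=True,
#     batch_size=32,
#     fn_kwargs={"tokenizer": tokenizer, "model": model},
# )
```

Because padding now happens per batch, every tensor passed to generate has the same length, and the translations end up in a new "trans" column.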

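
The second error ("stack expects each tensor to be equal size") appears to come from padding='longest' being applied to one example at a time, which pads nothing, so the default collate function receives sequences of different lengths. To keep the DataLoader loop, one option is a custom collate_fn that pads each batch. The sketch below is a minimal hand-written version; pad_batch and make_collate_fn are hypothetical helper names. The DataCollatorWithPadding class in transformers should serve the same purpose.

```python
def pad_batch(features, pad_token_id):
    """Pad every example's input_ids to the longest length in this batch."""
    # int() accepts both plain ints and 0-d torch tensors
    rows = [[int(t) for t in f["input_ids"]] for f in features]
    max_len = max(len(r) for r in rows)
    return {
        "input_ids": [r + [pad_token_id] * (max_len - len(r)) for r in rows],
        "attention_mask": [[1] * len(r) + [0] * (max_len - len(r)) for r in rows],
    }


def make_collate_fn(pad_token_id):
    """Build a collate_fn for torch.utils.data.DataLoader."""
    import torch  # imported lazily so pad_batch can be tried without PyTorch

    def collate(features):
        padded = pad_batch(features, pad_token_id)
        return {k: torch.tensor(v) for k, v in padded.items()}

    return collate


# Usage with the objects from the question (not executed here):
# dataloader = torch.utils.data.DataLoader(
#     dataset, batch_size=32, collate_fn=make_collate_fn(tokenizer.pad_token_id))
# for batch in tqdm(dataloader):
#     res = model.generate(batch["input_ids"], attention_mask=batch["attention_mask"])
#     target = tokenizer.batch_decode(res, skip_special_tokens=True)
```

Passing the attention mask along with input_ids lets the model ignore the padding positions during generation.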
