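python - How do I use map or a loop to decode a transformers dataset?
Problem description
I loaded a dataset that has a text column, and I want to translate every row.
To speed this up, I tried using the Hugging Face datasets library together with transformers.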
import torch
from datasets import load_dataset
from tqdm import tqdm
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

model_size = "base"
model_name = f"persiannlp/mt5-{model_size}-parsinlu-translation_en_fa"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)
# dfname is the path to the CSV file (defined elsewhere in the script)
dataset = load_dataset('csv', data_files=dfname, split='train')
# tokenizes one row at a time, so padding='longest' has nothing to pad against
dataset = dataset.map(lambda e: tokenizer(e['input_text'], padding='longest'))
dataset.set_format(type='torch', columns=['input_ids'])
# map for generating translation
#dataset = dataset.map(lambda e: {"trans":model.generate(e['input_ids'])})
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
for batch in tqdm(dataloader):
    input_ids = batch["input_ids"]
    res = model.generate(input_ids)
    target = tokenizer.batch_decode(res, skip_special_tokens=True)
First, I tried calling model.generate inside another map (the line that is commented out in the code), which gives this error:
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/transformers/generation_utils.py", line 378, in _prepare_encoder_decoder_kwargs_for_generation
    model_kwargs["encoder_outputs"]: ModelOutput = encoder(input_ids, return_dict=True, **encoder_kwargs)
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py", line 877, in forward
    batch_size, seq_length = input_shape
ValueError: not enough values to unpack (expected 2, got 1)
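A likely reading of this traceback: without batched=True, map passes one row at a time, so e['input_ids'] is a 1-D sequence of token ids, while the T5 encoder unpacks its input shape as (batch_size, seq_length). The sketch below reproduces that unpacking in plain Python. The helper names and the token ids are hypothetical; only the failing assignment is taken from the traceback.

```python
from typing import Sequence, Tuple


def unpack_input_shape(shape: Sequence[int]) -> Tuple[int, int]:
    # Same assignment as the failing line in modeling_t5.py.
    batch_size, seq_length = shape
    return batch_size, seq_length


def shape_of(ids) -> Tuple[int, ...]:
    """Shape of a flat or one-level nested list of token ids (hypothetical helper)."""
    if ids and isinstance(ids[0], (list, tuple)):
        return (len(ids), len(ids[0]))
    return (len(ids),)


# Made-up ids standing in for the single row that map() hands to the lambda.
single_example = [101, 7, 42, 1]

error_message = ""
try:
    unpack_input_shape(shape_of(single_example))  # 1-D shape: unpacking fails
except ValueError as err:
    error_message = str(err)

# Wrapping the row in a batch of size 1 gives the 2-D shape the encoder expects.
batch_size, seq_length = unpack_input_shape(shape_of([single_example]))
```

With tensors, the equivalent of the last step would be e['input_ids'].unsqueeze(0). Passing batched=True to map is another option, provided the rows in each batch are padded to a common length.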
Then I tried calling it in a loop, but iterating over the DataLoader raises the following error:
Traceback (most recent call last):
  File "prepare_natural.py", line 146, in <module>
    for batch in tqdm(dataloader):
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/tqdm/std.py", line 1129, in __iter__
    for obj in iterable:
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 73, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 73, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [17] at entry 0 and [15] at entry 1
Solution
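One possible fix, offered as a sketch rather than as the accepted answer: when the tokenizer is called on a single string, padding='longest' has nothing longer to pad to, so each row keeps its own length (17 and 15 in the RuntimeError) and default_collate cannot stack them. Padding can instead be done per batch in a custom collate function. The names pad_batch and collate_fn are hypothetical, and the default pad id of 0 is an assumption; in practice tokenizer.pad_token_id should be passed in.

```python
from typing import Dict, List, Sequence


def pad_batch(examples: Sequence[dict], pad_token_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad every "input_ids" row to the longest row in this batch."""
    # int() accepts plain ints as well as the 0-d tensors produced when
    # iterating over a 1-D torch tensor.
    rows = [[int(t) for t in ex["input_ids"]] for ex in examples]
    max_len = max(len(r) for r in rows)
    input_ids = [r + [pad_token_id] * (max_len - len(r)) for r in rows]
    attention_mask = [[1] * len(r) + [0] * (max_len - len(r)) for r in rows]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def collate_fn(examples, pad_token_id: int = 0):
    """Collate function for torch.utils.data.DataLoader."""
    import torch  # imported lazily so pad_batch stays usable without torch

    padded = pad_batch(examples, pad_token_id)
    return {k: torch.tensor(v, dtype=torch.long) for k, v in padded.items()}


# Two rows with the lengths from the RuntimeError (the ids are made up).
demo = pad_batch([
    {"input_ids": list(range(1, 18))},  # length 17
    {"input_ids": list(range(1, 16))},  # length 15
])
```

It would be wired in as DataLoader(dataset, batch_size=32, collate_fn=collate_fn), with the mask passed along as model.generate(batch["input_ids"], attention_mask=batch["attention_mask"]). transformers also ships DataCollatorWithPadding, which does the same padding through the tokenizer itself.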