python - Can BERT's output shape be fixed regardless of string size?
Problem description
I am confused about how to get huggingface BERT models to produce fixed-shape predictions regardless of the input size (i.e., the input string length). I tried calling the tokenizer with the arguments padding=True, truncation=True, max_length=15, but the prediction output dimensions for inputs = ["a", "a"*20, "a"*100, "abcede"*20000] are not fixed. What am I missing here?
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = ["a", "a"*20, "a"*100, "abcede"*20000]
for input in inputs:
    inputs = tokenizer(input, padding=True, truncation=True, max_length=15, return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape, input, len(input))
Output:
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
torch.Size([1, 3, 768]) a 1
torch.Size([1, 12, 768]) aaaaaaaaaaaaaaaaaaaa 20
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 100
torch.Size([1, 3, 768]) abcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcededeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeab....deabbcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcede 120000
Solution
When you call the tokenizer with a single sentence and padding=True, truncation=True, max_length=15, it pads the output sequence to the longest input sequence in the batch and truncates when required. Since you only provide a single sentence, the tokenizer cannot pad anything, because it is already the longest sequence in the batch.
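To see this concretely, here is a minimal sketch (reusing the bert-base-uncased tokenizer from the question) showing that padding=True has nothing to pad against for a single sentence:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A single sentence is already the longest sequence in its "batch" of one,
# so padding=True adds no padding tokens at all.
enc = tokenizer("a", padding=True, truncation=True, max_length=15)
print(enc["input_ids"])  # [101, 1037, 102] -- just [CLS], "a", [SEP]

This means you can achieve what you want in two ways: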
- Provide a batch:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = ["a", "a"*20, "a"*100, "abcede"*200]
inputs = tokenizer(inputs, padding=True, truncation=True, max_length = 15, return_tensors="pt")
print(inputs["input_ids"])
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
Output:
tensor([[ 101, 1037, 102, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0],
[ 101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
2050, 102, 0, 0, 0],
[ 101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
11057, 11057, 11057, 11057, 102],
[ 101, 100, 102, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0]])
torch.Size([4, 15, 768])
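Note that padding=True only pads to the longest sequence within the current batch, so a different batch can still come out with a different sequence length. If the shape must also match across batches, combine the batch call with padding="max_length", as the second approach below does. A short sketch of the difference (the two-sentence batch is just an illustration, assuming the same tokenizer setup):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# padding=True pads to the in-batch maximum: here 4 tokens, not 15.
batch = tokenizer(["a", "a a"], padding=True, truncation=True,
                  max_length=15, return_tensors="pt")
print(batch["input_ids"].shape)  # torch.Size([2, 4])

# padding="max_length" pins every batch to exactly max_length tokens.
batch = tokenizer(["a", "a a"], padding="max_length", truncation=True,
                  max_length=15, return_tensors="pt")
print(batch["input_ids"].shape)  # torch.Size([2, 15])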
- Set padding="max_length":
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = ["a", "a"*20, "a"*100, "abcede"*200]
for i in inputs:
    inputs = tokenizer(i, padding='max_length', truncation=True, max_length=15, return_tensors="pt")
    print(inputs["input_ids"])
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape, i, len(i))
Output:
tensor([[ 101, 1037, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0]])
torch.Size([1, 15, 768]) a 1
tensor([[ 101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
2050, 102, 0, 0, 0]])
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaa 20
tensor([[ 101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
11057, 11057, 11057, 11057, 102]])
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 100
tensor([[101, 100, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0]])
torch.Size([1, 15, 768]) abcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcede 1200
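As a final sanity check, a minimal sketch (assuming the same model and tokenizer as above) that asserts the output shape stays at [1, 15, 768] no matter how long the input string is:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

for text in ["a", "a" * 100, "abcede" * 20000]:
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=15, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        out = model(**enc)
    # Fixed shape regardless of input length: [batch, max_length, hidden]
    assert out.last_hidden_state.shape == (1, 15, 768)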