Iterating a Huggingface tokenizer with a remainder

Problem description

Transformer models have a maximum token limit. If I want to split my text into substrings that fit within that limit, what is the generally accepted way to do it?

Because of how special characters are handled, the tokenizer does not map its tokens onto something I can slice cleanly in a loop. Naively:

subst = " ".join(mytext.split(" ")[0:MAX_LEN])

would let me loop over the text in chunks with something like:

START = 0
substr = []
words = mytext.split(" ")
while START + MAX_LEN < len(words):
    # collect the next MAX_LEN whitespace-delimited words
    substr.append(" ".join(words[START:START + MAX_LEN]))
    START += MAX_LEN
    tokens = tokenizer(substr[-1])
# note: a trailing remainder shorter than MAX_LEN is dropped by this loop

However, the length of " ".join(mytext.split(" ")[0:MAX_LEN]) is not equal to the length that tokenizer(text) gives.

You can see the difference below:

>>> from transformers import LongformerTokenizer
>>> tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

>>> mytext = "This is a long sentence. " * 2000 # about 10k words, ~12k tokens

>>> len(mytext.split(" "))
10001

>>> encoded_input = tokenizer(mytext) 
Token indices sequence length is longer than the specified maximum sequence length for this model (12003 > 4096). Running this sequence through the model will result in indexing errors
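
So a chunk of MAX_LEN words can still exceed the model's token limit. A minimal check, reusing the tokenizer and mytext defined above (4096 is the limit quoted in the warning):

chunk = " ".join(mytext.split(" ")[:4096])
print(len(chunk.split(" ")))               # 4096 words
print(len(tokenizer(chunk)["input_ids"]))  # more than 4096 tokens for this text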

What tokenizer function parameter is the generally accepted way to iterate over longer documents, or, if no such parameter is available, what is the accepted procedure?

Tags: huggingface-tokenizers

Solution
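
One widely used approach (a sketch, assuming a fast tokenizer; slow tokenizers return overflow differently) is to let the tokenizer do the chunking itself via return_overflowing_tokens, so that the final chunk carries the remainder:

from transformers import AutoTokenizer

# AutoTokenizer loads the fast tokenizer for this checkpoint when available
tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

mytext = "This is a long sentence. " * 2000

# Tokenize once and let the tokenizer split the encoding into
# model-sized chunks; the last chunk holds the remainder.
encoded = tokenizer(
    mytext,
    truncation=True,
    max_length=4096,                 # the model's token limit
    return_overflowing_tokens=True,  # emit the overflow as extra chunks
    stride=0,                        # set > 0 for overlapping chunks
)

# With a fast tokenizer, encoded["input_ids"] is a list of chunks,
# each at most 4096 tokens long.
for ids in encoded["input_ids"]:
    print(len(ids))

If text substrings rather than token IDs are needed downstream, tokenizer.decode(ids, skip_special_tokens=True) recovers the text of each chunk.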

