huggingface-tokenizers - Iterating a Huggingface tokenizer with a remainder
Problem description
Transformer models have a maximum token limit. If I want to split my text into substrings that fit within that limit, what is the generally accepted way to do it?
Because of its handling of special characters, the tokenizer does not map its tokens onto something that lends itself to a simple loop. Naively:
subst = " ".join(mytext.split(" ")[0:MAX_LEN])
would let me loop over chunks with something like:
words = mytext.split(" ")
substr = []
START = 0
while START + MAX_LEN < len(words):
    # append instead of substr[i] = ..., which raises IndexError on an empty list
    substr.append(" ".join(words[START:START + MAX_LEN]))
    START = START + MAX_LEN
# note: any trailing remainder shorter than MAX_LEN is dropped by this loop
tokens = [tokenizer(chunk) for chunk in substr]
However, the length of " ".join(mytext.split(" ")[0:MAX_LEN]) is not the same as the length reported by tokenizer(text): a chunk of MAX_LEN whitespace-separated words can tokenize to more than MAX_LEN tokens.
You can see the difference below:
>>> from transformers import LongformerTokenizer
>>> tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
>>> mytext = "This is a long sentence. " * 2000 # about 10k words
>>> len(mytext.split(" "))
10001
>>> encoded_input = tokenizer(mytext)
Token indices sequence length is longer than the specified maximum sequence length for this model (12003 > 4096). Running this sequence through the model will result in indexing errors
What tokenizer function argument (or, if none exists, what procedure) is the generally accepted way to iterate over longer documents?
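For context, this is the kind of token-budget-aware loop I have in mind, sketched with a hypothetical count_tokens stand-in in place of len(tokenizer(chunk)["input_ids"]) so it runs without downloading a model. Unlike the word-count loop above, it keeps the trailing remainder as a final short chunk:

```python
MAX_LEN = 8  # token budget per chunk (4096 for longformer-base-4096)

def count_tokens(text):
    # Hypothetical stand-in for the real tokenizer's count: words containing
    # punctuation cost 2 tokens, mimicking subword splitting.
    return sum(2 if any(c in ".,!?" for c in w) else 1 for w in text.split())

def chunk_by_token_budget(text, max_len):
    """Greedily accumulate words until adding one more would exceed the budget."""
    chunks, current = [], []
    for word in text.split():
        candidate = current + [word]
        if count_tokens(" ".join(candidate)) > max_len and current:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))  # keep the remainder as a short chunk
    return chunks

text = "This is a long sentence. " * 4
chunks = chunk_by_token_budget(text, MAX_LEN)
for c in chunks:
    assert count_tokens(c) <= MAX_LEN  # every chunk fits the budget
```

This re-counts the whole candidate chunk on every word, which is quadratic; it is only a sketch of the loop shape, not an efficient implementation.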