Transformers: WordLevel tokenizer produces strange vocabulary

Problem description

After training a WordLevel tokenizer, I get a strange vocabulary. Below is my code:

data = [
    "Beautiful is better than ugly."
    "Explicit is better than implicit."
    "Simple is better than complex."
    "Complex is better than complicated."
    "Flat is better than nested."
    "Sparse is better than dense."
    "Readability counts."
]

from tokenizers.models import WordLevel
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, decoders, trainers

tokenizer = Tokenizer(models.WordLevel())

trainer = trainers.WordLevelTrainer(
    vocab_size=100000,
)

tokenizer.train_from_iterator(data, trainer=trainer)

tokenizer.get_vocab()

The output is as follows:

{'Beautiful is better than ugly.Explicit is better than implicit.Simple is better than complex.Complex is better than complicated.Flat is better than nested.Sparse is better than dense.Readability counts.': 0}

Please explain what I am doing wrong.

Tags: python, huggingface-transformers, huggingface-tokenizers

Solution


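Your data is not defined correctly: len(data) is 1. With no commas between them, Python joins adjacent string literals into one string, so the list holds a single long sentence. It needs commas, as shown below: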

data = [
    "Beautiful is better than ugly.",
    "Explicit is better than implicit.",
    "Simple is better than complex.",
    "Complex is better than complicated.",
    "Flat is better than nested.",
    "Sparse is better than dense.",
    "Readability counts.",
]
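The effect of the missing commas can be checked in plain Python, without the tokenizers library. The following is a minimal sketch; the names `without_commas` and `with_commas` are illustrative and do not appear in the question.

```python
# Adjacent string literals with no comma between them are merged
# into a single string when the source is compiled.
without_commas = [
    "Flat is better than nested."
    "Sparse is better than dense."
    "Readability counts."
]

# With commas, each literal is a separate list element.
with_commas = [
    "Flat is better than nested.",
    "Sparse is better than dense.",
    "Readability counts.",
]

print(len(without_commas))  # 1
print(len(with_commas))     # 3
print(without_commas[0])    # the three sentences run together, no spaces added
```

This matches the single vocabulary entry in the question, where all seven sentences appear joined with no space after each period.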

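In addition, you want to pass in a sequence of sequences. As the output above suggests, a whole string is treated as one token when it is passed in unsplit, so each sentence should be broken into words first. You can use map() to apply split() to each sentence, as shown below: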

tokenizer.train_from_iterator(map(lambda x: x.split(), data), trainer=trainer)
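To see what the trainer receives after this change, the mapped iterator can be inspected in plain Python. The sketch below does not call the tokenizers library. `build_vocab` is a hypothetical helper that only imitates a frequency-ordered word-level vocabulary for illustration; the IDs the real trainer assigns may differ.

```python
from collections import Counter

data = [
    "Beautiful is better than ugly.",
    "Explicit is better than implicit.",
    "Readability counts.",
]

# Each sentence becomes a list of whitespace-separated words.
split_data = list(map(lambda x: x.split(), data))
print(split_data[0])  # ['Beautiful', 'is', 'better', 'than', 'ugly.']


def build_vocab(sequences):
    """Hypothetical helper: give each distinct word an ID, most frequent first."""
    counts = Counter(word for seq in sequences for word in seq)
    # Sort by descending frequency, then alphabetically, for a deterministic order.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered)}


vocab = build_vocab(split_data)
print(vocab)
```

Note that str.split() splits on whitespace only, so punctuation stays attached to the preceding word ("ugly." rather than "ugly").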
