python - Transformers: WordLevel tokenizer produces a strange vocabulary
Question
When I train a WordLevel tokenizer, I get a strange vocabulary. Below is my code:
data = [
    "Beautiful is better than ugly."
    "Explicit is better than implicit."
    "Simple is better than complex."
    "Complex is better than complicated."
    "Flat is better than nested."
    "Sparse is better than dense."
    "Readability counts."
]
from tokenizers.models import WordLevel
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, decoders, trainers
tokenizer = Tokenizer(models.WordLevel())
trainer = trainers.WordLevelTrainer(
    vocab_size=100000,
)
tokenizer.train_from_iterator(data, trainer=trainer)
tokenizer.get_vocab()
The output is as follows:
{'Beautiful is better than ugly.Explicit is better than implicit.Simple is better than complex.Complex is better than complicated.Flat is better than nested.Sparse is better than dense.Readability counts.': 0}
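Please explain what I am doing wrong.
Answer
Your data is defined incorrectly: len(data) is 1. Without commas between them, Python concatenates adjacent string literals into a single string, so the list holds only one element. Add the commas, as shown below: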
data = [
    "Beautiful is better than ugly.",
    "Explicit is better than implicit.",
    "Simple is better than complex.",
    "Complex is better than complicated.",
    "Flat is better than nested.",
    "Sparse is better than dense.",
    "Readability counts.",
]
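To see why the missing commas matter, here is a minimal sketch in plain Python (no tokenizer involved). The names `broken` and `fixed` are made up for illustration; the lists are shortened versions of the data above.

```python
# Adjacent string literals with no comma between them are merged
# into one string by Python before the list is built.
broken = [
    "Flat is better than nested."
    "Sparse is better than dense."
]

# With commas, each sentence is its own list element.
fixed = [
    "Flat is better than nested.",
    "Sparse is better than dense.",
]

print(len(broken))  # 1
print(len(fixed))   # 2
print(broken[0])    # both sentences glued together, with no space in between
```

This matches the single long key in the vocabulary from the question: the trainer only ever saw one string.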
In addition, you want to pass a sequence of sequences. You can use map() to apply split() to each sentence, as follows:
tokenizer.train_from_iterator(map(lambda x: x.split(), data), trainer=trainer)
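To make it concrete what that map() call hands to the trainer, here is a small sketch in plain Python. The `Counter` is only a stand-in for illustration and is not how the trainer works internally; `batches` and `word_counts` are hypothetical names.

```python
from collections import Counter

data = [
    "Flat is better than nested.",
    "Sparse is better than dense.",
]

# Each sentence becomes a list of whitespace-separated words.
batches = list(map(lambda x: x.split(), data))
print(batches[0])  # ['Flat', 'is', 'better', 'than', 'nested.']

# Illustrative word count over all the split words.
word_counts = Counter(word for batch in batches for word in batch)
print(word_counts["better"])  # 2

# split() only breaks on whitespace, so punctuation stays attached.
print("nested." in word_counts, "nested" in word_counts)  # True False
```

Note that words keep their trailing punctuation with this approach. If that is not what you want, one alternative is to set a pre-tokenizer on the tokenizer (for example pre_tokenizers.Whitespace from the tokenizers library) and pass the sentences unsplit; whether that suits your data is something to check yourself.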