Huggingface Tokenizer Object Is Not Callable

Problem Description

I am writing deep-learning code that embeds text using a BERT-based model. Code that previously ran fine is now failing unexpectedly. Here is the snippet:

from transformers import DistilBertModel, DistilBertTokenizer

sentences = ["person in red riding a motorcycle", "lady cutting cheese with reversed knife"]
# Embed text using BERT model.
text_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', cache_dir="cache/")
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
print(text_tokenizer.tokenize(sentences[0]))
inputs = text_tokenizer(sentences, return_tensors="pt", padding=True)  # error comes here

The error is as follows:

['person', 'in', 'red', 'riding', 'a', 'motorcycle']
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 198, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/amitgh/PycharmProjects/682_image_caption_errors/model/model.py", line 92, in <module>
    load_data()
  File "/Users/amitgh/PycharmProjects/682_image_caption_errors/model/model.py", line 59, in load_data
    inputs = text_tokenizer(sentences, return_tensors="pt", padding=True)
TypeError: 'DistilBertTokenizer' object is not callable

As you can see, text_tokenizer.tokenize() works fine. I tried force-downloading the tokenizer and even changing the cache directory, but neither helped.
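The error itself just means the tokenizer object defines no __call__ method, which is why tokenize() still works while tokenizer(...) fails. A minimal self-contained sketch of the distinction (the class names and token ids below are hypothetical, not from the transformers library):

```python
class OldStyleTokenizer:
    """Sketch of a tokenizer with named methods only: no __call__ defined."""
    def encode_plus(self, text):
        # hypothetical stand-in for real tokenization
        return {"input_ids": [101, 999, 102]}


class NewStyleTokenizer(OldStyleTokenizer):
    """Sketch of a tokenizer where __call__ delegates to the encoding method."""
    def __call__(self, text):
        return self.encode_plus(text)


old, new = OldStyleTokenizer(), NewStyleTokenizer()
print(new("hello")["input_ids"])  # calling the instance works
try:
    old("hello")
except TypeError as err:
    print(err)  # 'OldStyleTokenizer' object is not callable
```

So a "not callable" error on an object that otherwise works usually points to a version of the class that predates the __call__ support, which is consistent with the version mismatch found in the solution below.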

The code runs fine on other machines (a friend's laptop), and it also ran fine on my machine for some time, until I tried installing torchvision and using the PIL library for the image part. Now it somehow always gives this error.

OS: macOS 11.6, using a Conda environment with python=3.9

Tags: huggingface-tokenizers

Solution


This turned out to be a fairly easy fix. At some point I had removed the transformers version pin from my environment.yml file and started using MV 2.x with python=3.9, which presumably does not allow calling the tokenizer directly. I added the pin back as transformers=4.11.2 and added the conda-forge channel to the yml file, after which I was able to get past this error.
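For reference, a minimal environment.yml with the version pinned and the conda-forge channel added might look like the fragment below (the environment name and the other dependencies are illustrative placeholders, not from the original file):

```yaml
name: caption-model  # hypothetical environment name
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - transformers=4.11.2  # pin the version so the tokenizer supports direct calls
```

Pinning the version in the yml file rather than installing ad hoc keeps the environment reproducible, which is what prevented this silent downgrade from recurring.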

