python - Python: BERT tokenizer fails to load
Problem description
I am working with the bert-base-multilingual-uncased model, but when I try to set up the TOKENIZER in my config class, it throws an OSError.
Model configuration
import transformers

class config:
    DEVICE = "cuda:0"
    MAX_LEN = 256
    TRAIN_BATCH_SIZE = 8
    VALID_BATCH_SIZE = 4
    EPOCHS = 1
    BERT_PATH = {"bert-base-multilingual-uncased": "workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased"}
    MODEL_PATH = "workspace/data/jigsaw-multilingual/model.bin"
    TOKENIZER = transformers.BertTokenizer.from_pretrained(
        BERT_PATH["bert-base-multilingual-uncased"],
        do_lower_case=True)
Error
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-33-83880b6b788e> in <module>
----> 1 class config:
2 # def __init__(self):
3
4 DEVICE = "cuda:0"
5 MAX_LEN = 256
<ipython-input-33-83880b6b788e> in config()
11 TOKENIZER = transformers.BertTokenizer.from_pretrained(
12 BERT_PATH["bert-base-multilingual-uncased"],
---> 13 do_lower_case=True)
/opt/conda/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, *inputs, **kwargs)
1138
1139 """
-> 1140 return cls._from_pretrained(*inputs, **kwargs)
1141
1142 @classmethod
/opt/conda/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1244 ", ".join(s3_models),
1245 pretrained_model_name_or_path,
-> 1246 list(cls.vocab_files_names.values()),
1247 )
1248 )
OSError: Model name 'workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased' was not
found in tokenizers model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking,
bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc,
bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1,
wietsedv/bert-base-dutch-cased).
We assumed 'workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such
vocabulary files at this path or url.
As I read the error, it says the vocab.txt file cannot be found at the given location, but the file does exist. These are the files available in the bert-base-multilingual-uncased folder:
config.json
pytorch_model.bin
vocab.txt
I am new to BERT, so I am not sure whether there is a different way to define the tokenizer.
Solution
I think this should work:
from transformers import BertTokenizer
TOKENIZER = BertTokenizer.from_pretrained('bert-base-multilingual-uncased', do_lower_case=True)
It will download the tokenizer from Hugging Face.
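If you need to keep loading from the local files instead of downloading, note that from_pretrained resolves a relative path like the one in the question against the current working directory, which is a common cause of this exact OSError even when the files exist. A small sanity check you can run before constructing the tokenizer (check_tokenizer_dir is a hypothetical helper, not part of transformers):

```python
import os

def check_tokenizer_dir(path):
    """Return the names of required vocabulary files missing from path."""
    required = ["vocab.txt"]  # BertTokenizer only needs the vocab file
    return [f for f in required if not os.path.isfile(os.path.join(path, f))]

# The relative path from the question; print where Python actually looks.
bert_dir = "workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased"
print("Resolved path:", os.path.abspath(bert_dir))
print("Missing files:", check_tokenizer_dir(bert_dir))
```

If the check reports vocab.txt as missing, pass an absolute path (or fix the working directory) and the local load should succeed.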