Python: BERT Tokenizer Fails to Load

Problem description

I am working with the bert-base-multilingual-uncased model, but when I try to set up its TOKENIZER in my config class, it throws an OSError.

Model configuration

import transformers

class config:
    DEVICE = "cuda:0"
    MAX_LEN = 256
    TRAIN_BATCH_SIZE = 8
    VALID_BATCH_SIZE = 4
    EPOCHS = 1

    BERT_PATH = {"bert-base-multilingual-uncased": "workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased"}
    MODEL_PATH = "workspace/data/jigsaw-multilingual/model.bin"

    TOKENIZER = transformers.BertTokenizer.from_pretrained(
            BERT_PATH["bert-base-multilingual-uncased"], 
            do_lower_case=True)

Error

    ---------------------------------------------------------------------------
    OSError                                   Traceback (most recent call last)
    <ipython-input-33-83880b6b788e> in <module>
    ----> 1 class config:
          2 #     def __init__(self):
          3 
          4         DEVICE = "cuda:0"
          5         MAX_LEN = 256
    
    <ipython-input-33-83880b6b788e> in config()
         11         TOKENIZER = transformers.BertTokenizer.from_pretrained(
         12             BERT_PATH["bert-base-multilingual-uncased"],
    ---> 13             do_lower_case=True)
    
    /opt/conda/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, *inputs, **kwargs)
       1138 
       1139         """
    -> 1140         return cls._from_pretrained(*inputs, **kwargs)
       1141 
       1142     @classmethod
    
    /opt/conda/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
       1244                     ", ".join(s3_models),
       1245                     pretrained_model_name_or_path,
    -> 1246                     list(cls.vocab_files_names.values()),
       1247                 )
       1248             )
    
    OSError: Model name 'workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased' was not
    found in tokenizers model name list (bert-base-uncased, bert-large-uncased, bert-base-cased,
    bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese,
    bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking,
    bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad,
    bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased,
    TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1,
    wietsedv/bert-base-dutch-cased).

    We assumed 'workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased' was a path, a model
    identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such
    vocabulary files at this path or url.

As I read the error, it says the vocab.txt file cannot be found at the given location, but it does in fact exist there.
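Since a relative path like this is resolved against the current working directory, a quick sanity check can confirm whether vocab.txt is actually visible from where the notebook runs. This is a minimal sketch; the `has_vocab` helper is my own, not part of transformers:

```python
import os

def has_vocab(model_dir):
    # BertTokenizer.from_pretrained expects a vocab.txt in the directory.
    return os.path.isfile(os.path.join(model_dir, "vocab.txt"))

model_dir = "workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased"

# Print the absolute path the relative path resolves to; a mismatch with
# the real location of vocab.txt is a common cause of this OSError.
print(os.path.abspath(os.path.join(model_dir, "vocab.txt")))
print(has_vocab(model_dir))
```

If this prints False, the fix is usually to use an absolute path or to change the working directory, not to change the tokenizer call.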

Here are the files available in the bert-base-multilingual-uncased folder:

I am new to BERT, so I am not sure whether there is a different way to define the tokenizer.

Tags: python, nlp, pytorch, bert-language-model, huggingface-transformers

Solution


I think this should work:

from transformers import BertTokenizer
TOKENIZER = BertTokenizer.from_pretrained('bert-base-multilingual-uncased', do_lower_case=True)

It will download the tokenizer from Hugging Face instead of looking for a local path.
