How to train a model in deeppavlov (NER), Python 3

Question

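First of all, I apologize for any newbie mistakes I have made. I could not figure this out, and I could not find resources dedicated to the deeppavlov (NER) library. I am trying to train ner_ontonotes_bert_mult as described here. I assume it can be trained from its checkpoint so that it recognizes some specific patterns, such as: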

"Round 23/22; 24,9 x 12,2 x 12,3"

as

[[['Round', '23/22', ';', '24,9 x 12,2 x 12,3']], [['B-PRODUCT', 'I-PRODUCT', 'B-QUANTITY']]]

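My questions are (before I dive deeper into the details):

  1. Is it possible? I realized that I cannot use samples like "Round 23/22; 24,9 x 12,2 x 12,3". I need them to be in full sentences.
  2. Where can I find more information that is specifically about deeppavlov models?
  3. How can I train a pretrained deeppavlov model to recognize my custom patterns?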

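I did not even understand whether this is possible, but I decided to try it anyway and prepared 3 .txt files, "train.txt", "test.txt" and "validation.txt", as described on the deeppavlov web page. I put them under the '~/.deeppavlov/downloads/ontonotes/ner_ontonotes_bert_mult' folder. My dataset looks like this: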

Round B-PRODUCT
23/22 I-PRODUCT
24,9 x 12,2 x 12,3 B-QUANTITY
Ring B-PRODUCT
HDFAA I-PRODUCT
12,7 x 10 B-QUANTITY

and so on... This is the code I am using to try to train it:

import os
# Force tensorflow to use CPU instead of GPU.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
from deeppavlov import configs, train_model
from deeppavlov.core.commands.utils import parse_config

config_dict = parse_config(configs.ner.ner_ontonotes_bert_mult)

print(config_dict['dataset_reader']['data_path'])

from deeppavlov import configs, train_model

ner_model = train_model(configs.ner.ner_ontonotes_bert_mult)

But I get this error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [3] rhs shape= [37]
     [[{{node save/Assign_280}}]]

Full traceback:

2019-09-26 15:50:27.63 ERROR in 'deeppavlov.core.common.params'['params'] at line 110: Exception in <class 'deeppavlov.models.bert.bert_ner.BertNerModel'>
Traceback (most recent call last):
  File "/home/custom_user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/home/custom_user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/custom_user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [3] rhs shape= [37]
     [[{{node save/Assign_280}}]]

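UPDATE 2:

I realized that I cannot use samples like "Round 23/22; 24,9 x 12,2 x 12,3". I need them to be in full sentences.

UPDATE:

It seems this is happening because of my dataset. My custom dataset has only 3 tags (B-PRODUCT, I-PRODUCT and B-QUANTITY), but the pretrained model has 37. All available tags can be found here, under the sentence "The list of available tags and their descriptions are presented below.": 18 main tags (36 tags with BIO markup) plus the "O" tag ("O" means no entity is present). All 37 tags need to be present in the dataset. I was able to get past that error by adding dummy sentences tagged with the missing tags. This is a terrible workaround, since I am willingly crippling my own dataset. I am still looking for a "logical" way to train it...

PS: Now I am getting this error.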

Traceback (most recent call last):
  File "/home/custom_user/.PyCharm2019.2/config/scratches/scratch_9.py", line 13, in <module>
    ner_model = train_model(configs.ner.ner_ontonotes_bert_mult)
  File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/__init__.py", line 31, in train_model
    train_evaluate_model_from_config(config, download=download, recursive=recursive)
  File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/core/commands/train.py", line 121, in train_evaluate_model_from_config
    trainer.train(iterator)
  File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 294, in train
    self.train_on_batches(iterator)
  File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 234, in train_on_batches
    self._validate(iterator)
  File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 150, in _validate
    metrics = list(report['metrics'].items())
AttributeError: 'NoneType' object has no attribute 'items'

Tags: python-3.x, tensorflow, named-entity-recognition

Solution


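There are at least two problems here:
1. instead of validation.txt there should be a valid.txt file;
2. you are trying to retrain a model that was pretrained on a different dataset with a different set of tags, which is not necessary.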
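Both points can be checked before any training starts. The following is a minimal sketch, assuming the dataset files use the layout shown in the question (the tag is the last whitespace-separated field on each line). The helper names `missing_files` and `collect_tags` are hypothetical and not part of deeppavlov. A tag set much smaller than the 37 tags mentioned in the question is consistent with the `lhs shape= [3] rhs shape= [37]` mismatch when loading the pretrained checkpoint.

```python
import tempfile
from pathlib import Path

# File names the answer lists for a custom dataset directory.
EXPECTED_FILES = ("train.txt", "valid.txt", "test.txt")


def missing_files(data_dir):
    """List the expected dataset files that are absent from data_dir."""
    root = Path(data_dir).expanduser()
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def collect_tags(path):
    """Collect distinct tags, assuming the tag is the last field on each line."""
    tags = set()
    for line in Path(path).read_text(encoding="utf8").splitlines():
        fields = line.split()
        if len(fields) >= 2:  # skip blank lines that separate sentences
            tags.add(fields[-1])
    return tags


# Demo on a throwaway directory that mimics the file names from the question.
with tempfile.TemporaryDirectory() as tmp:
    sample = "Round B-PRODUCT\n23/22 I-PRODUCT\n\nRing B-PRODUCT\nHDFAA I-PRODUCT\n"
    Path(tmp, "train.txt").write_text(sample, encoding="utf8")
    Path(tmp, "validation.txt").write_text(sample, encoding="utf8")
    absent = missing_files(tmp)
    tags = collect_tags(Path(tmp, "train.txt"))

print(absent)        # files that still need to be created or renamed
print(sorted(tags))  # tag set the custom data actually contains
```

In the demo, validation.txt is present but valid.txt is not, so valid.txt is reported as missing along with test.txt.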

To train the model from scratch, you can do something like this:

import json
from deeppavlov import configs, build_model, train_model

with configs.ner.ner_ontonotes_bert_mult.open(encoding='utf8') as f:
    ner_config = json.load(f)

ner_config['dataset_reader']['data_path'] = '~/my_data_dir/'  # directory with train.txt, valid.txt and test.txt files
ner_config['metadata']['variables']['NER_PATH'] = '~/where_to_save_the_model/'
ner_config['metadata']['download'] = [ner_config['metadata']['download'][-1]]  # do not download the pretrained ontonotes model

ner_model = train_model(ner_config, download=True)
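The directory that `data_path` points to has to contain the three dataset files. As far as the deeppavlov NER documentation describes it, each line holds a token and its tag separated by whitespace, and sentences are separated by empty lines. The sketch below writes such a file under that assumption. `write_conll` is a hypothetical helper, not a deeppavlov function. It rejects tokens that contain whitespace, because a token with inner spaces would be ambiguous in a whitespace-separated layout.

```python
import tempfile
from pathlib import Path


def write_conll(path, sentences):
    """Write sentences as 'token tag' lines with a blank line after each sentence."""
    lines = []
    for sentence in sentences:
        for token, tag in sentence:
            if not token or any(ch.isspace() for ch in token):
                raise ValueError(f"token must be non-empty and contain no whitespace: {token!r}")
            lines.append(f"{token} {tag}")
        lines.append("")  # blank line closes the sentence
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf8")


# Two short sentences built from the samples in the question.
sentences = [
    [("Round", "B-PRODUCT"), ("23/22", "I-PRODUCT")],
    [("Ring", "B-PRODUCT"), ("HDFAA", "I-PRODUCT")],
]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "train.txt"
    write_conll(target, sentences)
    written = target.read_text(encoding="utf8")

print(written)
```

The same helper could be called once each for train.txt, valid.txt and test.txt in the directory passed as `data_path`.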



Another thing that can go wrong is tokenization: "Round 23/22; 24,9 x 12,2 x 12,3" will be split by the model into ['Round', '23', '/', '22', ';', '24', ',', '9', 'x', '12', ',', '2', 'x', '12', ',', '3'] and not into ['Round', '23/22', ';', '24,9 x 12,2 x 12,3']

But you can pre-tokenize your text yourself:

ner_model([['Round', '23/22', ';', '24,9 x 12,2 x 12,3']])
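One way to produce such a token list is a small regular-expression tokenizer. This is only a sketch: `pretokenize` is a hypothetical helper, and its patterns are assumptions derived from the two samples in the question (sizes like 23/22 and comma-decimal dimensions joined by "x"), so real data may need more rules.

```python
import re

# Alternatives are tried in order, so the most specific patterns come first.
TOKEN_RE = re.compile(
    r"""
    \d+(?:,\d+)?(?:\s*x\s*\d+(?:,\d+)?)+   # dimensions such as 24,9 x 12,2 x 12,3
    | \d+/\d+                              # sizes such as 23/22
    | \w+                                  # plain words and numbers
    | [^\w\s]                              # any single punctuation mark
    """,
    re.VERBOSE,
)


def pretokenize(text):
    """Split text into tokens, keeping sizes and dimensions as single tokens."""
    return TOKEN_RE.findall(text)


tokens = pretokenize("Round 23/22; 24,9 x 12,2 x 12,3")
print(tokens)
```

The result can then be passed to the model as a batch with one pre-tokenized sentence, for example `ner_model([pretokenize(text)])`. The training files would need to be tokenized the same way for the model to see consistent tokens.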
