SpaCy: error when saving a model to disk with a custom Sentencizer

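Problem description

I know that similar questions have been asked:

Spacy custom sentence splitting

Custom sentence boundary detection in SpaCy

But my situation is a bit different. I want to inherit from spaCy's Sentencizer():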

from spacy.pipeline import Sentencizer

class MySentencizer(Sentencizer):
    def __init__(self):
        self.tok = create_mySentencizer()  # my own helper that returns the sentences

    def __call__(self, *args, **kwargs):
        doc = args[0]
        for tok in doc:
            # set the sentence boundaries here via tok.is_sent_start
            pass
        return doc

The splitting itself works fine if I call doc = nlp("Text and so on. Another sentence.") after updating the model:

nlp = spacy.load("some_model")
sentencizer = MySentencizer()
nlp.add_pipe(sentencizer, before="parser")
# update the model

When I try to save the trained model:

nlp.to_disk("path/to/my/model")

I get the following error:

AttributeError: 'MySentencizer' object has no attribute 'punct_chars'

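By contrast, if I use nlp.add_pipe(nlp.create_pipe('sentencizer')), the error does not occur. I wonder at what point I should set the punct_chars attribute. Shouldn't it be inherited from the superclass?

If I replace Sentencizer with object as the base class, as in the first post, it works, but then I might lose some valuable information such as punct_chars?

Thanks in advance for your help.

Chris

Tags: python, oop, nlp, spacy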

Solution

The following should work (note the call to super(MySentencizer, self).__init__()):

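import spacy
from spacy.pipeline import Sentencizer

class MySentencizer(Sentencizer):
    def __init__(self):
        # Run the parent initializer so that attributes such as punct_chars exist.
        super(MySentencizer, self).__init__()

    def __call__(self, *args, **kwargs):
        doc = args[0]
        for tok in doc[1:]:
            # tok.orth is an integer hash, so compare the string form instead.
            # A token starts a sentence when the previous token is a period.
            tok.is_sent_start = doc[tok.i - 1].text == "."
        return doc

nlp = spacy.load("en_core_web_md")
sentencizer = MySentencizer()
# spaCy v2-style call: add_pipe receives the component object itself.
nlp.add_pipe(sentencizer, before="parser")

nlp.to_disk("model")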
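
As for why the super() call matters: Python does not run a parent class's __init__ automatically once a subclass defines its own, so any attribute the parent sets there is simply never created. Below is a minimal sketch of that behaviour in plain Python. The class names (BaseSplitter, BrokenSplitter, FixedSplitter) and the serialize() method are hypothetical stand-ins, not part of spaCy; the sketch only assumes that Sentencizer sets punct_chars in its __init__ and reads it back when saving, which is what the error message suggests.

```python
class BaseSplitter:
    """Hypothetical stand-in for a pipeline component such as Sentencizer."""

    default_punct_chars = (".", "!", "?")

    def __init__(self, punct_chars=None):
        # The instance attribute is created here, not at class level.
        self.punct_chars = set(punct_chars or self.default_punct_chars)

    def serialize(self):
        # Saving reads the attribute that __init__ was supposed to set.
        return {"punct_chars": sorted(self.punct_chars)}


class BrokenSplitter(BaseSplitter):
    def __init__(self):
        # The parent __init__ is never called, so punct_chars is never set.
        self.extra = "custom state"


class FixedSplitter(BaseSplitter):
    def __init__(self):
        super().__init__()  # runs BaseSplitter.__init__, which sets punct_chars
        self.extra = "custom state"


def try_serialize(component):
    """Return the serialized dict, or the AttributeError message on failure."""
    try:
        return component.serialize()
    except AttributeError as err:
        return str(err)


broken_result = try_serialize(BrokenSplitter())
fixed_result = try_serialize(FixedSplitter())
print(broken_result)
print(fixed_result)
```

The broken variant fails only at save time, not at construction, which matches the question: the pipeline runs fine until to_disk touches the missing attribute. Methods are inherited either way; only the per-instance state set up in __init__ is lost.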
