python - Adding a tagger to a blank English spaCy pipeline
Problem description
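I am having a hard time figuring out how to assemble spaCy pipelines piece by piece from the built-in models in spaCy v3. I have downloaded the en_core_web_sm model and can load it with nlp = spacy.load("en_core_web_sm"). Processing sample text this way works just fine.
What I want, though, is to build an English pipeline from blank and add components one at a time. I do not want to load the entire en_core_web_sm pipeline and exclude components. To be concrete, say I only want spaCy's default tagger in the pipeline. The documentation suggests to me that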
import spacy
from spacy.pipeline.tagger import DEFAULT_TAGGER_MODEL
config = {"model": DEFAULT_TAGGER_MODEL}
nlp = spacy.blank("en")
nlp.add_pipe("tagger", config=config)
nlp("This is some sample text.")
should work. However, I get the following error related to hashembed:
Traceback (most recent call last):
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/language.py", line 1000, in __call__
doc = proc(doc, **component_cfg.get(name, {}))
File "spacy/pipeline/trainable_pipe.pyx", line 56, in spacy.pipeline.trainable_pipe.TrainablePipe.__call__
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/util.py", line 1507, in raise_error
raise e
File "spacy/pipeline/trainable_pipe.pyx", line 52, in spacy.pipeline.trainable_pipe.TrainablePipe.__call__
File "spacy/pipeline/tagger.pyx", line 111, in spacy.pipeline.tagger.Tagger.predict
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 315, in predict
return self._func(self, X, is_train=False)[0]
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
return self._func(self, X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
return self._func(self, X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
return self._func(self, X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/with_array.py", line 30, in forward
return _ragged_forward(
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/with_array.py", line 90, in _ragged_forward
Y, get_dX = layer(Xr.dataXd, is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
return self._func(self, X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/concatenate.py", line 44, in forward
Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/concatenate.py", line 44, in <listcomp>
Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
return self._func(self, X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
return self._func(self, X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/hashembed.py", line 61, in forward
vectors = cast(Floats2d, model.get_param("E"))
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 216, in get_param
raise KeyError(
KeyError: "Parameter 'E' for model 'hashembed' has not been allocated yet."
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-2-8e2b4cf9fd33>", line 8, in <module>
nlp("This is some sample text.")
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/language.py", line 1003, in __call__
raise ValueError(Errors.E109.format(name=name)) from e
ValueError: [E109] Component 'tagger' could not be run. Did you forget to call `initialize()`?
This hints that I should run initialize(). Fine. If I then run nlp.initialize(), I end up with this error:
Traceback (most recent call last):
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-3-eeec225a68df>", line 1, in <module>
nlp.initialize()
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/language.py", line 1273, in initialize
proc.initialize(get_examples, nlp=self, **p_settings)
File "spacy/pipeline/tagger.pyx", line 271, in spacy.pipeline.tagger.Tagger.initialize
File "spacy/pipeline/pipe.pyx", line 104, in spacy.pipeline.pipe.Pipe._require_labels
ValueError: [E143] Labels for component 'tagger' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's `initialize` method.
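Now I am at a bit of a loss. Which label examples? Where do I get them from? Why doesn't the default model config take care of this? Do I have to tell spaCy to use en_core_web_sm somehow? If so, how can I do that without using spacy.load("en_core_web_sm") and excluding a whole bunch of components? Thanks for any hints!
EDIT: Ideally, I would like to load only parts of the pipeline from a modified config file, for example with nlp = English.from_config(config). I cannot even use the config file shipped with en_core_web_sm, because the resulting pipeline also needs to be initialized, and on calling nlp.initialize() I now get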
Traceback (most recent call last):
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-67-eeec225a68df>", line 1, in <module>
nlp.initialize()
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/language.py", line 1246, in initialize
I = registry.resolve(config["initialize"], schema=ConfigSchemaInit)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/config.py", line 727, in resolve
resolved, _ = cls._make(
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/config.py", line 776, in _make
filled, _, resolved = cls._fill(
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/config.py", line 848, in _fill
getter_result = getter(*args, **kwargs)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/language.py", line 98, in load_lookups_data
lookups = load_lookups(lang=lang, tables=tables)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/lookups.py", line 30, in load_lookups
raise ValueError(Errors.E955.format(table=", ".join(tables), lang=lang))
ValueError: [E955] Can't find table(s) lexeme_norm for language 'en' in spacy-lookups-data. Make sure you have the package installed or provide your own lookup tables if no default lookups are available for your language.
which hints that it did not find the required lookup tables.
Solution
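nlp.add_pipe("tagger") adds a new blank, uninitialized tagger, not the tagger from en_core_web_sm or any other pretrained pipeline. If you add the tagger this way, you need to initialize and train it before you can use it.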
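To make "initialize and train" concrete, here is a minimal sketch of what that could look like for a blank tagger. The two training sentences and the tag set (N, V, J) are invented for illustration and are far too small to produce a useful model; the point is only that initialize() gets its labels from the examples you pass in, which is what the E143 error is asking for.

```python
import spacy
from spacy.training import Example

# Toy data, invented for illustration only (not from any real corpus).
TRAIN_DATA = [
    ("I like green eggs", {"tags": ["N", "V", "J", "N"]}),
    ("Eat blue ham", {"tags": ["V", "J", "N"]}),
]

nlp = spacy.blank("en")
tagger = nlp.add_pipe("tagger")  # blank, untrained component

# Wrap each text and its gold tags in an Example object.
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]

# initialize() collects the tag labels from the examples and allocates
# the model weights (including the hashembed parameter 'E').
optimizer = nlp.initialize(get_examples=lambda: examples)

# A few update steps on the toy data; real training needs far more data.
for _ in range(30):
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)

doc = nlp("I like blue eggs")
print(tagger.labels)
print([(token.text, token.tag_) for token in doc])
```

After this, calling nlp(...) no longer raises E109, although the predictions are only as good as the toy data.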
You can add a component from an existing pipeline using the source option:
nlp = spacy.blank("en")
nlp.add_pipe("tagger", source=spacy.load("en_core_web_sm"))
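That said, the tokenization from spacy.blank("en") may differ from what the tagger in the source pipeline was trained on. In general (and especially once you move away from spaCy's pretrained pipelines), you should also make sure the tokenizer settings are the same, and loading the pipeline while excluding components is an easy way to do that.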
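Alternatively, in addition to using nlp.add_pipe(source=), you can copy the tokenizer settings. This matters for models like scispacy's en_core_sci_sm, which is a good example of a pipeline whose tokenization is not the same as that of spacy.blank("en"):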
nlp = spacy.blank("en")
source_nlp = spacy.load("en_core_sci_sm")
nlp.tokenizer.from_bytes(source_nlp.tokenizer.to_bytes())
nlp.add_pipe("tagger", source=source_nlp)