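python - Understanding spaCy's noun chunk parser
Question
I'm looking at spaCy's code for extracting noun chunks (reproduced below), but I don't understand the part under this comment:
Prevent nested chunks from being produced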
for i, word in enumerate(doclike):
    if word.pos not in (NOUN, PROPN, PRON):
        continue
    # Prevent nested chunks from being produced
    if word.left_edge.i <= prev_end:
        continue
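I understand that we are trying to avoid nested chunks, but could someone explain how the left_edge attribute achieves this? How does it keep track of the start and end indices of a noun chunk?
Thanks!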
# coding: utf8
from __future__ import unicode_literals

from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors


def noun_chunks(doclike):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """
    labels = [
        "nsubj",
        "dobj",
        "nsubjpass",
        "pcomp",
        "pobj",
        "dative",
        "appos",
        "attr",
        "ROOT",
    ]
    doc = doclike.doc  # Ensure works on both Doc and Span.
    if not doc.is_parsed:
        raise ValueError(Errors.E029)
    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add("conj")
    np_label = doc.vocab.strings.add("NP")
    prev_end = -1
    for i, word in enumerate(doclike):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
        if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
            prev_end = word.i
            yield word.left_edge.i, word.i + 1, np_label
        elif word.dep == conj:
            head = word.head
            while head.dep == conj and head.head.i < head.i:
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                prev_end = word.i
                yield word.left_edge.i, word.i + 1, np_label


SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
Answer
A valid noun chunk can be part of a larger noun chunk. For example:
>>> list(nlp("We went to the clean grocery store").noun_chunks)
[We, the clean grocery store]
>>> list(nlp("We went to clean grocery store").noun_chunks)
[We, clean grocery store]
>>> list(nlp("We went to grocery store").noun_chunks)
[We, grocery store]
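So the code you are asking about prevents list(nlp("We went to the clean grocery store").noun_chunks) from returning [We, the clean grocery store, clean grocery store, grocery store].
As for the bookkeeping: word.left_edge is the leftmost token of the word's syntactic subtree, so word.left_edge.i is where a chunk ending at that word would start. prev_end holds the index of the last token of the most recently yielded chunk. If a candidate's start is at or before prev_end, it would overlap a chunk that has already been produced, so it is skipped.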
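The overlap check can be tried in isolation, without spaCy. The following is a minimal sketch, not spaCy's implementation: Candidate and non_nested_spans are hypothetical names, the token indices are made up rather than taken from a real parse, and every candidate is assumed to have already passed the part-of-speech and dependency-label tests from the original function.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Candidate(NamedTuple):
    """Hypothetical stand-in for a spaCy token that could end a noun chunk."""
    i: int          # token index, like word.i
    left_edge: int  # index of the leftmost token in its subtree, like word.left_edge.i


def non_nested_spans(candidates: Iterable[Candidate]) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) spans, skipping any that overlap the previous one."""
    prev_end = -1  # index of the last token of the most recently yielded span
    for cand in candidates:
        # Same test as in noun_chunks: the span would start inside
        # (or before the end of) the span yielded last, so drop it.
        if cand.left_edge <= prev_end:
            continue
        prev_end = cand.i
        yield cand.left_edge, cand.i + 1


# Made-up indices: the subtree of token 3 reaches back to token 0,
# so its span (0, 4) would contain the earlier span (0, 2).
candidates = [
    Candidate(i=1, left_edge=0),
    Candidate(i=3, left_edge=0),
    Candidate(i=6, left_edge=5),
]
print(list(non_nested_spans(candidates)))  # [(0, 2), (5, 7)]
```

Because tokens are visited left to right, a single integer (prev_end) is enough to detect overlap: any later candidate whose left edge is not past prev_end must share tokens with the span just yielded.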