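python - spaCy rule-based Matcher with units of measure before or after the number
Problem description
I am new to spaCy and I am trying to match measurements in some text. My problem is that the unit of measure sometimes comes before the value and sometimes after it. In some other cases the unit has a different name. Here is some code: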
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
# case 1:
text = "the surface is 31 sq"
# case 2:
# text = "the surface is sq 31"
# case 3:
# text = "the surface is square meters 31"
# case 4:
# text = "the surface is 31 square meters"
# case 5:
# text = "the surface is about 31 square meters"
# case 6:
# text = "the surface is 31 kilograms"
pattern = [
{"IS_STOP": True},
{"LOWER": "surface"},
{"LEMMA": "be", "OP": "?"},
{"LOWER": "sq", "OP": "?"},
{"LOWER": "square", "OP": "?"},
{"LOWER": "meters", "OP": "?"},
{"IS_DIGIT": True},
{"LOWER": "square", "OP": "?"},
{"LOWER": "meters", "OP": "?"},
{"LOWER": "sq", "OP": "?"}
]
doc = nlp(text)
matcher = Matcher(nlp.vocab)
matcher.add("Surface", None, pattern)
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
I have two problems: 1 - the pattern should match all of cases 1 to 5, but for case 1 the output is
4898162435462687487 Surface 0 4 the surface is 31
4898162435462687487 Surface 0 5 the surface is 31 sq
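which looks like a duplicate match to me.
2 - case 6 should not match, but it does match my pattern. Any suggestions on how to improve this?
EDIT: Is it possible to build an OR condition into the pattern? Something like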
pattern = [
{"POS": "DET", "OP": "?"},
{"LOWER": "surface"},
{"LEMMA": "be", "OP": "?"},
[
[{"LOWER": "sq", "OP": "?"},
{"LOWER": "square", "OP": "?"},
{"LOWER": "meters", "OP": "?"},
{"IS_ALPHA": True, "OP": "?"},
{"LIKE_NUM": True}]
OR
[{"LIKE_NUM": True},
{"LOWER": "square", "OP": "?"},
{"LOWER": "meters", "OP": "?"},
{"LOWER": "sq", "OP": "?"} ]
]
]
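Solution
You cannot use an OR like that, but you can define several patterns for the same label. So you need two patterns: one that matches a number preceded by at least one of the words sq, square or meters (or a combination of them), and another that matches a number followed by at least one of these words.
Code snippet: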
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
texts = ["the surface is 31 sq", "the surface is sq 31", "the surface is square meters 31",
"the surface is 31 square meters", "the surface is about 31 square meters", "the surface is 31 kilograms"]
pattern1 = [
{"IS_STOP": True},
{"LOWER": "surface"},
{"LEMMA": "be", "OP": "?"},
{"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"},
{"LIKE_NUM": True}
]
pattern2 = [
{"IS_STOP": True},
{"LOWER": "surface"},
{"LEMMA": "be", "OP": "?"},
{"IS_ALPHA": True, "OP": "?"},
{"LIKE_NUM": True},
{"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"}
]
matcher = Matcher(nlp.vocab, validate=True)
# spaCy v2 signature; in spaCy v3 use matcher.add("Surface", [pattern1, pattern2])
matcher.add("Surface", None, pattern1)
matcher.add("Surface", None, pattern2)
for text in texts:
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # The matched span
        print(match_id, string_id, start, end, span.text)
Output:
4898162435462687487 Surface 0 5 the surface is 31 sq
4898162435462687487 Surface 0 5 the surface is sq 31
4898162435462687487 Surface 0 6 the surface is square meters 31
4898162435462687487 Surface 0 5 the surface is 31 square
4898162435462687487 Surface 0 6 the surface is about 31 square
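The duplicate match described in the question (both "the surface is 31" and "the surface is 31 sq") comes from the optional tokens: the matcher reports every span a pattern can cover. Below is a minimal sketch of one way to keep only the longest non-overlapping matches, written in plain Python over the (match_id, start, end) tuples the matcher returns. The helper name keep_longest is hypothetical and not part of spaCy; spaCy's own spacy.util.filter_spans applies a similar longest-first rule to Span objects.

```python
def keep_longest(matches):
    """Keep the longest matches, dropping any match that overlaps a kept one.

    `matches` is a list of (match_id, start, end) tuples, the shape
    returned by calling a spaCy Matcher on a Doc.
    """
    # Longest first; ties are broken by the earlier start.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept = []
    seen_tokens = set()
    for match_id, start, end in ordered:
        tokens = range(start, end)
        # Skip a match if any of its tokens is already covered.
        if seen_tokens.isdisjoint(tokens):
            kept.append((match_id, start, end))
            seen_tokens.update(tokens)
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])


# The two matches reported for case 1 in the question.
case_1 = [(4898162435462687487, 0, 4), (4898162435462687487, 0, 5)]
print(keep_longest(case_1))  # [(4898162435462687487, 0, 5)]
```

In the loop above, this would be applied as keep_longest(matcher(doc)) before printing.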
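The {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"} part matches one or more tokens that match the regex (because of "OP": "+"):
- ^ - start of the token
- (?i: - start of a case-insensitive modifier group:
- sq(?:uare)? - sq or square
- | - or
- m(?:et(?:er|re)s?)? - m, meter/metre or meters/metres
- ) - end of the group
- $ - end of string (here, the token).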
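To check which token texts the expression accepts without loading a spaCy model, it can be tried directly with Python's re module. This is only a sketch: the helper name is_unit_token and the word list are made up for illustration, and it assumes each word is a single token, as it would be in the patterns above.

```python
import re

# Same expression as in the patterns; there it is applied to the text
# of one token at a time.
UNIT_RE = re.compile(r"^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$")


def is_unit_token(token_text):
    """Return True if a single token's text is an accepted unit word."""
    return UNIT_RE.search(token_text) is not None


# Unit words in several spellings, plus two words that should be rejected.
for word in ["sq", "Square", "m", "meters", "metre", "kilograms", "squared"]:
    print(word, is_unit_token(word))
```

The ^ and $ anchors are what keep longer words such as "squared" from being accepted, and "kilograms" is rejected because it matches neither alternative, which is why case 6 no longer matches.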