python - Extracting noun phrases with Stanza and CoreNLPClient
Question
I am trying to extract noun phrases from sentences using Stanza (with Stanford CoreNLP). This can only be done through the CoreNLPClient module in Stanza.
# Import client module
from stanza.server import CoreNLPClient
# Construct a CoreNLPClient with some basic annotators, a memory allocation of 4GB, and port number 9001
client = CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner', 'parse'], memory='4G', endpoint='http://localhost:9001')
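Here is an example sentence. I am using the client's tregex function to get all the noun phrases. In Python, tregex returns a dict of dicts, so I need to process its output before passing it to NLTK's Tree.fromstring function in order to extract the noun phrases as strings.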
pattern = 'NP'
text = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
matches = client.tregex(text, pattern)
So I came up with the stanza_phrases method, which has to loop through the dict of dicts returned by tregex and format each match correctly for NLTK's Tree.fromstring.
def stanza_phrases(matches):
    Nps = []
    for match in matches:
        for items in matches['sentences']:
            for keys,values in items.items():
                s = '(ROOT\n'+ values['match']+')'
                Nps.extend(extract_phrase(s, pattern))
    return set(Nps)
Build a tree for NLTK to use:
from nltk.tree import Tree

def extract_phrase(tree_str, label):
    phrases = []
    trees = Tree.fromstring(tree_str)
    for tree in trees:
        for subtree in tree.subtrees():
            if subtree.label() == label:
                t = subtree
                t = ' '.join(t.leaves())
                phrases.append(t)
    return phrases
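For a single match, the core of extract_phrase can be sketched as below. This is only a minimal sketch, assuming NLTK is installed; the bracketed string is hand-written to resemble a constituency parse rather than actual server output, and np_strings is a hypothetical helper name.

```python
from nltk.tree import Tree

def np_strings(tree_str, label="NP"):
    """Return the text of every subtree carrying `label`, in preorder."""
    tree = Tree.fromstring(tree_str)
    # subtrees() accepts a filter function and walks the tree top-down.
    return [" ".join(st.leaves()) for st in tree.subtrees(lambda t: t.label() == label)]

# Hand-written bracketed parse (an assumption, not real CoreNLP output).
sample = "(ROOT (NP (NP (DT the) (NN theory)) (PP (IN of) (NP (NN relativity)))))"
print(np_strings(sample))
```

Nested NPs are reported as well as the enclosing one, which is why the output in the question contains both 'the theory' and 'the theory of relativity'.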
This is my output:
{'Albert Einstein', 'He', 'a German-born theoretical physicist', 'relativity', 'the theory', 'the theory of relativity'}
Is there a way to make this code more efficient, with fewer lines (especially the stanza_phrases and extract_phrase methods)?
Solution
from stanza.server import CoreNLPClient

# get noun phrases with tregex
def noun_phrases(_client, _text, _annotators=None):
    pattern = 'NP'
    matches = _client.tregex(_text,pattern,annotators=_annotators)
    print("\n".join(["\t"+sentence[match_id]['spanString'] for sentence in matches['sentences'] for match_id in sentence]))

# English example
with CoreNLPClient(timeout=30000, memory='16G') as client:
    englishText = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
    print('---')
    print(englishText)
    noun_phrases(client,englishText,_annotators="tokenize,ssplit,pos,lemma,parse")

# French example
with CoreNLPClient(properties='french', timeout=30000, memory='16G') as client:
    frenchText = "Je suis John."
    print('---')
    print(frenchText)
    noun_phrases(client,frenchText,_annotators="tokenize,ssplit,mwt,pos,lemma,parse")
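If a set of strings is wanted instead of printed output, as in the question, the same 'spanString' lookup can be wrapped in a small helper. The following is a minimal sketch that runs without a server: span_strings is a hypothetical name, and sample is a hand-written dict that only mimics the 'sentences' / 'match' / 'spanString' shape used in the code above, so real responses may carry additional keys.

```python
def span_strings(matches):
    """Collect the 'spanString' of every match in a tregex-style response."""
    return {
        match["spanString"]
        for sentence in matches["sentences"]   # one dict per sentence
        for match in sentence.values()         # one entry per match id
    }

# Hand-written sample (an assumption about the shape, not real server output).
sample = {
    "sentences": [
        {
            "0": {"match": "(NP (NNP Albert) (NNP Einstein))\n", "spanString": "Albert Einstein"},
        },
        {
            "0": {"match": "(NP (PRP He))\n", "spanString": "He"},
        },
    ]
}

print(sorted(span_strings(sample)))
```

With a running client, the result of client.tregex(text, 'NP') could be passed in place of sample, which removes the need for the NLTK round trip.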