python - Recursion in nltk's RegexpParser
Problem description
grammar = r"""
NP: {<DT|JJ|NN.*>+} # ...
"""
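I want to extend NP (noun phrase) to include multiple NPs joined by CC (coordinating conjunction: and) or , (comma), so that it captures noun phrases such as:
- The house and tree
- The apple, orange and mango
- Car, house, and plane
I can't get my modified grammar to capture these as a single NP: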
import nltk
grammar = r"""
NP: {<DT|JJ|NN.*>+(<CC|,>+<NP>)?}
"""
sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
The result is:
(S (NP The/DT house/NN) and/CC (NP tree/NN))
I tried moving the NP to the beginning: NP: {(<NP><CC|,>+)?<DT|JJ|NN.*>+}
but I get the same result:
(S (NP The/DT house/NN) and/CC (NP tree/NN))
Solution
Let's start small and capture NPs (noun phrases) correctly:
import nltk
grammar = r"""
NP: {<DT|JJ|NN.*>+}
"""
sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
[out]:
(S (NP The/DT house/NN) and/CC (NP tree/NN))
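A note on why the self-referencing rules in the question give this same output: as far as I understand RegexpParser, a rule is matched against the sequence of tags and chunks that exist when its stage runs, and no NP chunk exists yet while the NP rule itself is being applied, so the optional (<CC|,>+<NP>)? group never matches. The sketch below checks this with hand-tagged tokens (the tags are copied from the output above, which also avoids needing the tokenizer and tagger models); the names plain and self_ref are only illustrative.

```python
import nltk

# Hand-tagged tokens, copied from the output above (no tagger model needed).
tagged = [("The", "DT"), ("house", "NN"), ("and", "CC"), ("tree", "NN")]

# The plain NP rule from this answer.
plain = nltk.RegexpParser(r"""
NP: {<DT|JJ|NN.*>+}
""")

# The self-referencing NP rule from the question.
self_ref = nltk.RegexpParser(r"""
NP: {<DT|JJ|NN.*>+(<CC|,>+<NP>)?}
""")

plain_tree = plain.parse(tagged)
self_ref_tree = self_ref.parse(tagged)

# Prints True when the self-reference adds nothing over the plain rule.
print(plain_tree == self_ref_tree)
```

This is why the rest of the answer builds the conjunction in a separate, later rule instead of inside NP.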
Now let's try to capture the and/CC. Simply add a higher-level phrase that reuses the <NP> rule:
grammar = r"""
NP: {<DT|JJ|NN.*>+}
CNP: {<NP><CC><NP>}
"""
sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
[out]:
(S (CNP (NP The/DT house/NN) and/CC (NP tree/NN)))
Now that we capture NP CC NP phrases, let's see whether it also catches commas:
grammar = r"""
NP: {<DT|JJ|NN.*>+}
CNP: {<NP><CC|,><NP>}
"""
sentence = 'The house, the bear and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
Now we see that it only captures the first left-bounded NP CC|, NP and leaves the last NP on its own.
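To make that limitation concrete, here is a small sketch that runs the same two-rule grammar on hand-tagged tokens and lists what ends up at the top level of the tree. The tags are an assumption, copied from the tagger output shown further down in this answer, and top_level is just an illustrative name.

```python
import nltk

grammar = r"""
NP: {<DT|JJ|NN.*>+}
CNP: {<NP><CC|,><NP>}
"""

# Hand-tagged version of 'The house, the bear and tree' (assumed tags).
tagged = [
    ("The", "DT"), ("house", "NN"), (",", ","),
    ("the", "DT"), ("bear", "NN"), ("and", "CC"), ("tree", "NN"),
]

tree = nltk.RegexpParser(grammar).parse(tagged)

# Show chunk labels for subtrees and the raw (token, tag) pair for loose tokens.
top_level = [
    node.label() if isinstance(node, nltk.Tree) else node
    for node in tree
]
print(top_level)
```

Only the first NP , NP pair is wrapped in a CNP; the trailing and/CC and the final NP stay outside it.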
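Since we know that conjoined phrases in English have a conjunction on the left and an NP on the right, i.e. CC|, NP (for example, and the tree), and since this CC|, NP pattern repeats, we can use it as an intermediate representation.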
grammar = r"""
NP: {<DT|JJ|NN.*>+}
XNP: {<CC|,><NP>}
CNP: {<NP><XNP>+}
"""
sentence = 'The house, the bear and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
[out]:
(S
  (CNP
    (NP The/DT house/NN)
    (XNP ,/, (NP the/DT bear/NN))
    (XNP and/CC (NP tree/NN))))
Finally, the CNP (conjunctive NPs) grammar captures chained noun phrase conjunctions in English, even complex ones, e.g.
import nltk
grammar = r"""
NP: {<DT|JJ|NN.*>+}
XNP: {<CC|,><NP>}
CNP: {<NP><XNP>+}
"""
sentence = 'The house, the bear, the green house and a tree went to the park or the river.'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
[out]:
(S
  (CNP
    (NP The/DT house/NN)
    (XNP ,/, (NP the/DT bear/NN))
    (XNP ,/, (NP the/DT green/JJ house/NN))
    (XNP and/CC (NP a/DT tree/JJ)))
  went/VBD
  to/TO
  (CNP (NP the/DT park/NN) (XNP or/CC (NP the/DT river/NN)))
  ./.)
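If you would rather have something closer to real recursion, one alternative (a sketch under my own assumptions, not what the grammar above relies on) is to let CNP refer to itself and run the whole grammar more than once through the loop argument of nltk.RegexpParser. On each extra pass the CNP built earlier can serve as the left-hand side of a new CNP. The tokens are hand-tagged here, and looped_parser is only an illustrative name.

```python
import nltk

# CNP may now start with either a plain NP or a previously built CNP.
grammar = r"""
NP: {<DT|JJ|NN.*>+}
CNP: {<NP|CNP><CC|,><NP>}
"""

# Hand-tagged version of 'The house, the bear and tree' (assumed tags).
tagged = [
    ("The", "DT"), ("house", "NN"), (",", ","),
    ("the", "DT"), ("bear", "NN"), ("and", "CC"), ("tree", "NN"),
]

# loop=2 applies all stages twice, so the second pass can wrap the first CNP.
looped_parser = nltk.RegexpParser(grammar, loop=2)
tree = looped_parser.parse(tagged)
print(tree)
```

The trade-off is that the result is nested (a CNP inside a CNP) rather than flat, and the number of passes has to grow with the number of conjoined items, which is why the XNP grammar above is usually the simpler choice.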
If you are only interested in extracting the noun phrases, see How to Traverse an NLTK Tree object?:
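noun_phrases = []

def traverse_tree(tree):
    # Collect the text of every conjoined noun phrase (CNP) in the tree.
    if tree.label() == 'CNP':
        noun_phrases.append(' '.join([token for token, tag in tree.leaves()]))
    # Recurse into subtrees; leaves are plain (token, tag) tuples.
    for subtree in tree:
        if isinstance(subtree, nltk.tree.Tree):
            traverse_tree(subtree)
    return noun_phrases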
sentence = 'The house, the bear, the green house and a tree went to the park or the river.'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
traverse_tree(chunkParser.parse(tagged))
[out]:
['The house , the bear , the green house and a tree', 'the park or the river']
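A variation on the same idea, offered as a sketch: Tree.subtrees() accepts a filter function, so the phrases can be collected without a hand-written recursion and without a module-level list that keeps growing across calls. extract_phrases is a hypothetical helper name, and the tokens are hand-tagged so that no tagger model is needed.

```python
import nltk

def extract_phrases(tree, label="CNP"):
    """Return the text of every subtree that carries the given label."""
    return [
        " ".join(token for token, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == label)
    ]

grammar = r"""
NP: {<DT|JJ|NN.*>+}
XNP: {<CC|,><NP>}
CNP: {<NP><XNP>+}
"""

# Hand-tagged tokens (assumed tags) for a shorter variant of the sentence above.
tagged = [
    ("The", "DT"), ("house", "NN"), ("and", "CC"), ("tree", "NN"),
    ("went", "VBD"), ("to", "TO"),
    ("the", "DT"), ("park", "NN"), ("or", "CC"), ("the", "DT"), ("river", "NN"),
]

phrases = extract_phrases(nltk.RegexpParser(grammar).parse(tagged))
print(phrases)
```

Because the function builds and returns a fresh list, calling it on several sentences does not mix their results.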