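python - Separating NLTK subtrees based on label
Problem description
I have an NLTK parse tree, and I want to separate the leaves of the tree based only on the "S" label. Note that the "S" subtrees should not overlap in their leaves.
Given the sentence "He won the Gusher Marathon, finishing in 30 minutes."
The tree produced by CoreNLP is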
tree = '''(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))'''
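The idea is to extract the two "S" subtrees and their leaves without any overlap between them. So the expected output should be "He won the Gusher Marathon ,." and "finishing in 30 minutes."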
from nltk.tree import Tree

# Tree manipulation
# Extract phrases from a parsed (chunked) tree
# phrase = label of the phrase (subtree) to extract
# Returns: list of deep copies; recursive
def ExtractPhrases(myTree, phrase):
    myPhrases = []
    if myTree.label() == phrase:
        myPhrases.append(myTree.copy(True))
    for child in myTree:
        if isinstance(child, Tree):
            list_of_phrases = ExtractPhrases(child, phrase)
            if len(list_of_phrases) > 0:
                myPhrases.extend(list_of_phrases)
    return myPhrases

subtexts = set()
sep_tree = ExtractPhrases(Tree.fromstring(tree), 'S')
for sep in sep_tree:
    for subtree in sep.subtrees():
        if subtree.label() == "S":
            print(subtree)
            subtexts.add(' '.join(subtree.leaves()))
            #break
subtexts = list(subtexts)
print(subtexts)
The output I get is
['He won the Gusher Marathon , finishing in 30 minutes .', 'finishing in 30 minutes']
I don't want to manipulate this at the string level but at the tree level, so the expected output is:
["He won the Gusher Marathon ,.", "finishing in 30 minutes."]
Solution
Here is my sample input:
import nltk

# Note: the calls below need the NLTK tokenizer, POS tagger, and
# named-entity chunker data, which can be fetched with nltk.download().
a = '''
FREEDOM FROM RELIGION FOUNDATION
Darwin fish bumper stickers and assorted other atheist paraphernalia are
available from the Freedom From Religion Foundation in the US.
EVOLUTION DESIGNS
Evolution Designs sell the "Darwin fish". It's a fish symbol, like the ones
Christians stick on their cars, but with feet and the word "Darwin" written
inside. The deluxe moulded 3D plastic fish is $4.95 postpaid in the US.
'''
sentences = nltk.sent_tokenize(a)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
tagged_sentences = nltk.pos_tag_sents(sentences)
chunked_sentences = list(nltk.ne_chunk_sents(tagged_sentences))
for sent in chunked_sentences:
    # Keep only the subtrees whose label is 'S'
    for subtree in sent.subtrees(filter=lambda t: t.label() == 'S'):
        print(subtree)
Here is my output:
(S
(ORGANIZATION FREEDOM/NN)
(ORGANIZATION FROM/NNP)
RELIGION/NNP
FOUNDATION/NNP
Darwin/NNP
fish/JJ
bumper/NN
stickers/NNS
and/CC
assorted/VBD
other/JJ
atheist/JJ
paraphernalia/NNS
are/VBP
available/JJ
from/IN
the/DT
(ORGANIZATION Freedom/NN From/NNP Religion/NNP Foundation/NNP)
in/IN
the/DT
(GSP US/NNP)
./.)
(S
(ORGANIZATION EVOLUTION/NNP)
(ORGANIZATION DESIGNS/NNP Evolution/NNP)
Designs/NNP
sell/VB
the/DT
``/``
(PERSON Darwin/NNP)
fish/NN
''/''
./.)
(S
It/PRP
's/VBZ
a/DT
fish/JJ
symbol/NN
,/,
like/IN
the/DT
ones/NNS
Christians/NNPS
stick/VBP
on/IN
their/PRP$
cars/NNS
,/,
but/CC
with/IN
feet/NNS
and/CC
the/DT
word/NN
``/``
(PERSON Darwin/NNP)
''/''
written/VBN
inside/RB
./.)
(S
The/DT
deluxe/NN
moulded/VBD
3D/CD
plastic/JJ
fish/NN
is/VBZ
$/$
4.95/CD
postpaid/NN
in/IN
the/DT
(GSP US/NNP)
./.)
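Each chunk tree above contains a single top-level S, so the label filter never meets a nested S. For the CoreNLP tree in the question, where one S sits inside another, a non-overlapping split could be sketched as follows. This is only a minimal sketch, not the original answer's method: `split_on_label` and `tree_str` are hypothetical names introduced here, and the sketch assumes that every leaf belongs to its nearest enclosing S. Under that assumption the final period stays with the outer S, so the result differs slightly from the expected output written in the question.

```python
from nltk.tree import Tree

# The parse from the question (tree_str is a name chosen for this sketch).
tree_str = '''(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))'''


def split_on_label(tree, label="S"):
    """Group leaves by their nearest enclosing `label` subtree.

    Hypothetical helper: returns one list of leaves per `label` node,
    in the order the nodes are first visited. Leaves that are not
    inside any `label` node are dropped.
    """
    groups = []

    def walk(node, bucket):
        for child in node:
            if not isinstance(child, Tree):
                # A leaf (plain string): assign it to the current group.
                if bucket is not None:
                    bucket.append(child)
            elif child.label() == label:
                # A nested `label` node starts its own group.
                nested = []
                groups.append(nested)
                walk(child, nested)
            else:
                # Any other node passes its leaves to the current group.
                walk(child, bucket)

    if tree.label() == label:
        root = []
        groups.append(root)
        walk(tree, root)
    else:
        walk(tree, None)
    return groups


groups = split_on_label(Tree.fromstring(tree_str), "S")
print([' '.join(g) for g in groups])
# ['He won the Gusher Marathon , .', 'finishing in 30 minutes']
```

Because the grouping happens while walking the tree, no string-level post-processing is needed; each leaf is emitted exactly once.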