How to get the nodes of an NLTK tree without the grammatical form?

Question

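I managed to write a class that builds a tree from spaCy output, and I would like the nodes to keep only the words rather than the full grammatical form. That is, to have start instead of start_VB_ROOT.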

More generally, for the sentence "When did Beyonce start becoming popular?", the input is:

[Tree('start_VB_ROOT', ['When_WRB_advmod', 'did_VBD_aux', 'Beyonce_NNP_nsubj', Tree('becoming_VBG_xcomp', ['popular_JJ_acomp']), '?_._punct'])]

The expected output of the code I provide below would be a tree:

<class 'str'> When_WRB_advmod
son creation : When
<class 'str'> did_VBD_aux
son creation : did
<class 'str'> Beyonce_NNP_nsubj
son creation : Beyonce
<class 'nltk.tree.Tree'> (becoming_VBG_xcomp popular_JJ_acomp)
sub tree creation
son: becoming_VBG_xcomp
<class 'str'> popular_JJ_acomp
son creation popular
end of sub tree creation
<class 'str'> ?_._punct
son creation ?

Here is the code:

from nltk import Tree

class WordTree:
    '''Tree for spaCy dependency parsing array'''
    def __init__(self, array, parent = None):
        """
        Construct a new 'WordTree' object.

        :param array: The array containing the dependency
        :param parent: The parent of the array if exists
        :return: returns nothing
        """
        self.parent = []
        self.children = []
        self.data = array

        for element in array[0]:
            print(type(element),element)
            # we check if we got a subtree
            if type(element) is Tree:
                print("sub tree creation")
                self.children.append(element.label())
                print("son:",element.label())
                t = WordTree([element],element.label()) # should I verify if parent is empty ?
                print("end of sub tree creation")
            # else if we have a string we create a son
            elif type(element) is str:
                print("son creation",element)
                self.children.append(element)
            # in other case we have a problem
            else:
                print("issue?")
                break

It currently gives the following output:

<class 'str'> When_WRB_advmod
son creation When_WRB_advmod
<class 'str'> did_VBD_aux
son creation did_VBD_aux
<class 'str'> Beyonce_NNP_nsubj
son creation Beyonce_NNP_nsubj
<class 'nltk.tree.Tree'> (becoming_VBG_xcomp popular_JJ_acomp)
sub tree creation
son: becoming_VBG_xcomp
<class 'str'> popular_JJ_acomp
son creation popular_JJ_acomp
end of sub tree creation
<class 'str'> ?_._punct
son creation ?_._punct

Tags: python-3.x, spacy

Answer


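First, note that the "grammatical form" in the question is really the surface token with its POS tag and dependency label appended by the asker's own formatting. In that case, you should just retrieve Tree.leaves() and Tree.label() from nltk and strip the suffixes yourself.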
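
As a minimal sketch of what these two accessors return on the tree from the question: `strip_tags` is a hypothetical helper name introduced here, and the code assumes every label and leaf has the form `word_POS_dep` with no underscore inside the word itself.

```python
from nltk import Tree

parse = Tree('start_VB_ROOT',
             ['When_WRB_advmod', 'did_VBD_aux', 'Beyonce_NNP_nsubj',
              Tree('becoming_VBG_xcomp', ['popular_JJ_acomp']),
              '?_._punct'])

def strip_tags(node):
    """Hypothetical helper: keep only the text before the first underscore."""
    return node.split('_')[0]

# label() is the string stored at the root of the (sub)tree.
root_word = strip_tags(parse.label())
# leaves() flattens the tree and returns only the leaf strings.
leaf_words = [strip_tags(leaf) for leaf in parse.leaves()]

print(root_word)   # start
print(leaf_words)  # ['When', 'did', 'Beyonce', 'popular', '?']
```

Note that leaves() does not include the labels of inner subtrees, so "becoming" is missing from the list; recovering it requires walking the tree.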

That said, it would be easier to work with the raw output of the spaCy parser than to manipulate the data format used in the question.

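Please see "How to Traverse an NLTK Tree object?" before continuing, and consider using recursion (without classes) for the depth-first traversal.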

For future readers: read the comments on the question before proceeding with the answer below.


If you simply want to strip the POS and dependency labels from the leaves, try the following:

from nltk import Tree

parse = Tree('start_VB_ROOT',
             ['When_WRB_advmod', 'did_VBD_aux', 'Beyonce_NNP_nsubj',
              Tree('becoming_VBG_xcomp', ['popular_JJ_acomp']),
              '?_._punct'])

def traverse_tree(tree):
    for subtree in tree:
        print(type(subtree), subtree)
        if isinstance(subtree, Tree):
            # Recurse into the subtree (depth-first).
            print('sub tree creation')
            traverse_tree(subtree)
            print('end of sub tree creation')
        elif isinstance(subtree, str):
            # Keep only the surface form, dropping the POS and dependency tags.
            surface_form = subtree.split('_')[0]
            print('son creation:', surface_form)

traverse_tree(parse)

[out]:

<class 'str'> When_WRB_advmod
son creation: When
<class 'str'> did_VBD_aux
son creation: did
<class 'str'> Beyonce_NNP_nsubj
son creation: Beyonce
<class 'nltk.tree.Tree'> (becoming_VBG_xcomp popular_JJ_acomp)
sub tree creation
<class 'str'> popular_JJ_acomp
son creation: popular
end of sub tree creation
<class 'str'> ?_._punct
son creation: ?
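
The traversal above only prints the surface forms. If the goal is a tree whose nodes hold only the words, one possible sketch is to rebuild the tree recursively. `strip_tree` is a hypothetical name, not part of nltk, and the code assumes every label and leaf has the form `word_POS_dep`.

```python
from nltk import Tree

def strip_tree(node):
    """Return a copy of node with the POS and dependency suffixes removed."""
    if isinstance(node, Tree):
        # Strip the subtree's own label, then rebuild its children recursively.
        word = node.label().split('_')[0]
        return Tree(word, [strip_tree(child) for child in node])
    # A leaf is a plain string such as 'When_WRB_advmod'.
    return node.split('_')[0]

parse = Tree('start_VB_ROOT',
             ['When_WRB_advmod', 'did_VBD_aux', 'Beyonce_NNP_nsubj',
              Tree('becoming_VBG_xcomp', ['popular_JJ_acomp']),
              '?_._punct'])

clean = strip_tree(parse)
print(clean.label())   # start
print(clean.leaves())  # ['When', 'did', 'Beyonce', 'popular', '?']
```

Because the function returns a new Tree instead of printing, the original parse is left untouched and the result can be passed to any code that expects an nltk Tree.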
