首页 > 解决方案 > 在没有 NLTK 的情况下使用 Python 解析词性标记的树语料库

问题描述

我有如下的树语料库

(TOP END_OF_TEXT_UNIT)

(TOP (S (NP (DT The)
            (NNP Fulton)
            (NNP County)
            (NNP Grand)
            (NNP Jury))
        (VP (VBD said)
            (NP (NNP Friday))
            (SBAR (-NONE- 0)
                  (S (NP (DT an)
                         (NN investigation)
                         (PP (IN of)
                             (NP (NP (NNP Atlanta))
                                 (POS 's)
                                 (JJ recent)
                                 (JJ primary)
                                 (NN election))))
                     (VP (VBD produced)
                         (NP (`` ``)
                             (DT no)
                             (NN evidence)
                             ('' '')
                             (SBAR (IN that)
                                   (S (NP (DT any)
                                          (NNS irregularities))
                                      (VP (VBD took)
                                          (NP (NN place)))))))))))
     (. .))

我需要解析这棵树并转换成如下的句子形式

DT The NNP Fulton NNP County NNP Grand NNP Jury VBD said NNP Friday DT
an NN investigation ...

是否有任何算法来解析上述内容,或者我们需要使用正则表达式来执行此操作,我不想使用 NLTK 包来执行此操作。

标签: pythonregexparsingnlptext-mining

解决方案


Pyparsing 可以快速完成嵌套表达式解析。

import pyparsing as pp

LPAR, RPAR = map(pp.Suppress, "()")
expr = pp.Forward()
label = pp.Word(pp.alphas.upper()+'-') | "''" | "``" | "."
word = pp.Literal(".") | "''" | "``" | pp.Word(pp.printables, excludeChars="()")

expr <<= LPAR + label + (word | pp.OneOrMore(expr)) + RPAR

sample = """
(TOP (S (NP (DT The)
            (NNP Fulton)
            (NNP County)
            (NNP Grand)
            (NNP Jury))
        (VP (VBD said)
            (NP (NNP Friday))
            (SBAR (-NONE- 0)
                  (S (NP (DT an)
                         (NN investigation)
                         (PP (IN of)
                             (NP (NP (NNP Atlanta))
                                 (POS 's)
                                 (JJ recent)
                                 (JJ primary)
                                 (NN election))))
                     (VP (VBD produced)
                         (NP (`` ``)
                             (DT no)
                             (NN evidence)
                             ('' '')
                             (SBAR (IN that)
                                   (S (NP (DT any)
                                          (NNS irregularities))
                                      (VP (VBD took)
                                          (NP (NN place)))))))))))
     (. .))
"""

result = pp.OneOrMore(expr).parseString(sample)
print(' '.join(result))

印刷:

TOP S NP DT The NNP Fulton NNP County NNP Grand NNP Jury VP VBD said NP NNP Friday SBAR -NONE- 0 S NP DT an NN investigation PP IN of NP NP NNP Atlanta POS 's JJ recent JJ primary NN election VP VBD produced NP `` `` DT no NN evidence '' '' SBAR IN that S NP DT any NNS irregularities VP VBD took NP NN place . .

通常,像这样的解析器将用于pp.Group(expr)保留嵌套元素的分组。但是在您的情况下,由于您最终还是想要一个平面列表,我们只是将其省略 - pyparsing 的默认行为是只返回匹配字符串的平面列表。


推荐阅读