首页 > 解决方案 > 希望从句子中提取复合名词-形容词对。所以,基本上我想要类似的东西:

问题描述

对于形容词:

"The company's customer service was terrible."
{customer service, terrible}

对于动词:

"They kept increasing my phone bill"
{phone bill, increasing}

这是这个帖子的一个分支问题

但是,我正在尝试使用 spacy 查找与多标记短语/复合名词(例如“客户服务”)相对应的 adj 和动词。

我不确定如何使用 spacy、nltk 或任何其他预打包的自然语言处理软件来做到这一点,我将不胜感激!

标签: pythonnltkspacy

解决方案


对于像这样的简单示例,您可以使用 spaCy 的依赖项解析和一些简单的规则。

首先,要识别与给出的示例相似的多词名词,您可以使用“复合”依赖。用 spaCy 解析文档(例如句子)后,使用标记的dep_属性来查找它的依赖关系。

例如,这句话有两个复合名词:

“复合依赖标识复合名词。”

每个令牌及其依赖关系如下所示:

import spacy
import pandas as pd
nlp = spacy.load('en')

example_doc = nlp("The compound dependency identifies compound nouns.")
for tok in example_doc:
    print(tok.i, tok, "[", tok.dep_, "]")

>>>0 The [ det ]
>>>1 compound [ compound ]
>>>2 dependency [ nsubj ]
>>>3 identifies [ ROOT ]
>>>4 compound [ compound ]
>>>5 nouns [ dobj ]
>>>6 . [ punct ]
for tok in [tok for tok in example_doc if tok.dep_ == 'compound']: # Get list of 
compounds in doc
    noun = example_doc[tok.i: tok.head.i + 1]
    print(noun)
>>>compound dependency
>>>compound nouns

以下函数适用于您的示例。但是,它可能不适用于更复杂的句子。

adj_doc = nlp("The company's customer service was terrible.")
verb_doc = nlp("They kept increasing my phone bill")

def get_compound_pairs(doc, verbose=False):
    """Return tuples of (multi-noun word, adjective or verb) for document."""
    compounds = [tok for tok in doc if tok.dep_ == 'compound'] # Get list of compounds in doc
    compounds = [c for c in compounds if c.i == 0 or doc[c.i - 1].dep_ != 'compound'] # Remove middle parts of compound nouns, but avoid index errors
    tuple_list = []
    if compounds: 
        for tok in compounds:
            pair_item_1, pair_item_2 = (False, False) # initialize false variables
            noun = doc[tok.i: tok.head.i + 1]
            pair_item_1 = noun
            # If noun is in the subject, we may be looking for adjective in predicate
            # In simple cases, this would mean that the noun shares a head with the adjective
            if noun.root.dep_ == 'nsubj':
                adj_list = [r for r in noun.root.head.rights if r.pos_ == 'ADJ']
                if adj_list:
                    pair_item_2 = adj_list[0] 
                if verbose == True: # For trying different dependency tree parsing rules
                    print("Noun: ", noun)
                    print("Noun root: ", noun.root)
                    print("Noun root head: ", noun.root.head)
                    print("Noun root head rights: ", [r for r in noun.root.head.rights if r.pos_ == 'ADJ'])
            if noun.root.dep_ == 'dobj':
                verb_ancestor_list = [a for a in noun.root.ancestors if a.pos_ == 'VERB']
                if verb_ancestor_list:
                    pair_item_2 = verb_ancestor_list[0]
                if verbose == True: # For trying different dependency tree parsing rules
                    print("Noun: ", noun)
                    print("Noun root: ", noun.root)
                    print("Noun root head: ", noun.root.head)
                    print("Noun root head verb ancestors: ", [a for a in noun.root.ancestors if a.pos_ == 'VERB'])
            if pair_item_1 and pair_item_2:
                tuple_list.append((pair_item_1, pair_item_2))
    return tuple_list

get_compound_pairs(adj_doc)
>>>[(customer service, terrible)]
get_compound_pairs(verb_doc)
>>>[(phone bill, increasing)]
get_compound_pairs(example_doc, verbose=True)
>>>Noun:  compound dependency
>>>Noun root:  dependency
>>>Noun root head:  identifies
>>>Noun root head rights:  []
>>>Noun:  compound nouns
>>>Noun root:  nouns
>>>Noun root head:  identifies
>>>Noun root head verb ancestors:  [identifies]
>>>[(compound nouns, identifies)]

推荐阅读