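python - Extracting compound noun-adjective pairs from a sentence
Problem description
I want to extract compound noun-adjective pairs from a sentence. Basically, I want something like the following.
For adjectives: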
"The company's customer service was terrible."
{customer service, terrible}
For verbs:
"They kept increasing my phone bill"
{phone bill, increasing}
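However, I am trying to use spaCy to find the adjectives and verbs that correspond to multi-token phrases or compound nouns (such as "customer service").
I am not sure how to do this with spaCy, NLTK, or any other prepackaged natural language processing software, and I would appreciate any help!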
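Solution
For simple examples like these, you can use spaCy's dependency parsing together with a few simple rules.
First, to identify multi-word nouns similar to the examples given, you can use the "compound" dependency. After parsing a document (such as a sentence) with spaCy, use a token's dep_ attribute to find its dependency label.
For example, this sentence has two compound nouns:
"The compound dependency identifies compound nouns."
Each token and its dependency label are shown below: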
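import spacy

# The 'en' shortcut was removed in spaCy v3, so load the model by its full name.
# Install it first with: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
example_doc = nlp("The compound dependency identifies compound nouns.")
for tok in example_doc:
    print(tok.i, tok, "[", tok.dep_, "]")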
>>>0 The [ det ]
>>>1 compound [ compound ]
>>>2 dependency [ nsubj ]
>>>3 identifies [ ROOT ]
>>>4 compound [ compound ]
>>>5 nouns [ dobj ]
>>>6 . [ punct ]
# Get the compound tokens in the doc, then slice from each one through its head noun
for tok in [tok for tok in example_doc if tok.dep_ == 'compound']:
    noun = example_doc[tok.i: tok.head.i + 1]
    print(noun)
>>>compound dependency
>>>compound nouns
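To see why the slice example_doc[tok.i: tok.head.i + 1] returns the whole compound noun, here is a minimal sketch in plain Python that does not need spaCy. The PARSE list is a hand-written stand-in for the parse printed above: the dependency labels are copied from that output, while the head indices are my assumption about what the parser assigned. compound_spans is a hypothetical helper name, not part of spaCy.

```python
# Hand-written stand-in for a parsed sentence: (text, dependency label, head index).
# Labels follow the printed parse; the head indices are assumed, not parser output.
PARSE = [
    ("The", "det", 2),
    ("compound", "compound", 2),
    ("dependency", "nsubj", 3),
    ("identifies", "ROOT", 3),
    ("compound", "compound", 5),
    ("nouns", "dobj", 5),
    (".", "punct", 3),
]


def compound_spans(parse):
    """Return the text from each compound modifier through its head noun."""
    spans = []
    for i, (_text, dep, head) in enumerate(parse):
        if dep == "compound":
            # Same slice as doc[tok.i: tok.head.i + 1], applied to a plain list.
            words = [text for text, _dep, _head in parse[i:head + 1]]
            spans.append(" ".join(words))
    return spans


print(compound_spans(PARSE))
```

The slice relies on the compound modifier appearing before its head noun, which is the case for English noun compounds like these two.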
The function below works for your examples. However, it may not work for more complex sentences.
adj_doc = nlp("The company's customer service was terrible.")
verb_doc = nlp("They kept increasing my phone bill")

def get_compound_pairs(doc, verbose=False):
    """Return tuples of (multi-noun word, adjective or verb) for document."""
    # Get list of compounds in doc
    compounds = [tok for tok in doc if tok.dep_ == 'compound']
    # Remove middle parts of compound nouns, but avoid index errors
    compounds = [c for c in compounds if c.i == 0 or doc[c.i - 1].dep_ != 'compound']
    tuple_list = []
    if compounds:
        for tok in compounds:
            pair_item_1, pair_item_2 = (False, False)  # initialize false variables
            noun = doc[tok.i: tok.head.i + 1]
            pair_item_1 = noun
            # If noun is in the subject, we may be looking for adjective in predicate
            # In simple cases, this would mean that the noun shares a head with the adjective
            if noun.root.dep_ == 'nsubj':
                adj_list = [r for r in noun.root.head.rights if r.pos_ == 'ADJ']
                if adj_list:
                    pair_item_2 = adj_list[0]
                if verbose:  # For trying different dependency tree parsing rules
                    print("Noun: ", noun)
                    print("Noun root: ", noun.root)
                    print("Noun root head: ", noun.root.head)
                    print("Noun root head rights: ", [r for r in noun.root.head.rights if r.pos_ == 'ADJ'])
            # If noun is a direct object, pair it with its nearest verb ancestor
            if noun.root.dep_ == 'dobj':
                verb_ancestor_list = [a for a in noun.root.ancestors if a.pos_ == 'VERB']
                if verb_ancestor_list:
                    pair_item_2 = verb_ancestor_list[0]
                if verbose:  # For trying different dependency tree parsing rules
                    print("Noun: ", noun)
                    print("Noun root: ", noun.root)
                    print("Noun root head: ", noun.root.head)
                    print("Noun root head verb ancestors: ", [a for a in noun.root.ancestors if a.pos_ == 'VERB'])
            if pair_item_1 and pair_item_2:
                tuple_list.append((pair_item_1, pair_item_2))
    return tuple_list

get_compound_pairs(adj_doc)
>>>[(customer service, terrible)]
get_compound_pairs(verb_doc)
>>>[(phone bill, increasing)]
get_compound_pairs(example_doc, verbose=True)
>>>Noun: compound dependency
>>>Noun root: dependency
>>>Noun root head: identifies
>>>Noun root head rights: []
>>>Noun: compound nouns
>>>Noun root: nouns
>>>Noun root head: identifies
>>>Noun root head verb ancestors: [identifies]
>>>[(compound nouns, identifies)]
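The second list comprehension in get_compound_pairs keeps only the first token of a multi-part compound, so a three-word noun is not reported twice. The rough illustration below, in plain Python without spaCy, uses two hand-written parses of "The customer service team left". Both are hypothetical: I have not checked which attachment a real spaCy model produces, and first_compound_indices and span_text are names made up for this sketch.

```python
def first_compound_indices(parse):
    """Indices of compound tokens that are not preceded by another compound token."""
    return [
        i for i, (_text, dep, _head) in enumerate(parse)
        if dep == "compound" and (i == 0 or parse[i - 1][1] != "compound")
    ]


def span_text(parse, i):
    """Text from token i through its head, like doc[tok.i: tok.head.i + 1]."""
    head = parse[i][2]
    return " ".join(text for text, _dep, _head in parse[i:head + 1])


# Each entry is (text, dependency label, head index). Assumed labels, not parser output.
# FLAT: both modifiers attach directly to "team".
FLAT = [("The", "det", 3), ("customer", "compound", 3), ("service", "compound", 3),
        ("team", "nsubj", 4), ("left", "ROOT", 4)]
# NESTED: "customer" attaches to "service", which attaches to "team".
NESTED = [("The", "det", 3), ("customer", "compound", 2), ("service", "compound", 3),
          ("team", "nsubj", 4), ("left", "ROOT", 4)]

for name, parse in (("flat", FLAT), ("nested", NESTED)):
    starts = first_compound_indices(parse)
    print(name, [span_text(parse, i) for i in starts])
```

Under the flat attachment the slice covers "customer service team", but under the nested one it stops at "customer service", whose last token is itself a compound modifier rather than a subject or object. This is one way the function could miss pairs in more complex sentences.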