How to extract all possible noun phrases from text

Problem description

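I want to automatically extract desirable concepts (noun phrases) from text. My plan is to extract all noun phrases, label each one as either desirable or undesirable, and then train a classifier on them. Right now I am trying to extract all possible phrases to use as the training set. For example, given the sentence Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described. I want to get all phrases such as shoulder, richer mix, shoulder of richer mix, junctions, junctions of columns and beams, columns and beams, columns, beams, or any other possible candidate. The desirable phrases are shoulder, junctions, junctions of columns and beams. I don't care about correctness at this step; I just want to build the training set first. Is there an existing tool for this kind of task?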

I tried Rake from rake_nltk, but the results did not include the phrases I want (that is, it did not extract all possible phrases):

from rake_nltk import Rake
data = 'Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.'
r = Rake()
r.extract_keywords_from_text(data)
phrase = r.get_ranked_phrases()
print(phrase)

Result: ['richer mix', 'shoulder', 'required', 'junctions', 'items', 'described', 'columns', 'beams'] (junctions of columns and beams is missing here)
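A likely reason for the gap: RAKE builds its candidates by cutting the text at stopwords and punctuation, so a phrase that contains "of" or "and" can never come out in one piece. The block below is a simplified sketch of that splitting step only, not rake_nltk's actual code. The function name rake_style_candidates and the short stopword list are assumptions made for illustration; rake_nltk relies on NLTK's much larger English stopword list.

```python
import re

# Hypothetical, hand-picked stopword list covering just this sentence.
# rake_nltk itself uses NLTK's full English stopword list.
STOPWORDS = {"where", "a", "of", "is", "at", "these", "or", "and", "the", "are", "so"}


def rake_style_candidates(text, stopwords=STOPWORDS):
    """Return runs of consecutive non-stopwords (simplified RAKE splitting)."""
    candidates = []
    # Punctuation always ends a candidate, so split on it first.
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                # A stopword closes the candidate collected so far.
                if current:
                    candidates.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            candidates.append(" ".join(current))
    return candidates


data = ('Where a shoulder of richer mix is required at these junctions, '
        'or at junctions of columns and beams, the items are so described.')
candidates = rake_style_candidates(data)
print(candidates)
```

Because "of" and "and" act as cut points, junctions of columns and beams is split into junctions, columns and beams, which matches the Rake result above.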

I also tried phrasemachine, and it missed some desirable phrases as well:

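import spacy
import phrasemachine

# Load a spaCy pipeline; the original snippet used nlp without defining it.
nlp = spacy.load('en_core_web_sm')
doc = nlp(data)  # data is the sentence defined in the Rake example above
tokens = [token.text for token in doc]
pos = [token.pos_ for token in doc]
out = phrasemachine.get_phrases(tokens=tokens, postags=pos, output="token_spans")
print(out['token_spans'])
# pop() takes spans from the end of the list, so phrases print in reverse order
while len(out['token_spans']):
    start,end = out['token_spans'].pop()
    print(tokens[start:end])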

Result:

[(2, 6), (4, 6), (14, 17)]
['junctions', 'of', 'columns']
['richer', 'mix']
['shoulder', 'of', 'richer', 'mix'] 

(Many noun phrases are missing here.)

Tags: python, nlp, spacy, named-entity-recognition, information-extraction

Solution


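You may want to use the noun_chunks attribute. Each noun chunk is a base noun phrase; taking the span from the left edge to the right edge of the chunk root's subtree also gives the longer phrase built on that noun: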

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.')

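phrases = set()
for nc in doc.noun_chunks:
    # The base noun phrase, e.g. 'a shoulder'
    phrases.add(nc.text)
    # The full subtree of the chunk's root noun, e.g. 'a shoulder of richer mix'
    phrases.add(doc[nc.root.left_edge.i:nc.root.right_edge.i+1].text)
print(phrases)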
{'junctions of columns and beams', 'junctions', 'the items', 'a shoulder', 'columns', 'richer mix', 'beams', 'columns and beams', 'a shoulder of richer mix', 'these junctions'}
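The set above contains a shoulder and these junctions, while the question asks for shoulder and junctions. One possible post-processing step, sketched below, strips leading determiners from each phrase. The helper strip_leading_determiner and the DETERMINERS list are assumptions made for illustration and are not part of spaCy; the input set is copied from the output printed above.

```python
# Assumed, non-exhaustive list of determiners to strip from the front of a phrase.
DETERMINERS = {"a", "an", "the", "this", "that", "these", "those"}


def strip_leading_determiner(phrase):
    """Drop determiners at the start of a phrase, keeping at least one word."""
    words = phrase.split()
    while len(words) > 1 and words[0].lower() in DETERMINERS:
        words = words[1:]
    return " ".join(words)


# Phrases as printed by the noun_chunks example above.
phrases = {
    'junctions of columns and beams', 'junctions', 'the items', 'a shoulder',
    'columns', 'richer mix', 'beams', 'columns and beams',
    'a shoulder of richer mix', 'these junctions',
}

# A set comprehension also merges duplicates such as
# 'these junctions' and 'junctions'.
normalized = {strip_leading_determiner(p) for p in phrases}
print(sorted(normalized))
```

When working on the spaCy Doc directly, checking token.pos_ == 'DET' on the first token of each span would avoid the hand-written word list.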
