Extracting/parsing pronoun-pronoun and verb-noun/pronoun combinations from a sentence

Problem description

Problem:
I am trying to extract a list of proper nouns from a job description, such as the one below.

text = "Civil, Mechanical, and Industrial Engineering majors are preferred."

I would like to extract the following from this text:

Civil Engineering
Mechanical Engineering
Industrial Engineering

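This is only one instance of the problem, so application-specific information cannot be used. For example, I cannot keep a list of majors and then check whether parts of those names appear in the sentence together with the word "majors", because I need this to work for other sentences too.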

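Attempts
1. I looked into spaCy dependency parsing, but it does not show a parent-child relationship between each engineering type (Civil, Mechanical, Industrial) and the word Engineering.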

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Civil, Mechanical, and Industrial Engineering majors are preferred.")

# Print each token's dependency label, its head, the head's POS, and its children.
row = "%-15s%-15s%-15s%-15s%-30s"
print(row % ("TEXT", "DEP", "HEAD TEXT", "HEAD POS", "CHILDREN"))
for token in doc:
    if token.text not in (',', '.'):
        print(row % (
            token.text,
            token.dep_,
            token.head.text,
            token.head.pos_,
            ','.join(str(c) for c in token.children),
        ))

...输出...

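TEXT           DEP            HEAD TEXT      HEAD POS       CHILDREN
Civil          amod           majors         NOUN           ,,Mechanical
Mechanical     conj           Civil          ADJ            ,,and
and            cc             Mechanical     PROPN
Industrial     compound       Engineering    PROPN
Engineering    compound       majors         NOUN           Industrial
majors         nsubjpass      preferred      VERB           Civil,Engineering
are            auxpass        preferred      VERB
preferred      ROOT           preferred      VERB           majors,are,.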
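2. I also tried NLTK POS tagging, but I get the following...

import nltk
nltk.pos_tag( nltk.word_tokenize( 'Civil, Mechanical, and Industrial Engineering majors are preferred.' ) )

[('Civil', 'NNP'),
 (',', ','),
 ('Mechanical', 'NNP'),
 (',', ','),
 ('and', 'CC'),
 ('Industrial', 'NNP'),
 ('Engineering', 'NNP'),
 ('majors', 'NNS'),
 ('are', 'VBP'),
 ('preferred', 'VBN'),
 ('.', '.')]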

The types of engineering and the word Engineering all come up as NNP (proper nouns), so any kind of RegexpParser pattern I can think of does not work.
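
Even when every word is tagged NNP, the positions of the tokens around the conjunction still carry usable information. The following is only a minimal sketch under narrow assumptions: `expand_coordination` is a hypothetical helper (not part of NLTK), the tagged list is copied from the output above, each modifier is a single NNP token, and the shared head is whatever NNP tokens follow the last modifier.

```python
def expand_coordination(tagged):
    """Expand 'A, B, and C Head' into ['A Head', 'B Head', 'C Head'].

    `tagged` is a list of (word, tag) pairs such as nltk.pos_tag returns.
    """
    results = []
    for idx, (_, tag) in enumerate(tagged):
        if tag != "CC":
            continue
        # Walk left from the conjunction over NNP tokens and commas.
        modifiers = []
        j = idx - 1
        while j >= 0 and tagged[j][1] in ("NNP", ","):
            if tagged[j][1] == "NNP":
                modifiers.insert(0, tagged[j][0])
            j -= 1
        # Walk right: the first NNP is the last modifier,
        # the remaining consecutive NNPs are assumed to be the shared head.
        right = []
        k = idx + 1
        while k < len(tagged) and tagged[k][1] == "NNP":
            right.append(tagged[k][0])
            k += 1
        if modifiers and len(right) >= 2:
            modifiers.append(right[0])
            head = " ".join(right[1:])
            results.extend(f"{m} {head}" for m in modifiers)
    return results


tagged = [
    ("Civil", "NNP"), (",", ","), ("Mechanical", "NNP"), (",", ","),
    ("and", "CC"), ("Industrial", "NNP"), ("Engineering", "NNP"),
    ("majors", "NNS"), ("are", "VBP"), ("preferred", "VBN"), (".", "."),
]
print(expand_coordination(tagged))
```

This heuristic depends entirely on the tagger output and on the list shape, so multi-word modifiers or a differently tagged head would need additional rules.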

Question:
Does anyone know of a way - in Python 3 - to extract these noun phrase pairings?

EDIT: Additional Examples

The following examples are similar to the first example, except these are verb-noun / verb-propernoun versions.

text="Experience with testing and automating API’s/GUI’s for desktop and native iOS/Android"

Extract:

testing API’s/GUI’s
automation API’s/GUI’s

text="Design, build, test, deploy and maintain effective test automation solutions"

Extract:

Design test automation solutions
build test automation solutions
test test automation solutions
deploy test automation solutions
maintain test automation solutions

Tags: python, python-3.x, nlp, nltk, spacy

Solution


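Without any external imports, and assuming the list is always comma-separated with an optional "and" before the last item, you can write a regular expression and do some string manipulation to get the desired output: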

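import re

test_string = "Civil, Mechanical, and Industrial Engineering majors are preferred."
# Group 1: the comma-terminated modifiers; group 2: the last modifier;
# group 3: the shared word. "(?:and )?" keeps the "and" genuinely optional.
result = re.search(r"((?:[A-Z][a-z]+, )+)(?:and )?([A-Z][a-z]+) ([A-Z][a-z]+)", test_string)
group_type = result.group(3)
# Split on commas instead of using strip('and '), which strips characters, not the word.
modifiers = [m.strip() for m in result.group(1).split(',') if m.strip()] + [result.group(2)]
items = [m + ' ' + group_type for m in modifiers]

print(items)  # ['Civil Engineering', 'Mechanical Engineering', 'Industrial Engineering']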

Again, this all rests on narrow assumptions about the list format. If there are more possibilities, you may need to modify the regex pattern.
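
As one possible direction for such a modification, the pattern can be parameterized so that it also covers lists without a comma before "and" and lowercase items such as the verb example from the question. This is a sketch, not a general parser: `expand_list` and its `item` and `tail` parameters are hypothetical names introduced here, items are assumed to be single words, and the caller has to say what the shared tail looks like.

```python
import re


def expand_list(text, item=r"[A-Z][a-z]+", tail=r"[A-Z][a-z]+"):
    """Distribute a shared tail over a list shaped like 'A, B(,) and C <tail>'.

    `item` and `tail` are regex fragments; the defaults match single
    capitalized words, mirroring the assumption in the answer above.
    """
    pattern = re.compile(
        rf"(?P<items>{item}(?:, {item})*),? (?:and|or) (?P<last>{item}) (?P<tail>{tail})"
    )
    match = pattern.search(text)
    if match is None:
        return []
    items = match.group("items").split(", ") + [match.group("last")]
    return [f"{i} {match.group('tail')}" for i in items]


print(expand_list("Civil, Mechanical, and Industrial Engineering majors are preferred."))

# Verb list without a comma before "and": any word is an item, and the
# tail is assumed to run until the next comma, period, or semicolon.
print(expand_list(
    "Design, build, test, deploy and maintain effective test automation solutions",
    item=r"\w+",
    tail=r"[^,.;]+",
))
```

In the second call the tail is "effective test automation solutions", so the adjective "effective" is kept, unlike in the output requested in the question. Dropping it would need part-of-speech information that a plain regex does not have.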

