Extracting last names from a list of full names using Python / Pandas and possibly regular expressions

Problem description

I am working with a dataset and have ended up with a list of names of the following form:

s = ['DR. James Coffins',
     'Zacharias Pallefas',
     'Matthew Ebnel',
     'Ranzzith Redly',
     'GEORGE GEORGIADAKIS',
     'HARISH KUMARAN K',
     'Christiaan Kraanlen, CFA',
     'Mary K. Lein, CFA, COL',
     'Alexandre Cegra,  CFA,  CAIA',
     'Anna Bely']

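I need to extract the last names and put them in a separate list (or in a column of a pandas DataFrame). However, the many different forms the full names take confuse me, and I am new to Python.

A possible algorithm is as follows: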

Loop through the elements of the list. For each element, split it into subelements on spaces. Then:

a) If there are four or fewer subelements, start from the beginning and examine the first four subelements.
a1) If the first subelement is longer than two letters: if the second subelement is longer than one letter, return the second subelement; otherwise, return the third subelement.
a2) If the first subelement is two letters long, drop it and repeat step a1.

Tags: python, string, list, pandas

Solution


How about always taking the second word of each name, after skipping any word that contains a '.' or appears in the exclude list ['dr', 'mr', 'mrs', 'miss', 'prof']? With the ten names above as separate elements of s:

>>> exclude_tags = ['dr', 'mr', 'mrs', 'miss', 'prof']
>>> [[y for y in x.split() if '.' not in y and y.lower() not in exclude_tags][1].rstrip(',').capitalize() for x in s]
['Coffins', 'Pallefas', 'Ebnel', 'Redly', 'Georgiadakis', 'Kumaran', 'Kraanlen', 'Lein', 'Cegra', 'Bely']
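
The one-liner raises an IndexError for any name that has fewer than two words left after filtering. A minimal sketch of the same heuristic as a reusable function might look like the following. The function name extract_last_name, the None fallback, and the single-word sample 'Madonna' are illustrative assumptions, not part of the answer above.

```python
# Same exclude list as in the answer, stored as a set for fast lookups.
EXCLUDE_TAGS = {'dr', 'mr', 'mrs', 'miss', 'prof'}


def extract_last_name(full_name):
    """Return the second remaining word of full_name, or None if there is none.

    Hypothetical helper: the name and the None fallback are illustrative choices.
    """
    # Drop words containing '.' (e.g. 'DR.', 'K.') and bare titles such as 'Mr'.
    words = [
        w for w in full_name.split()
        if '.' not in w and w.lower() not in EXCLUDE_TAGS
    ]
    if len(words) < 2:
        # Nothing usable in second position, e.g. a single-word name.
        return None
    # Strip a trailing comma ('Lein,') and normalise the case ('KUMARAN').
    return words[1].rstrip(',').capitalize()


# Three names from the question plus one made-up single-word entry.
names = ['DR. James Coffins', 'HARISH KUMARAN K', 'Mary K. Lein, CFA, COL', 'Madonna']
last_names = [extract_last_name(n) for n in names]
print(last_names)  # ['Coffins', 'Kumaran', 'Lein', None]
```

If the names live in a pandas DataFrame, the same function could be applied to a column, for example df['last_name'] = df['full_name'].apply(extract_last_name), where both column names are hypothetical. The heuristic still assumes the last name is the second remaining word, so a middle name written out in full (such as 'Mary Kate Lein') would be returned in place of the last name.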
