Apply a function to the first element of each group, then merge the result back

Problem description

(Sorry, I realize this is not a very descriptive title.)

Given a dataset like the following:

       word  entity
0   Charlie      1
1        p.      1
2    Nelson      1
3     loves   None
4      Dana      2
5        c.      2
6  anderson      2
7       and   None
8     james      3

I want to apply a function (e.g. get_gender()) to the first element of each entity (I think I need some kind of groupby)

to get something like this:

       word entity gender
0   Charlie      1      m
1        p.      1   None
2    Nelson      1   None
3     loves   None   None
4      Dana      2      f
5        c.      2   None
6  anderson      2   None
7       and   None   None
8     james      3      m

and finally fill in the missing rows of each entity to get

       word entity gender
0   Charlie      1      m
1        p.      1      m
2    Nelson      1      m
3     loves   None   None
4      Dana      2      f
5        c.      2      f
6  anderson      2      f
7       and   None   None
8     james      3      m

Here is some code to generate the dataframe above:

import pandas as pd
df  = pd.DataFrame([("Charlie", "p.", "Nelson", "loves", "Dana", "c.", "anderson", "and", "james"), (1,1,1, None, 2,2,2, None, 3)]).transpose()
df.columns = ["word", "entity"]

The current "solution" I am using is:

import gender_guesser.detector as gender
d = gender.Detector()
# Guess the gender of every word that belongs to an entity. The flaw: this runs on
# every word of the entity (including last names), so a single entity can end up
# with several genders (depending on e.g. its middle name).
mask = df['entity'].notnull()
df.loc[mask, 'gender'] = df.loc[mask, 'word'].apply(lambda s: d.get_gender(s.lower().capitalize()))

Tags: python, python-3.x, pandas

Solution


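Rather than relying on the position of a row within its group, you can group by entity, pick the value that is not None from each group, and then join the result back onto the original DataFrame.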

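import pandas as pd

# Same frame as in the question, with gender already set on the first word of each entity.
df = pd.DataFrame([
    ("Charlie", "p.", "Nelson", "loves", "Dana", "c.", "anderson", "and", "james"),
    (1, 1, 1, None, 2, 2, 2, None, 3),
    ('m', None, None, None, 'f', None, None, None, 'm'),
]).transpose()
df.columns = ["word", "entity", "gender"]

# One row per entity, keeping its non-None gender (the max, if there are several).
df_g = df.groupby('entity').agg({'gender': lambda x: max(filter(None, x))}).reset_index()

# Join the per-entity gender back onto every row of that entity.
pd.merge(df, df_g, on='entity', suffixes=('_x', ''))[['word', 'entity', 'gender']]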

Note, however, that after the groupby and merge, the rows whose entity is None disappear from the result.
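If the rows without an entity need to be kept, one possible variant is to look up the first word of each entity, apply the function once per entity, and map the result back onto every row. The following is only a sketch under assumptions: guess_gender and the KNOWN lookup table are hypothetical stand-ins for d.get_gender(), and it assumes that the first row of each entity (in the original row order) holds the first name.

```python
import pandas as pd

# Same frame as in the question.
df = pd.DataFrame([
    ("Charlie", "p.", "Nelson", "loves", "Dana", "c.", "anderson", "and", "james"),
    (1, 1, 1, None, 2, 2, 2, None, 3),
]).transpose()
df.columns = ["word", "entity"]

# Hypothetical stand-in for d.get_gender(); not part of gender_guesser.
KNOWN = {"charlie": "m", "dana": "f", "james": "m"}

def guess_gender(name):
    """Return 'm' or 'f' for a known first name, otherwise None."""
    return KNOWN.get(name.lower())

# Rows that belong to an entity; the other rows are left without a gender.
has_entity = df["entity"].notna()

# First word of each entity in original row order, indexed by entity.
first_words = df[has_entity].drop_duplicates("entity").set_index("entity")["word"]

# Apply the function once per entity, then broadcast it to all rows of that entity.
gender_by_entity = first_words.map(guess_gender)
df["gender"] = df["entity"].map(gender_by_entity)

print(df)
```

To use the real detector, guess_gender could be replaced with something like lambda w: d.get_gender(w.capitalize()). Passing how="left" to pd.merge in the code above should likewise keep the rows whose entity is None, leaving their gender missing.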

