Pandas groupby where at least 1 of 2 columns matches

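Problem description

I have a pandas df with a name column and two columns representing 2 possible birth years. I want to group by name and birth year whenever at least one of the birth year columns matches.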

FullName     BirthYr1  BirthYr2
Smith, Joe   1985      1986
Dolan, Tom   1991      1992
Smith, Alex  1984      1985
Smith, Joe   1984      1985
Dolan, Tom   1991      1992
Smith, Alex  1986      1987

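BirthYr2 is always 1 greater than BirthYr1.

The 2 'Smith, Joe' rows would be grouped because they both contain 1985 (1 match), the 2 'Dolan, Tom' rows would be grouped because both columns are identical (2 matches), and the 2 'Smith, Alex' rows would not be grouped because they have no year in common.

Once I figure this out, I plan to use ngroup() to assign a unique ID to each group.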

Tags: python, pandas

Solution


This feels overly complicated, but I think it does what you are looking for. Assume your starting DataFrame is named df.

# "Melt" the birth year columns such that each value is given its own
# row. Throw away the redundant column names BirthYr1 and BirthYr2, 
# since their values are equally important to us.

melted = df.melt(id_vars='FullName', value_name='BirthYr').drop(columns='variable')
melted
       FullName  BirthYr
0    Smith, Joe     1985
1    Dolan, Tom     1991
2   Smith, Alex     1984
3    Smith, Joe     1984
4    Dolan, Tom     1991
5   Smith, Alex     1986
6    Smith, Joe     1986
7    Dolan, Tom     1992
8   Smith, Alex     1985
9    Smith, Joe     1985
10   Dolan, Tom     1992
11  Smith, Alex     1987


# Group by fullname, then birth year.

grouped = melted.groupby(['FullName', 'BirthYr']).size() 
grouped
FullName     BirthYr
Dolan, Tom   1991       2
             1992       2
Smith, Alex  1984       1
             1985       1
             1986       1
             1987       1
Smith, Joe   1984       1
             1985       2
             1986       1
dtype: int64


# Any group with more than one member represents a match.

grouped[grouped > 1].reset_index()['FullName'].unique()
array(['Dolan, Tom', 'Smith, Joe'], dtype=object)
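
The result above only lists the names that have a match. To get back to the original rows and assign IDs with ngroup(), as the question plans, one possible sketch is shown here. It rebuilds the sample data so it runs on its own. The column names SharedYr and GroupID are made up for illustration, and using the smallest shared year as the group key is an assumption: it works for this sample, but it does not merge chains of overlapping rows (for example 1984/1985, 1985/1986, 1986/1987 under one name).

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "FullName": ["Smith, Joe", "Dolan, Tom", "Smith, Alex",
                 "Smith, Joe", "Dolan, Tom", "Smith, Alex"],
    "BirthYr1": [1985, 1991, 1984, 1984, 1991, 1986],
    "BirthYr2": [1986, 1992, 1985, 1985, 1992, 1987],
})

# Melt as before, but keep the original row label in an "index" column
# so each (FullName, BirthYr) pair can be traced back to its row.
melted = (
    df.reset_index()
      .melt(id_vars=["index", "FullName"],
            value_vars=["BirthYr1", "BirthYr2"],
            value_name="BirthYr")
      .drop(columns="variable")
)

# Count how many distinct original rows share each (FullName, BirthYr) pair.
melted["n_rows"] = (
    melted.groupby(["FullName", "BirthYr"])["index"].transform("nunique")
)
shared = melted[melted["n_rows"] > 1]

# Assumption: use the smallest shared year as the grouping key for a row.
shared_year = shared.groupby("index")["BirthYr"].min()
df["SharedYr"] = shared_year.reindex(df.index)  # NaN where nothing matched

# Number the groups on matched rows only; unmatched rows keep NaN.
matched = df.dropna(subset=["SharedYr"])
group_id = matched.groupby(["FullName", "SharedYr"]).ngroup()
df["GroupID"] = group_id.reindex(df.index)

print(df)
```

Calling ngroup() only on the matched subset avoids depending on how a given pandas version labels rows whose group key is missing. With this sample, the two 'Smith, Joe' rows share one ID, the two 'Dolan, Tom' rows share another, and both 'Smith, Alex' rows are left without an ID.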
