Keeping the most recent records per group in a DataFrame

Problem description

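I'm trying to clean some data: when a method appears more than once, I need to keep only its most recent records, but all of them. What confuses me is that the data is organized in "groups". Below is a sample DataFrame, with comments that may make this clearer: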

     method  year proteins  values
0      John  2017        A      10
1      John  2017        B      20
2      John  2018        A      30 # John's method in 2018 is most recent, keep this line and drop index 0 and 1
3      Kate  2018        B      11
4      Kate  2018        C      22 # Kate's method appears only in 2018 so keep both lines (index 3 and 4)
5   Patrick  2017        A      90
6   Patrick  2018        A      80
7   Patrick  2018        B      85
8   Patrick  2018        C      70
9   Patrick  2019        A      60
10  Patrick  2019        C      50 # Patrick's method in 2019 is the most recent of Patrick's so keep index 9 and 10 only

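So the desired output DataFrame does not depend on which proteins were measured, but it should include every protein measured in the most recent year of each method: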

     method  year proteins  values
0      John  2018        A      30
1      Kate  2018        B      11
2      Kate  2018        C      22
3   Patrick  2019        A      60
4   Patrick  2019        C      50

I hope this is clear. I have tried something like my_df.sort_values('year').drop_duplicates('method', keep='last'), but it gives the wrong output. Any ideas? Thanks!

PS: to reproduce my initial df, you can copy the following lines:

import pandas as pd
import numpy as np

methodology=["John", "John", "John", "Kate", "Kate", "Patrick", "Patrick", "Patrick", "Patrick", "Patrick", "Patrick"]
year_pract=[2017, 2017, 2018, 2018, 2018, 2017, 2018, 2018, 2018, 2019, 2019]
proteins=['A', 'B', 'A', 'B', 'C', 'A', 'A', 'B', 'C', 'A', 'C']
values=[10, 20, 30, 11, 22, 90, 80, 85, 70, 60, 50]
my_df=pd.DataFrame(list(zip(methodology,year_pract,proteins,values)), columns=['method','year','proteins','values'])

my_df['year']=my_df['year'].astype(str)
my_df['year']=pd.to_datetime(my_df['year'], format='%Y') # the format never works for me and this is why I add the line below
my_df['year']=my_df['year'].dt.year

Tags: python, pandas, dataframe

Solution


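Because the duplicate rows need to be kept, use GroupBy.transform with max to get the latest year per method, compare it against the original year column with Series.eq for equality, and filter with boolean indexing: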

df = my_df[my_df['year'].eq(my_df.groupby('method')['year'].transform('max'))]
print (df)

     method  year proteins  values
2      John  2018        A      30
3      Kate  2018        B      11
4      Kate  2018        C      22
9   Patrick  2019        A      60
10  Patrick  2019        C      50
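
The one-liner above can be broken into its intermediate steps to show what each part contributes. This is a minimal sketch that rebuilds the question's sample data from a dict; the names latest_year, mask, result and attempt are introduced here for illustration only. It also reruns the attempt from the question for comparison.

```python
import pandas as pd

# Same sample data as in the question, built from a dict for brevity.
my_df = pd.DataFrame({
    "method": ["John", "John", "John", "Kate", "Kate",
               "Patrick", "Patrick", "Patrick", "Patrick", "Patrick", "Patrick"],
    "year": [2017, 2017, 2018, 2018, 2018, 2017, 2018, 2018, 2018, 2019, 2019],
    "proteins": ["A", "B", "A", "B", "C", "A", "A", "B", "C", "A", "C"],
    "values": [10, 20, 30, 11, 22, 90, 80, 85, 70, 60, 50],
})

# Step 1: for every row, the latest year within that row's method group.
# transform returns a Series aligned with my_df's index (one value per row).
latest_year = my_df.groupby("method")["year"].transform("max")

# Step 2: boolean mask, True where the row's year equals its group's latest year.
mask = my_df["year"].eq(latest_year)

# Step 3: boolean indexing keeps every row from the latest year of each method.
result = my_df[mask]
print(result)

# For comparison, the attempt from the question: drop_duplicates on 'method'
# keeps exactly one row per method, so rows sharing the latest year are lost.
attempt = my_df.sort_values("year").drop_duplicates("method", keep="last")
print(attempt)
```

The difference comes down to row counts: drop_duplicates('method') can never return more than one row per method, whereas the mask keeps as many rows as share the group's maximum year. The original index labels are preserved; append .reset_index(drop=True) to result if a fresh 0-based index is wanted, as in the desired output shown in the question.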
