Keep the row with the highest value among duplicates across different columns

Problem description

I have a pandas DataFrame like the one below, in which several rows can share the same lon and lat combination:

Initial df:

   lon  lat         name  value protection      a      b         c  score
0   20   10       canada    563        NaN    cat    dog  elephant   20.0
1   30   10       canada     65        NaN   lion  tiger       cat   30.0
2   40   20       canada    893        NaN    dog    NaN       NaN   20.0
3   40   20          usa      4        NaN  horse  horse      lion   40.0
4   45   15          usa   8593        NaN    NaN   lion       cat   10.0
5   20   10  protection1    100     medium    NaN    NaN       NaN    NaN
6   40   20  protection1     20       high    NaN    NaN       NaN    NaN
7   50   30  protection1    500        low    NaN    NaN       NaN    NaN
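
To make the example reproducible, here is a minimal sketch that rebuilds the frame above. The construction is reconstructed from the printed table and is not part of the original post; it assumes the empty cells are NaN.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the printed table; NaN marks missing cells.
df = pd.DataFrame(
    {
        "lon": [20, 30, 40, 40, 45, 20, 40, 50],
        "lat": [10, 10, 20, 20, 15, 10, 20, 30],
        "name": ["canada", "canada", "canada", "usa", "usa",
                 "protection1", "protection1", "protection1"],
        "value": [563, 65, 893, 4, 8593, 100, 20, 500],
        "protection": [np.nan, np.nan, np.nan, np.nan, np.nan,
                       "medium", "high", "low"],
        "a": ["cat", "lion", "dog", "horse", np.nan, np.nan, np.nan, np.nan],
        "b": ["dog", "tiger", np.nan, "horse", "lion", np.nan, np.nan, np.nan],
        "c": ["elephant", "cat", np.nan, "lion", "cat", np.nan, np.nan, np.nan],
        "score": [20.0, 30.0, 20.0, 40.0, 10.0, np.nan, np.nan, np.nan],
    }
)

# Rows that share a (lon, lat) pair with at least one other row.
dupes = df[df.duplicated(["lon", "lat"], keep=False)]
```

Here `dupes` holds the rows for (20, 10) and (40, 20), the two coordinate pairs that occur more than once.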

But what I want is:

Desired output:

   lon  lat protection      a      b         c  score
0   20   10     medium    cat    dog  elephant   20.0
1   30   10        NaN   lion  tiger       cat   30.0
2   40   20       high  horse  horse      lion   40.0
3   45   15        NaN    NaN   lion       cat   10.0
4   50   30        low    NaN    NaN       NaN    NaN

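The output DataFrame should contain one row per unique combination of the lon and lat columns, keeping only the row with the highest score. If a lon/lat combination has duplicates and one of them has a value in the protection column, that value should be merged into the single remaining row.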

Tags: pandas, duplicates, rows

Solution

Try:

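import numpy as np

# Sort so the highest score comes first in each group (NaN scores sort last).
df = df.sort_values(by="score", ascending=False)
g = df.groupby(["lon", "lat"])
df_out = (
    g.first()  # first non-null value per column within each (lon, lat) group
    .assign(
        # Join all non-null protection values; groups with none become NaN.
        protection=g["protection"]
        .agg(lambda x: ",".join(x.dropna()))
        .replace("", np.nan)
    )
    .reset_index()
)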

print(df_out)

Prints:

   lon  lat         name  value protection      a      b         c  score
0   20   10       canada    563     medium    cat    dog  elephant   20.0
1   30   10       canada     65        NaN   lion  tiger       cat   30.0
2   40   20          usa      4       high  horse  horse      lion   40.0
3   45   15          usa   8593        NaN    NaN   lion       cat   10.0
4   50   30  protection1    500        low    NaN    NaN       NaN    NaN
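
The output above still carries the name and value columns, which the desired output in the question does not have. Below is a hedged sketch of the same approach with those two columns dropped. The sample frame is rebuilt from the question's table, and the helper name `join_non_null` is made up for this illustration.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question's table.
df = pd.DataFrame(
    {
        "lon": [20, 30, 40, 40, 45, 20, 40, 50],
        "lat": [10, 10, 20, 20, 15, 10, 20, 30],
        "name": ["canada", "canada", "canada", "usa", "usa",
                 "protection1", "protection1", "protection1"],
        "value": [563, 65, 893, 4, 8593, 100, 20, 500],
        "protection": [np.nan, np.nan, np.nan, np.nan, np.nan,
                       "medium", "high", "low"],
        "a": ["cat", "lion", "dog", "horse", np.nan, np.nan, np.nan, np.nan],
        "b": ["dog", "tiger", np.nan, "horse", "lion", np.nan, np.nan, np.nan],
        "c": ["elephant", "cat", np.nan, "lion", "cat", np.nan, np.nan, np.nan],
        "score": [20.0, 30.0, 20.0, 40.0, 10.0, np.nan, np.nan, np.nan],
    }
)


def join_non_null(values):
    """Hypothetical helper: comma-join non-null values, NaN if there are none."""
    parts = values.dropna().astype(str)
    return ",".join(parts) if len(parts) else np.nan


# Highest score first; rows with a NaN score go to the end by default.
ordered = df.sort_values("score", ascending=False)
grouped = ordered.groupby(["lon", "lat"])

result = (
    grouped.first()  # first non-null value per column in each group
    .assign(protection=grouped["protection"].agg(join_non_null))
    .drop(columns=["name", "value"])  # not part of the desired output
    .reset_index()
)

print(result)
```

One thing to keep in mind: `GroupBy.first()` returns the first non-null value per column, not the first row as a whole. If the top-scoring row of a group has a missing cell, that cell may be filled from a lower-scoring row in the same group. With the sample data this does not change the result, but it can matter on other inputs.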
