Pandas: select rows with the highest value in one column, grouped by another column

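Problem description

I have a dataset, and I want the highest score for each "topic" within each "path". Basically, I am trying to get each student's top score.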

import pandas as pd

l1 = ['/some/path/A','/some/path/A','/some/path/A','/some/path/2','/some/path/2']
l2 = ['paper1','paper1','paper1','paper2','paper2']
l3 = ['Tree','Tree','Axe','Leaf','Leaf']
l4 = [.68,.90,.80,.65,.30]
df = pd.DataFrame(
    {'path': l1,
     'paper_name': l2,
     'topic': l3,
     'score':l4
    })
df

Current df

path            paper_name      topic   score
/some/path/A    paper1          Tree    0.68
/some/path/A    paper1          Tree    0.90
/some/path/A    paper1          Axe     0.80
/some/path/2    paper2          Leaf    0.65
/some/path/2    paper2          Leaf    0.30

I tried this logic:

df[df['topic']=='Tree' & df['score'].max()] # Hard-coded. I need something simpler that is not hard-coded the way I am doing it here. It also errors out because of data types. How do I get this?

Desired output:

path            paper_name      topic   score
/some/path/A    paper1          Tree    0.90
/some/path/A    paper1          Axe     0.80
/some/path/2    paper2          Leaf    0.65

Tags: pandas

Solution


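You can use sort_values and drop_duplicates. Sorting by score in ascending order puts the highest score for each topic last, so keep='last' retains that row: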

df_final = df.sort_values('score').drop_duplicates('topic', keep='last').sort_index()

Out[76]:
  paper_name          path  score topic
1     paper1  /some/path/A   0.90  Tree
2     paper1  /some/path/A   0.80   Axe
3     paper2  /some/path/2   0.65  Leaf
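
The call above deduplicates on topic alone, which is enough for the sample data because no topic appears under more than one path. If the same topic could occur under several paths, a minimal sketch, assuming the intent is one row per (path, topic) pair, passes both columns as the subset. The name best_per_path_topic is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'path': ['/some/path/A', '/some/path/A', '/some/path/A',
             '/some/path/2', '/some/path/2'],
    'paper_name': ['paper1', 'paper1', 'paper1', 'paper2', 'paper2'],
    'topic': ['Tree', 'Tree', 'Axe', 'Leaf', 'Leaf'],
    'score': [.68, .90, .80, .65, .30],
})

# Sort ascending by score so the highest score in each group comes last,
# keep that last row per (path, topic) pair (assumed grouping),
# then restore the original row order.
best_per_path_topic = (
    df.sort_values('score')
      .drop_duplicates(['path', 'topic'], keep='last')
      .sort_index()
)
print(best_per_path_topic)
```

On the sample data this selects the same three rows as deduplicating on topic alone.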

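Or use groupby with idxmax to get the index label of the top-scoring row for each topic, then select those rows with loc: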

df_final = df.loc[df.groupby('topic', sort=False).score.idxmax()]

Out[82]:
  paper_name          path  score topic
1     paper1  /some/path/A   0.90  Tree
2     paper1  /some/path/A   0.80   Axe
3     paper2  /some/path/2   0.65  Leaf
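
As for the attempt in the question: in Python, & binds more tightly than ==, so 'Tree' & df['score'].max() is evaluated first and fails on a string and a float. In addition, df['score'].max() is a single number, not a row-wise condition. A hedged sketch of a boolean mask that is not hard-coded, assuming the grouping should be by both path and topic and that tied top scores should all be kept, uses groupby with transform('max'). The names group_max and df_top are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'path': ['/some/path/A', '/some/path/A', '/some/path/A',
             '/some/path/2', '/some/path/2'],
    'paper_name': ['paper1', 'paper1', 'paper1', 'paper2', 'paper2'],
    'topic': ['Tree', 'Tree', 'Axe', 'Leaf', 'Leaf'],
    'score': [.68, .90, .80, .65, .30],
})

# Broadcast each group's maximum score back onto every row of that group.
# Grouping by both 'path' and 'topic' is an assumption about the intent.
group_max = df.groupby(['path', 'topic'])['score'].transform('max')

# Keep rows whose score equals their group's maximum; ties are all retained.
df_top = df[df['score'] == group_max]
print(df_top)
```

Unlike drop_duplicates and idxmax, which return one row per group, this mask returns every row that ties for the maximum.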
