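python-3.x - Delete entire rows in a pandas DataFrame based on duplicate values in one column
Problem description
I have uploaded the data as a .csv file at the following link
In this file, I have the following columns: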
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
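The Team column contains duplicates. Another column is SimStage, which holds a range of values from 0 to N (0 to 4 in this case).
I want to keep one row per team at each SimStage value (i.e. the rest are removed). When removing, for each team and SimStage, the duplicate rows with the lower value in the Points column should be dropped. Since this is a bit hard to explain in words alone, I am attaching an image here.
In this image, the rows highlighted with red boxes are the ones to be deleted.
I used df.duplicates(), but it doesn't work.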
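Solution
It looks like you only want to keep the row with the highest value in the 'Points' column for each Team and SimStage. One way is to sort so that the highest Points come first, then use the first aggregation function in pandas.
Create the dataframe and call it df
import pandas as pd
data = {'Team': {0: 'Brazil', 1: 'Brazil', 2: 'Brazil', 3: 'Brazil', 4: 'Brazil', 5: 'Brazil', 6: 'Brazil', 7: 'Brazil', 8: 'Brazil', 9: 'Brazil'},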
'Group': {0: 'Group E', 1: 'Group E', 2: 'Group E', 3: 'Group E', 4: 'Group E', 5: 'Group E', 6: 'Group E', 7: 'Group E', 8: 'Group E', 9: 'Group E'},
'Model': {0: 'ELO', 1: 'ELO', 2: 'ELO', 3: 'ELO', 4: 'ELO', 5: 'ELO', 6: 'ELO', 7: 'ELO', 8: 'ELO', 9: 'ELO'},
'SimStage': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4},
'Points': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 1, 6: 2, 7: 4, 8: 4, 9: 1},
'GpWinner': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.0, 6: 0.2, 7: 0.2, 8: 0.2, 9: 0.0},
'GpRunnerup': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.2, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.2},
'3rd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
'4th': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}}
df = pd.DataFrame(data)
# To be able to output the dataframe in your original order
columns_order = ['Team', 'Group', 'Model', 'SimStage', 'Points', 'GpWinner', 'GpRunnerup', '3rd', '4th']
Method 1
# Sort by 'SimStage' ascending and 'Points' descending in a single call,
# so the second sort cannot disturb the ordering produced by the first
df = df.sort_values(['SimStage', 'Points'], ascending=[True, False])
# Group the columns by the necessary columns
df = df.groupby(['Team', 'SimStage'], as_index=False).agg('first')
# Output the dataframe in the original column order
df[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
1 Brazil Group E ELO 1 4 0.2 0.0 0 0
2 Brazil Group E ELO 2 4 0.2 0.0 0 0
3 Brazil Group E ELO 3 4 0.2 0.0 0 0
4 Brazil Group E ELO 4 4 0.2 0.0 0 0
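One caveat with Method 1: the groupby first aggregation takes the first non-null value per column, so if the data had missing values, a result row could combine values from different original rows. The sketch below shows an alternative that selects whole rows via idxmax. The toy frame and the names toy, best_idx and best_rows are made up for illustration and are not the asker's CSV.

```python
import pandas as pd

# Hypothetical toy data (an assumption, not the asker's file)
toy = pd.DataFrame({
    'Team': ['Brazil', 'Brazil', 'Brazil', 'Brazil'],
    'SimStage': [0, 0, 1, 1],
    'Points': [4, 1, 2, 4],
    'GpWinner': [0.2, 0.0, 0.2, 0.2],
})

# Index label of the highest-Points row within each (Team, SimStage) group
best_idx = toy.groupby(['Team', 'SimStage'])['Points'].idxmax()

# Select those rows whole, so every column comes from the same original row
best_rows = toy.loc[best_idx].reset_index(drop=True)
print(best_rows)
```

On ties, idxmax returns the label of the first occurrence of the maximum, so exactly one row per group is kept.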
Method 2
df.sort_values('Points', ascending=False).drop_duplicates(['Team', 'SimStage'])[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
2 Brazil Group E ELO 1 4 0.2 0.0 0 0
4 Brazil Group E ELO 2 4 0.2 0.0 0 0
7 Brazil Group E ELO 3 4 0.2 0.0 0 0
8 Brazil Group E ELO 4 4 0.2 0.0 0 0
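As a hedged variation on Method 2, the sketch below asks for a stable sort so that rows with equal Points keep their original relative order, and calls sort_index afterwards to put the surviving rows back in their original row order. The toy frame and the names toy and deduped are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data (an assumption, not the asker's file)
toy = pd.DataFrame({
    'Team': ['Brazil'] * 6,
    'SimStage': [0, 0, 1, 1, 2, 2],
    'Points': [4, 4, 1, 4, 2, 4],
})

deduped = (
    toy.sort_values('Points', ascending=False, kind='stable')  # highest Points first
       .drop_duplicates(['Team', 'SimStage'])                  # keep first row per group
       .sort_index()                                           # restore original row order
)
print(deduped)
```

Whether the stable sort matters depends on the data: it only affects which of several tied rows survives, not the Points value that is kept.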