python - How do I compute a grouped, weighted aggregation of a column in pandas?

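Problem description

I am primarily a JS developer, and I am trying out pandas to do some data analysis. Part of that analysis involves converting a team's match results (win/loss) into a numeric rating based on win percentage.

TL;DR: I am trying to get from DF 1 to DF 3.

DF 1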

|   season  | opponent  |   outcome |
-------------------------------------
|   2020    |   A       |   w       |
|   2020    |   A       |   l       |
|   2020    |   B       |   w       |
|   2020    |   B       |   w       |
|   2020    |   C       |   l       |
|   2020    |   C       |   l       |
|   2021    |   A       |   w       |
|   2021    |   A       |   w       |
|   2021    |   B       |   w       |
|   2021    |   B       |   l       |
|   2021    |   C       |   w       |
|   2021    |   C       |   w       |

I need to compute the win percentage grouped by season and opponent.

DF 2

|   season  | opponent  |  win %    |
-------------------------------------
|   2020    |   A       |   50      |
|   2020    |   B       |   100     |
|   2020    |   C       |   0       |
|   2021    |   A       |   100     |
|   2021    |   B       |   50      |
|   2021    |   C       |   100     |
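One possible way to get from DF 1 to DF 2 is to turn `outcome` into a boolean "is a win" column and take its mean per group. The sketch below rebuilds DF 1 from the table above; the names `df1`, `df2` and `win` are only illustrative.

```python
import pandas as pd

# DF 1, rebuilt from the table in the question.
df1 = pd.DataFrame({
    "season": [2020] * 6 + [2021] * 6,
    "opponent": ["A", "A", "B", "B", "C", "C"] * 2,
    "outcome": ["w", "l", "w", "w", "l", "l", "w", "w", "w", "l", "w", "w"],
})

# The mean of a boolean column is the fraction of True values,
# so the mean of "is a win" per group is the win rate.
df2 = (
    df1.assign(win=df1["outcome"].eq("w"))
       .groupby(["season", "opponent"])["win"]
       .mean()
       .mul(100)          # fraction -> percentage
       .rename("win %")
       .reset_index()
)
print(df2)
```

Because losses simply count as `False`, groups without any win (such as 2020 against C) still appear with a win percentage of 0.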

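After that, we need to compute a rating for each season. This is done by averaging the win percentages against all teams in the same season, with the caveat that the win percentage against team A counts twice as much as the others. This is just an arbitrary formula; the real calculation is more complex (different opponents have different weights, and I need a way to pass that in as part of a custom lambda function or something similar), but I have tried to simplify things for this question.

DF 3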

|   season  |   rating  |
-------------------------
|   2020    |   50.0    |
|   2021    |   87.5    |

Rating calculation example: 2020 season rating = (win % vs. team A * 2 + win % vs. team B + win % vs. team C) / (total number of teams + 1) = (50% * 2 + 100% + 0%) / (3 + 1) = 50.0
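The formula above can be read as a weighted average in which team A has weight 2 and the other teams have weight 1, so the divisor "total number of teams + 1" equals the sum of the weights. A small plain-Python check of that reading, with a hypothetical helper `season_rating` and an assumed `weights` mapping:

```python
def season_rating(win_pct, weights):
    """Weighted average of win percentages, normalized by the sum of the weights."""
    total_weight = sum(weights[team] for team in win_pct)
    weighted_sum = sum(pct * weights[team] for team, pct in win_pct.items())
    return weighted_sum / total_weight


# Assumed weights: team A counts double, B and C count once.
weights = {"A": 2, "B": 1, "C": 1}

rating_2020 = season_rating({"A": 50, "B": 100, "C": 0}, weights)
rating_2021 = season_rating({"A": 100, "B": 50, "C": 100}, weights)
print(rating_2020, rating_2021)  # 50.0 87.5
```

Both values agree with the ratings listed in DF 3.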

How can we get from the first dataframe to the last one using pandas? I can get a version of DF 2 with the following:

df2 = df1.groupby(["season", "opponent"])["outcome"].value_counts(normalize = True).to_frame()

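This frame also includes the loss percentages, which I do not need, but that does not matter as long as I can filter or drop them as part of the "conversion" to DF 3.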

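I have been trying things like df2 = df2[df2["outcome"] != "w"] or df2 = df2.query('outcome != "w"'), based on answers to another question, to drop the extra rows for the loss outcome, but to no avail. I suspect this is because outcome is a nested column. I also noticed this question, but I think what I need is a "wildcard" to access the nested outcome column regardless of opponent.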
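In the result of `value_counts`, the outcome labels sit in the last level of the row MultiIndex rather than in a regular column, which is probably why column-style filters do not behave as expected. One way around this, sketched below, is to unstack that last level into columns and pick the `"w"` column; `shares` and `win_pct` are illustrative names, and the sketch assumes at least one win exists somewhere in the data so that a `"w"` column is produced.

```python
import pandas as pd

# DF 1, rebuilt from the table in the question.
df1 = pd.DataFrame({
    "season": [2020] * 6 + [2021] * 6,
    "opponent": ["A", "A", "B", "B", "C", "C"] * 2,
    "outcome": ["w", "l", "w", "w", "l", "l", "w", "w", "w", "l", "w", "w"],
})

# Move the outcome level of the index into columns ("l" and "w").
# fill_value=0 covers groups where one of the outcomes never occurs.
shares = (
    df1.groupby(["season", "opponent"])["outcome"]
       .value_counts(normalize=True)
       .unstack(fill_value=0)
)

# Keep only the win share, expressed as a percentage.
win_pct = shares["w"] * 100
print(win_pct)
```

Selecting the win rows directly from the stacked result would silently lose groups with no wins (for example 2020 against C), which is why the sketch unstacks with a fill value first.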

Note: if there is a more efficient way to go directly from DF 1 to DF 3 (this one looks close but is not quite it), I am happy to explore that as well.
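A possible direct route from DF 1 to DF 3 is to pivot the win rates into a season-by-opponent table and combine the columns with a weight per opponent. This is only a sketch: the `weights` Series is an assumption matching the simplified formula (A counts double), and it assumes every season has games against every opponent, since a missing opponent would leave a NaN in the table and distort the normalization.

```python
import pandas as pd

# DF 1, rebuilt from the table in the question.
df1 = pd.DataFrame({
    "season": [2020] * 6 + [2021] * 6,
    "opponent": ["A", "A", "B", "B", "C", "C"] * 2,
    "outcome": ["w", "l", "w", "w", "l", "l", "w", "w", "w", "l", "w", "w"],
})

# Assumed per-opponent weights, indexed by opponent name.
weights = pd.Series({"A": 2, "B": 1, "C": 1})

# Rows: seasons, columns: opponents, values: win percentage.
wide = (
    df1.assign(win=df1["outcome"].eq("w").astype(float))
       .pivot_table(index="season", columns="opponent", values="win", aggfunc="mean")
       .mul(100)
)

# Weight each opponent column, sum across the row, normalize by the total weight.
rating = wide.mul(weights, axis=1).sum(axis=1) / weights.sum()
df3 = rating.rename("rating").reset_index()
print(df3)
```

More elaborate weighting schemes could be expressed by changing how `weights` is built, as long as its index lines up with the opponent columns.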

Tags: python, pandas, dataframe

Solution

import pandas as pd

df_test = pd.DataFrame(data={'season':[2020]*6 + [2021]*6, 'opponent': ['A', 'A', 'B', 'B', 'C', 'C']*2,
                        'outcome': ['w', 'l', 'w', 'w', 'l', 'l', 'w', 'w', 'w', 'l', 'w', 'w']})

df_weightage = pd.DataFrame(data={'season':[2020]*3 + [2021]*3, 'opponent': ['A', 'B', 'C']*2,
                        'weightage': [0.2, 0.3, 0.5, 0.1, 0.2, 0.7]})

print(df_test)
print('='*30)
print(df_weightage)
print('='*35)

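def get_pct(data):
    # Fraction of games in the group that were wins ('w'), between 0 and 1.
    return len(data[data == 'w'])/len(data)

def get_rating(data):
    # Sum of win_percentage * weightage over the group's rows,
    # divided by the number of rows in the group.
    return sum(data['win_percentage']*data['weightage'])/len(data)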

df_test = df_test.groupby(["season", "opponent"])["outcome"].apply(get_pct).rename('win_percentage').reset_index()
print(df_test)
print('='*45)

df_test = df_test.merge(df_weightage, how= 'left', on=['season', 'opponent'])
print(df_test)
print('='*45)

df_ratings = df_test.groupby(['season'])[['win_percentage', 'weightage']].apply(get_rating).rename('ratings').reset_index()
print(df_ratings)
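Note that `get_rating` above divides by the number of rows and uses its own illustrative weights, so its output will not match DF 3 from the question. A variant that should reproduce DF 3 is sketched below, assuming weights of 2 for A and 1 for B and C and normalizing by the sum of the weights via `numpy.average`; the frame `merged` stands in for the merged `df_test`, and `get_weighted_rating` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Stand-in for the merged frame: win fractions per season/opponent,
# with assumed weights (A counts double) instead of the answer's values.
merged = pd.DataFrame({
    "season": [2020] * 3 + [2021] * 3,
    "opponent": ["A", "B", "C"] * 2,
    "win_percentage": [0.5, 1.0, 0.0, 1.0, 0.5, 1.0],
    "weightage": [2, 1, 1] * 2,
})

def get_weighted_rating(data):
    # Weighted mean of the win fractions, normalized by the sum of the weights,
    # then scaled to a percentage.
    return np.average(data["win_percentage"], weights=data["weightage"]) * 100

df_ratings = (
    merged.groupby("season")[["win_percentage", "weightage"]]
          .apply(get_weighted_rating)
          .rename("rating")
          .reset_index()
)
print(df_ratings)
```

Any other per-season formula could be plugged in by replacing the body of `get_weighted_rating`, since it receives the group's rows as a DataFrame.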
