首页 > 解决方案 > 计算由两列过滤的平均数据框

问题描述

我有这个数据框

    Unnamed: 0  Datetime    HomeTeam    AwayTeam    Ball PossessionMatch_H  Ball PossessionMatch_A
0   0   2021-05-24 02:30:00 U. De Chile Everton                         68      32
1   1   2021-05-23 21:00:00 Huachipato  Colo Colo                       48      52
2   2   2021-05-23 18:30:00 Melipilla   Antofagasta                     47      53
3   3   2021-05-23 02:30:00 U. Espanola U. Catolica                     37      63
4   4   2021-05-23 00:00:00 S. Wanderers    O'Higgins                   29      71
... ... ... ... ... ... ...
57  57  2021-03-28 15:45:00 Palestino   Antofagasta                     58      42
58  58  2021-03-28 01:00:00 U. Espanola S. Wanderers                    50      50
59  59  2021-03-27 22:30:00 Colo Colo   Union La Calera                 58      42
60  60  2021-03-27 20:00:00 Everton O'Higgins                           54      46
61  61  2021-03-27 15:00:00 Curico Unido    Melipilla                   41      59

我想将其拆分为多个数据帧并在“HomeTeam”和“AwayTeam”中应用两个标准,然后计算 Ball Possession 的平均值并将其放入新列“Ball PossessionMatch_H/MP”中(如果球队在“HomeTeam”中)和“Ball PossessionMatch_A/MP”,如果球队在“AwayTeam”

代码:

teams = pd.unique(df[['HomeTeam', 'AwayTeam']].values.ravel('K'))
df_list3 = []
for team in teams:
    if team in df['HomeTeam']:
        current_df = df[(df.HomeTeam == team) | (df.AwayTeam == team)].copy()
        current_df["Ball PossessionMatch_H/MP"] = current_df.apply(
            lambda x: (
                current_df[(current_df.HomeTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_H"].sum()
                + current_df[(current_df.AwayTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_A"].sum()
            )
            / (
                current_df[(current_df.HomeTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_H"].size
                + current_df[(current_df.AwayTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_A"].size
            ),
            axis=1,
        )
    else:
        current_df = df[(df.HomeTeam == team) | (df.AwayTeam == team)].copy()
        current_df["Ball PossessionMatch_A/MP"] = current_df.apply(
            lambda x: (
                current_df[(current_df.HomeTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_H"].sum()
                + current_df[(current_df.AwayTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_A"].sum()
            )
            / (
                current_df[(current_df.HomeTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_H"].size
                + current_df[(current_df.AwayTeam == team)].loc[x.name + 1 :, "Ball PossessionMatch_A"].size
            ),
            axis=1,
        )
        df_list3.append(current_df)

df = pd.concat(df_list3)
print(df)

它不考虑if并仅创建具有计算值的“Ball PossessionMatch_A/MP”列并复制行,我会将append结果复制到原始数据框。

预期结果:

    Unnamed: 0  Datetime    HomeTeam    AwayTeam    Ball PossessionMatch_H  Ball PossessionMatch_A  Ball PossessionMatch_H/MP   Ball PossessionMatch_A/MP
0   0   2021-05-24 02:30:00 U. De Chile Everton               68           32         50.33      43.5

标签: pythonpandasdataframe

解决方案


下面的呢?

hometeam_count = df.groupby("HomeTeam")["Ball PossessionMatch_H"].count()
hometeam_sum = df.groupby("HomeTeam")["Ball PossessionMatch_H"].sum()
awayteam_count = df.groupby("AwayTeam")["Ball PossessionMatch_A"].count()
awayteam_sum = df.groupby("AwayTeam")["Ball PossessionMatch_A"].sum()

df["Ball PossessionMatch_H/MP"] = df["HomeTeam"].apply(lambda x: ((hometeam_sum.loc[x] if x in hometeam_sum.index else 0) + (awayteam_sum.loc[x] if x in awayteam_sum.index else 0)) / ((hometeam_count.loc[x] if x in hometeam_count.index else 0)  + (awayteam_count.loc[x] if x in awayteam_count.index else 0)))
df["Ball PossessionMatch_A/MP"] = df["AwayTeam"].apply(lambda x: ((hometeam_sum.loc[x] if x in hometeam_sum.index else 0) + (awayteam_sum.loc[x] if x in awayteam_sum.index else 0)) / ((hometeam_count.loc[x] if x in hometeam_count.index else 0)  + (awayteam_count.loc[x] if x in awayteam_count.index else 0)))

上面的代码利用该groupby函数来计算每个团队的 和sum以及单独的。 代码的第二部分基本上是分别查找 ( ) 对应的和值,并按预期执行平均计算。countBall PossessionMatch_ABall PossessionMatch_H
locsumcountHomeTeamAwayTeam

正如你所知道的,两者的代码HomeTeam几乎AwayTeam相同,也被清理了,这样我们就不会重复你自己了。我这样做纯粹是为了便于阅读。

尽管这并不能解决问题(至于为什么没有在您创建的列中填写值),但我提供了该问题的替代解决方案,我认为该解决方案更直观且更具可读性。


推荐阅读