首页 > 解决方案 > 按唯一 ID 分组、应用函数并为下一组更新特定列

问题描述

我有一个如下所示的数据框:


In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'match_id': ['m1', 'm1', 'm1', 'm1', 'm2', 'm2', 'm2', 'm2', 'm3', 'm3', 'm3', 'm3'],
   ...:                     'name':['peter', 'mike', 'jeff', 'john', 'alex', 'joe', 'jeff', 'peter', 'alex', 'peter', '
   ...: joe', 'tom' ],
   ...:                     'rank': [2, 3, 1, 4, 3, 1, 2, 4, 4, 3, 1, 2],
   ...:                     'rating': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]})

In [3]: df
Out[3]:
    match_id    name    rank  rating
0          m1  peter     2     100
1          m1   mike     3     100
2          m1   jeff     1     100
3          m1   john     4     100
4          m2   alex     3     100
5          m2    joe     1     100
6          m2   jeff     2     100
7          m2  peter     4     100
8          m3   alex     4     100
9          m3  peter     3     100
10         m3    joe     1     100
11         m3    tom     2     100

它大约是三场比赛,具有唯一的“match_id”、参与者的姓名、他们在比赛结束时的排名,以及整个数据帧的默认评分手动设置为 100。

我想根据“match_id”对数据进行分组,并分别为每个匹配运行一个函数,但该函数的输出应该用于更新下一匹配的列。

我想使用一个函数来计算每场比赛后球员更新的评分,并将其放在一个名为“updated_rating”的新列中。我厌倦的功能在第一场比赛中看起来像这样:

df = df.loc[df['match_id'] == 'm1']
N = len(df)
df['win_prob'] = 0.0
for i in range(N):
    for j in range(N):
        if i != j:
            df['S'] = (N - df['rank']) / ((N*(N-1))/2)
            df['win_prob'][i] += (1 / (1 + (10 ** ((df['rating'][i] - df['rating'][j])/400))))
            df['normalized_win_prob'] = df['win_prob']/(N*(N-1)/2)
            df['updated_rating'] = round(df['rating'] + (20 * (df['S'] - df['normalized_win_prob'])), 1)

这将在第一场比赛中发挥作用,并根据每个玩家的原始评分计算更新后的评分以及获胜的概率。但是,我无法将其扩展到以下匹配项。

由于一些球员在下一场比赛中重新出现,我想更新他们的评分(基于前一阶段计算的“updated_rating”列),让函数在第二场比赛和之后的第三场比赛中完成这项工作。

因此,例如,第一次匹配计算后的输出将如下所示:


match_id name rank rating  win_prob    S    normalized_win_prob  updated_rating
0   m1  peter   2   100     1.5     0.333333          0.25            101.7
1   m1  mike    3   100     1.5     0.166667          0.25             98.3
2   m1  jeff    1   100     1.5     0.500000          0.25            105.0
3   m1  john    4   100     1.5     0.000000          0.25             95.0

关于如何以有效的方式做到这一点的任何想法?我的原始数据框比这个示例数据框大得多,所以我的解决方案需要高效。

谢谢

标签: pythonpandasdataframepandas-groupbyupdating

解决方案


这将是我的解决方案。由于您的算法必须逐个遍历 match_ids,因此我们for-loop首先需要分组数据。然后要计算win_prob,您必须遍历每一行并计算其在同一场比赛中战胜其他行的相关概率。这不漂亮。虽然想不出更好的方法:(

df = pd.DataFrame({'match_id': ['m1', 'm1', 'm1', 'm1', 'm2', 'm2', 'm2', 'm2', 'm3', 'm3', 'm3', 'm3', 'm4', 'm4', 'm4', 'm4'],
                   'name':['peter', 'mike', 'jeff', 'john', 'alex', 'joe', 'jeff', 'peter', 'alex', 'peter', 'joe', 'tom', 'mike', 'john', 'tom', 'peter'],
                   'rank': [2, 3, 1, 4, 3, 1, 2, 4, 4, 3, 1, 2, 1, 3, 4, 2],
                   'rating': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]})

# Pre-compute variables that don't depend on ratings
df['N'] = df.groupby('match_id')['name'].transform('count')
df['total_comb'] = ((df['N']*(df['N']-1))/2)
df['S'] = (df['N'] - df['rank']) / df['total_comb']

# Initialize win_prob and updated_rating
df['win_prob'] = np.zeros(len(df))
df['updated_rating'] = df['rating']
df['prev_rating'] = df['rating']

grouped = df.groupby('match_id', sort=True)

dfa = pd.DataFrame() #Final results will be stored here
last_names = []
#Loop through the match_ids from m1 to m2, m3. Note you can sort them when use 'groupby'
for name, dfg in grouped:
    dfm = dfg.copy()
    # Update the 'updated_rating' coming from last match_id
    if len(last_names) > 0:
        dfm.drop(columns=['updated_rating'], inplace=True)
        df_last = dfa.loc[dfa['match_id'].isin(last_names),['name', 'updated_rating']]
        df_last.drop_duplicates(subset=['name'], keep='last', inplace=True)
        dfm = dfm.merge(df_last, left_on='name', right_on='name', how='left')
        dfm['prev_rating'] = np.where(np.isnan(dfm['updated_rating']), dfm['rating'], dfm['updated_rating'])

    # Compute current 'updated_rating'
    win_prob = []
    for index, row in dfm.iterrows():
        prob = np.sum(1.0/(1+10**((row['prev_rating'] - dfm['prev_rating'])/400)))-0.5 #subtract 0.5 to account for self
        win_prob.append(prob)

    dfm['win_prob'] = win_prob
    dfm['normalized_win_prob'] = dfm['win_prob']/dfm['total_comb']
    dfm['updated_rating'] = round(dfm['prev_rating'] + (20 * (dfm['S'] - dfm['normalized_win_prob'])), 1)
    last_names.append(name)
    dfa = pd.concat([dfa, dfm], sort=True)

dfa   

输出:

N       S         match_id  name    normalized_win_prob prev_rating   rank  rating  total_comb  updated_rating  win_prob
4   0.333333333     m1      peter   0.25                    100         2   100     6               101.7       1.5
4   0.166666667     m1      mike    0.25                    100         3   100     6               98.3        1.5
4   0.5             m1      jeff    0.25                    100         1   100     6               105         1.5
4   0               m1      john    0.25                    100         4   100     6               95          1.5
4   0.166666667     m2      alex    0.251606926             100         3   100     6               98.3        1.509641559
4   0.5             m2      joe     0.251606926             100         1   100     6               105         1.509641559
4   0.333333333     m2      jeff    0.24681015              105         2   100     6               106.7       1.480860898
4   0               m2      peter   0.249975997             101.7       4   100     6               96.7        1.499855985
4   0               m3      alex    0.251630798             98.3        4   100     6               93.3        1.509784788
4   0.166666667     m3      peter   0.253165649             96.7        3   100     6               95          1.518993896
4   0.5             m3      joe     0.245203608             105         1   100     6               110.1       1.47122165
4   0.333333333     m3      tom     0.249999944             100         2   100     6               101.7       1.499999666
4   0.5             m4      mike    0.249232493             98.3        1   100     6               103.3       1.495394959
4   0.166666667     m4      john    0.252398303             95          3   100     6               93.3        1.514389819
4   0               m4      tom     0.2459709               101.7       4   100     6               96.8        1.475825403
4   0.333333333     m4      peter   0.252398303             95          2   100     6               96.6        1.514389819



推荐阅读