DataFrame shifting data into random columns?

Problem Description

I am using code to shift time-series data that looks something like this:

Year    Player          PTSN    AVGN                               
2018    Aaron Donald    280.60  17.538  
2018    J.J. Watt       259.80  16.238  
2018    Danielle Hunter 237.60  14.850  
2017    Aaron Donald    181.0   12.929  
2016    Danielle Hunter 204.6   12.788

The goal is to turn it into something like this:

                        AVGN   PTSN  AVGN_prev  PTSN_prev
Player          Year                                     
Aaron Donald    2016     NaN    NaN        NaN        NaN
                2017  12.929  181.0        NaN        NaN
                2018  17.538  280.6     12.929      181.0
Danielle Hunter 2016  12.788  204.6        NaN        NaN
                2017   8.325  133.2     12.788      204.6
                2018  14.850  237.6      8.325      133.2
J.J. Watt       2016     NaN    NaN        NaN        NaN
                2017     NaN    NaN        NaN        NaN
                2018  16.238  259.8        NaN        NaN

I am using this code to achieve that:

res = df.set_index(['player', 'Year'])

idx = pd.MultiIndex.from_product([df['player'].unique(), 
                                  df['Year'].unique()],
                                names=['Player', 'Year'])

res = res.groupby(['player', 'Year']).apply(sum)

res = res.reindex(idx).sort_index()
res[columns] = res.groupby('Player')[list(res.columns)].shift(1)

I added the groupby sum because some players in the DataFrame moved from one team to another within the same season, and I want to combine those numbers. However, the data I end up with is badly wrong. The frame has too many columns to post, but the previous year's data (_prev) seems to be placed into random columns. The result is deterministic: it always lands in the same wrong columns. Is this a problem caused by groupby.sum()? Is it because I assign with a columns variable (which holds the same names as res.columns, each with the string '_prev' appended) on the left-hand side and list(res.columns) on the right? Whatever it is, how do I fix it?
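One plausible cause (a hypothetical minimal sketch, not the questioner's actual data): assigning a DataFrame to a list of new column labels with `df[cols] = other` pairs columns *by position*, not by name. If the target list was built in a different order than the value frame's columns (for example because an earlier groupby/apply reordered them), every shifted series lands under the wrong name, and always the same wrong name:

```python
import pandas as pd

# Toy frame standing in for the real one
df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
shifted = df[['a', 'b']].shift(1)

# Target names built in a different order than shifted.columns
cols = ['b_prev', 'a_prev']

# Positional assignment: 'b_prev' receives shifted 'a',
# 'a_prev' receives shifted 'b' -- a silent misalignment
df[cols] = shifted
```

If the orders had matched, the same assignment would have been correct, which is why the bug looks "random" but is reproducible.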

Here is the output of columns and res.columns:

columns:

['player_id_prev', 'position_prev', 'player_game_count_prev', 'team_name_prev', 'snap_counts_total_prev', 'snap_counts_pass_rush_prev', 'snap_counts_run_defense_prev', 'snap_counts_coverage_prev', 'grades_defense_prev', 'grades_run_defense_prev', 'grades_tackle_prev', 'grades_pass_rush_defense_prev', 'grades_coverage_defense_prev', 'total_pressures_prev', 'sacks_prev', 'hits_prev', 'hurries_prev', 'batted_passes_prev', 'tackles_prev', 'assists_prev', 'missed_tackles_prev', 'stops_prev', 'forced_fumbles_prev', 'targets_prev', 'receptions_prev', 'yards_prev', 'yards_per_reception_prev', 'yards_after_catch_prev', 'longest_prev', 'touchdowns_prev', 'interceptions_prev', 'pass_break_ups_prev', 'qb_rating_against_prev', 'penalties_prev', 'declined_penalties_prev']

res.columns:

['player_id', 'position', 'player_game_count', 'team_name',
       'snap_counts_total', 'snap_counts_pass_rush', 'snap_counts_run_defense',
       'snap_counts_coverage', 'grades_defense', 'grades_run_defense',
       'grades_tackle', 'grades_pass_rush_defense', 'grades_coverage_defense',
       'total_pressures', 'sacks', 'hits', 'hurries', 'batted_passes',
       'tackles', 'assists', 'missed_tackles', 'stops', 'forced_fumbles',
       'targets', 'receptions', 'yards', 'yards_per_reception',
       'yards_after_catch', 'longest', 'touchdowns', 'interceptions',
       'pass_break_ups', 'qb_rating_against', 'penalties',
       'declined_penalties']

Both have a length of 35 when tested.

Tags: python, pandas

Solution


I suggest using:

# first aggregate so the (Player, Year) index is unique
res = df.groupby(['Player', 'Year']).sum()

# full Player x Year MultiIndex
idx = pd.MultiIndex.from_product(res.index.levels,
                                 names=['Player', 'Year'])
# add the missing years
res = res.reindex(idx).sort_index()

# shift all columns within each player, add suffix and join to original
res = res.join(res.groupby('Player').shift().add_suffix('_prev'))
print(res)
                       PTSN    AVGN  PTSN_prev  AVGN_prev
Player          Year                                     
Aaron Donald    2016    NaN     NaN        NaN        NaN
                2017  181.0  12.929        NaN        NaN
                2018  280.6  17.538      181.0     12.929
Danielle Hunter 2016  204.6  12.788        NaN        NaN
                2017    NaN     NaN      204.6     12.788
                2018  237.6  14.850        NaN        NaN
J.J. Watt       2016    NaN     NaN        NaN        NaN
                2017    NaN     NaN        NaN        NaN
                2018  259.8  16.238        NaN        NaN
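Put together, the steps above run end-to-end on the sample data from the question. Because the shifted frame is joined back on the shared MultiIndex and the `_prev` suffix is added by `add_suffix`, every shifted series stays under its own name; there is no separately built column list that could fall out of order:

```python
import pandas as pd

# Sample rows from the question (the real frame has many more columns)
df = pd.DataFrame({
    'Year':   [2018, 2018, 2018, 2017, 2016],
    'Player': ['Aaron Donald', 'J.J. Watt', 'Danielle Hunter',
               'Aaron Donald', 'Danielle Hunter'],
    'PTSN':   [280.6, 259.8, 237.6, 181.0, 204.6],
    'AVGN':   [17.538, 16.238, 14.850, 12.929, 12.788],
})

# 1. Aggregate so the (Player, Year) index is unique
#    (combines players who switched teams mid-season)
res = df.groupby(['Player', 'Year']).sum()

# 2. Build the full Player x Year grid and add the missing years as NaN rows
idx = pd.MultiIndex.from_product(res.index.levels, names=['Player', 'Year'])
res = res.reindex(idx).sort_index()

# 3. Shift within each player, suffix the new columns, join on the index
res = res.join(res.groupby('Player').shift().add_suffix('_prev'))
```

After this, `res.loc[('Aaron Donald', 2018), 'PTSN_prev']` holds his 2017 total, and players with no prior-year row simply get NaN.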
