首页 > 解决方案 > 旋转表格并添加功能

问题描述

我创建了以下数据框,其中一个给定的作业work_id由一个学生s_id在 date 中执行work_date。Awork_id通常由三个测试组成,由test_code指定,并且在数据框中,每个执行的测试都有一个分数和一个相对阈值thresh。如果执行comment1,也会出现注释。test1

In [5]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: df = pd.DataFrame(columns=['work_id', 'test_code', 's_id', 'score','thresh','work_date','comment1'],
   ...:                   data =[['a1','test1', 'p01',5, 1,'2020-06-15','score was 5'],
   ...:                          ['a2','test1', 'p01',10,1,'2020-06-10','score was 10'],
   ...:                          ['a2','test3', 'p01',7, 3,'2020-06-10',np.nan],
   ...:                          ['a3','test2', 'p01',6, 2,'2020-05-01',np.nan],
   ...:                          ['a4','test1', 'p02',4,1,'2020-06-20','score was 4'],
   ...:                          ['a4','test2', 'p02',5,2,'2020-06-20',np.nan],
   ...:                          ['a5','test3', 'p02',2,3,'2019-10-10',np.nan],
   ...:                          ['a6','test3', 'p02',7,3,'2020-04-01',np.nan]])
   ...: df
Out[5]:
  work_id test_code s_id  score  thresh   work_date      comment1
0      a1     test1  p01      5       1  2020-06-15   score was 5
1      a2     test1  p01     10       1  2020-06-10  score was 10
2      a2     test3  p01      7       3  2020-06-10           NaN
3      a3     test2  p01      6       2  2020-05-01           NaN
4      a4     test1  p02      4       1  2020-06-20   score was 4
5      a4     test2  p02      5       2  2020-06-20           NaN
6      a5     test3  p02      2       3  2019-10-10           NaN
7      a6     test3  p02      7       3  2020-04-01           NaN

我想旋转具有work_id索引的数据框。一个简单的数据透视表是这样的:

work_id s_id  score_1  score_2  score_3  thresh_1  thresh_2  thresh_3        date      comment1
0      a1  p01      5.0      NaN      NaN       1.0       NaN       NaN  2020-06-15   score was 5
1      a2  p01     10.0      NaN      7.0       1.0       NaN       3.0  2020-06-10  score was 10
2      a3  p01      NaN      6.0      NaN       NaN       2.0       NaN  2020-05-01           NaN
3      a4  p02      4.0      5.0      NaN       1.0       2.0       NaN  2020-06-20   score was 4
4      a5  p02      NaN      NaN      2.0       NaN       NaN       3.0  2019-10-10           NaN
5      a6  p02      NaN      NaN      7.0       NaN       NaN       3.0  2020-04-01           NaN

使用

df_p = pd.pivot_table(df,  index=['work_id','s_id','work_date','comment1'], columns='test_code', values=['score','thresh']).reset_index()

不包括所有缺少注释的行。

此外,对于每一个work_id,我想

  1. 用相同的s_id执行的前一个值(即过去最近work_date的work_id )替换缺​​失test_code值,work_id
  2. 添加一列Delta1,显示分数的变化test1相对于前work_id一个相同的执行,以及相同执行的所有 test1s_id的平均值,Mean1s_id

最终得到这个:

work_id s_id  score_1  score_2  score_3  thresh_1  thresh_2  thresh_3        date      comment1  Delta1  Mean1
0      a1  p01      5.0      6.0      7.0       1.0       2.0       3.0  2020-06-15   score was 5    -5.0    7.5
1      a2  p01     10.0      6.0      7.0       1.0       2.0       3.0  2020-06-10  score was 10     NaN    NaN
2      a3  p01      NaN      6.0      NaN       NaN       2.0       NaN  2020-05-01           NaN     NaN    NaN
3      a4  p02      4.0      5.0      7.0       1.0       2.0       3.0  2020-06-20   score was 4     NaN    4.0
4      a5  p02      NaN      NaN      2.0       NaN       NaN       3.0  2019-10-10           NaN     NaN    NaN
5      a6  p02      NaN      NaN      7.0       NaN       NaN       3.0  2020-04-01           NaN     NaN    NaN

标签: pythonpandasdataframepivotpivot-table

解决方案


采用:

# step-1
df1 = df.astype({'comment1': 'str'}).set_index(
    ['work_id', 's_id', 'comment1', 'work_date', 'test_code']).unstack()
df1.columns = df1.columns.map(lambda s: '_'.join([s[0], s[1].strip('test')]))

# step-2
df1 = df1.reset_index()
df1['comment1'] = df1['comment1'].replace('nan', np.nan)
df1 = df1.groupby(['work_id', 's_id'], as_index=False).first().sort_values('work_date')

# step-3
g = df1['score_1'].groupby(df1['s_id'])
df1['Delta1'] = g.diff()
df1['Mean1'] = g.transform('mean').mask(lambda x: x.duplicated(keep='last'))

# step-4
cols = df1.columns.str.contains(r'_\d+$')
df1.loc[:, cols] = df1.loc[:, cols].groupby(df1['s_id']).ffill()
df1 = df1.sort_index()

细节:

步骤 1:DataFrame.unstack用于旋转数据框并使用mapwithjoin来展平列MultiLevel

# step-1
                                      score_1  score_2  score_3  thresh_1  thresh_2  thresh_3
work_id s_id comment1     work_date                                                          
a1      p01  score was 5  2020-06-15      5.0      NaN      NaN       1.0       NaN       NaN
a2      p01  nan          2020-06-10      NaN      NaN      7.0       NaN       NaN       3.0
             score was 10 2020-06-10     10.0      NaN      NaN       1.0       NaN       NaN
a3      p01  nan          2020-05-01      NaN      6.0      NaN       NaN       2.0       NaN
a4      p02  nan          2020-06-20      NaN      5.0      NaN       NaN       2.0       NaN
             score was 4  2020-06-20      4.0      NaN      NaN       1.0       NaN       NaN
a5      p02  nan          2019-10-10      NaN      NaN      2.0       NaN       NaN       3.0
a6      p02  nan          2020-04-01      NaN      NaN      7.0       NaN       NaN       3.0

步骤 2:使用groupbyonwork_ids_id使用聚合函数first并对数据框进行排序work_date

# step-2
  work_id s_id      comment1   work_date  score_1  score_2  score_3  thresh_1  thresh_2  thresh_3
4      a5  p02           NaN  2019-10-10      NaN      NaN      2.0       NaN       NaN       3.0
5      a6  p02           NaN  2020-04-01      NaN      NaN      7.0       NaN       NaN       3.0
2      a3  p01           NaN  2020-05-01      NaN      6.0      NaN       NaN       2.0       NaN
1      a2  p01  score was 10  2020-06-10     10.0      NaN      7.0       1.0       NaN       3.0
0      a1  p01   score was 5  2020-06-15      5.0      NaN      NaN       1.0       NaN       NaN
3      a4  p02   score was 4  2020-06-20      4.0      5.0      NaN       1.0       2.0       NaN

步骤 3:计算Delta1并通过对onMean1进行分组并使用适当的转换函数。score_1s_id

# step-3
      work_id s_id      comment1   work_date  score_1  score_2  score_3  thresh_1  thresh_2  thresh_3  Delta1  Mean1
    4      a5  p02           NaN  2019-10-10      NaN      NaN      2.0       NaN       NaN       3.0     NaN    NaN
    5      a6  p02           NaN  2020-04-01      NaN      NaN      7.0       NaN       NaN       3.0     NaN    NaN
    2      a3  p01           NaN  2020-05-01      NaN      6.0      NaN       NaN       2.0       NaN     NaN    NaN
    1      a2  p01  score was 10  2020-06-10     10.0      NaN      7.0       1.0       NaN       3.0     NaN    NaN
    0      a1  p01   score was 5  2020-06-15      5.0      NaN      NaN       1.0       NaN       NaN    -5.0    7.5
    3      a4  p02   score was 4  2020-06-20      4.0      5.0      NaN       1.0       2.0       NaN     NaN    4.0

第 4 步:将缺失test_code值替换为之前的work_id.

# step-4
  work_id s_id      comment1   work_date  score_1  score_2  score_3  thresh_1  thresh_2  thresh_3  Delta1  Mean1
0      a1  p01   score was 5  2020-06-15      5.0      6.0      7.0       1.0       2.0       3.0    -5.0    7.5
1      a2  p01  score was 10  2020-06-10     10.0      6.0      7.0       1.0       2.0       3.0     NaN    NaN
2      a3  p01           NaN  2020-05-01      NaN      6.0      NaN       NaN       2.0       NaN     NaN    NaN
3      a4  p02   score was 4  2020-06-20      4.0      5.0      7.0       1.0       2.0       3.0     NaN    4.0
4      a5  p02           NaN  2019-10-10      NaN      NaN      2.0       NaN       NaN       3.0     NaN    NaN
5      a6  p02           NaN  2020-04-01      NaN      NaN      7.0       NaN       NaN       3.0     NaN    NaN

推荐阅读