python - 旋转表格并添加功能
问题描述
我创建了以下数据框,其中一个给定的作业work_id
由一个学生s_id
在 date 中执行work_date
。Awork_id
通常由三个测试组成,由test_code指定,并且在数据框中,每个执行的测试都有一个分数和一个相对阈值thresh
。如果执行comment1
,也会出现注释。test1
In [5]: import pandas as pd
...: import numpy as np
...:
...: df = pd.DataFrame(columns=['work_id', 'test_code', 's_id', 'score','thresh','work_date','comment1'],
...: data =[['a1','test1', 'p01',5, 1,'2020-06-15','score was 5'],
...: ['a2','test1', 'p01',10,1,'2020-06-10','score was 10'],
...: ['a2','test3', 'p01',7, 3,'2020-06-10',np.nan],
...: ['a3','test2', 'p01',6, 2,'2020-05-01',np.nan],
...: ['a4','test1', 'p02',4,1,'2020-06-20','score was 4'],
...: ['a4','test2', 'p02',5,2,'2020-06-20',np.nan],
...: ['a5','test3', 'p02',2,3,'2019-10-10',np.nan],
...: ['a6','test3', 'p02',7,3,'2020-04-01',np.nan]])
...: df
Out[5]:
work_id test_code s_id score thresh work_date comment1
0 a1 test1 p01 5 1 2020-06-15 score was 5
1 a2 test1 p01 10 1 2020-06-10 score was 10
2 a2 test3 p01 7 3 2020-06-10 NaN
3 a3 test2 p01 6 2 2020-05-01 NaN
4 a4 test1 p02 4 1 2020-06-20 score was 4
5 a4 test2 p02 5 2 2020-06-20 NaN
6 a5 test3 p02 2 3 2019-10-10 NaN
7 a6 test3 p02 7 3 2020-04-01 NaN
我想旋转具有work_id
索引的数据框。一个简单的数据透视表是这样的:
work_id s_id score_1 score_2 score_3 thresh_1 thresh_2 thresh_3 date comment1
0 a1 p01 5.0 NaN NaN 1.0 NaN NaN 2020-06-15 score was 5
1 a2 p01 10.0 NaN 7.0 1.0 NaN 3.0 2020-06-10 score was 10
2 a3 p01 NaN 6.0 NaN NaN 2.0 NaN 2020-05-01 NaN
3 a4 p02 4.0 5.0 NaN 1.0 2.0 NaN 2020-06-20 score was 4
4 a5 p02 NaN NaN 2.0 NaN NaN 3.0 2019-10-10 NaN
5 a6 p02 NaN NaN 7.0 NaN NaN 3.0 2020-04-01 NaN
使用
df_p = pd.pivot_table(df, index=['work_id','s_id','work_date','comment1'], columns='test_code', values=['score','thresh']).reset_index()
不包括所有缺少注释的行。
此外,对于每一个work_id
,我想
- 用相同的s_id执行的前一个值(即过去最近work_date的work_id )替换缺失
test_code
值,work_id
- 添加一列
Delta1
,显示分数的变化test1
相对于前work_id
一个相同的执行,以及相同执行的所有 test1s_id
的平均值,Mean1
s_id
最终得到这个:
work_id s_id score_1 score_2 score_3 thresh_1 thresh_2 thresh_3 date comment1 Delta1 Mean1
0 a1 p01 5.0 6.0 7.0 1.0 2.0 3.0 2020-06-15 score was 5 -5.0 7.5
1 a2 p01 10.0 6.0 7.0 1.0 2.0 3.0 2020-06-10 score was 10 NaN NaN
2 a3 p01 NaN 6.0 NaN NaN 2.0 NaN 2020-05-01 NaN NaN NaN
3 a4 p02 4.0 5.0 7.0 1.0 2.0 3.0 2020-06-20 score was 4 NaN 4.0
4 a5 p02 NaN NaN 2.0 NaN NaN 3.0 2019-10-10 NaN NaN NaN
5 a6 p02 NaN NaN 7.0 NaN NaN 3.0 2020-04-01 NaN NaN NaN
解决方案
采用:
# step-1
df1 = df.astype({'comment1': 'str'}).set_index(
['work_id', 's_id', 'comment1', 'work_date', 'test_code']).unstack()
df1.columns = df1.columns.map(lambda s: '_'.join([s[0], s[1].strip('test')]))
# step-2
df1 = df1.reset_index()
df1['comment1'] = df1['comment1'].replace('nan', np.nan)
df1 = df1.groupby(['work_id', 's_id'], as_index=False).first().sort_values('work_date')
# step-3
g = df1['score_1'].groupby(df1['s_id'])
df1['Delta1'] = g.diff()
df1['Mean1'] = g.transform('mean').mask(lambda x: x.duplicated(keep='last'))
# step-4
cols = df1.columns.str.contains(r'_\d+$')
df1.loc[:, cols] = df1.loc[:, cols].groupby(df1['s_id']).ffill()
df1 = df1.sort_index()
细节:
步骤 1:DataFrame.unstack
用于旋转数据框并使用map
withjoin
来展平列MultiLevel
。
# step-1
score_1 score_2 score_3 thresh_1 thresh_2 thresh_3
work_id s_id comment1 work_date
a1 p01 score was 5 2020-06-15 5.0 NaN NaN 1.0 NaN NaN
a2 p01 nan 2020-06-10 NaN NaN 7.0 NaN NaN 3.0
score was 10 2020-06-10 10.0 NaN NaN 1.0 NaN NaN
a3 p01 nan 2020-05-01 NaN 6.0 NaN NaN 2.0 NaN
a4 p02 nan 2020-06-20 NaN 5.0 NaN NaN 2.0 NaN
score was 4 2020-06-20 4.0 NaN NaN 1.0 NaN NaN
a5 p02 nan 2019-10-10 NaN NaN 2.0 NaN NaN 3.0
a6 p02 nan 2020-04-01 NaN NaN 7.0 NaN NaN 3.0
步骤 2:使用groupby
onwork_id
和s_id
使用聚合函数first
并对数据框进行排序work_date
# step-2
work_id s_id comment1 work_date score_1 score_2 score_3 thresh_1 thresh_2 thresh_3
4 a5 p02 NaN 2019-10-10 NaN NaN 2.0 NaN NaN 3.0
5 a6 p02 NaN 2020-04-01 NaN NaN 7.0 NaN NaN 3.0
2 a3 p01 NaN 2020-05-01 NaN 6.0 NaN NaN 2.0 NaN
1 a2 p01 score was 10 2020-06-10 10.0 NaN 7.0 1.0 NaN 3.0
0 a1 p01 score was 5 2020-06-15 5.0 NaN NaN 1.0 NaN NaN
3 a4 p02 score was 4 2020-06-20 4.0 5.0 NaN 1.0 2.0 NaN
步骤 3:计算Delta1
并通过对onMean1
进行分组并使用适当的转换函数。score_1
s_id
# step-3
work_id s_id comment1 work_date score_1 score_2 score_3 thresh_1 thresh_2 thresh_3 Delta1 Mean1
4 a5 p02 NaN 2019-10-10 NaN NaN 2.0 NaN NaN 3.0 NaN NaN
5 a6 p02 NaN 2020-04-01 NaN NaN 7.0 NaN NaN 3.0 NaN NaN
2 a3 p01 NaN 2020-05-01 NaN 6.0 NaN NaN 2.0 NaN NaN NaN
1 a2 p01 score was 10 2020-06-10 10.0 NaN 7.0 1.0 NaN 3.0 NaN NaN
0 a1 p01 score was 5 2020-06-15 5.0 NaN NaN 1.0 NaN NaN -5.0 7.5
3 a4 p02 score was 4 2020-06-20 4.0 5.0 NaN 1.0 2.0 NaN NaN 4.0
第 4 步:将缺失test_code
值替换为之前的work_id
.
# step-4
work_id s_id comment1 work_date score_1 score_2 score_3 thresh_1 thresh_2 thresh_3 Delta1 Mean1
0 a1 p01 score was 5 2020-06-15 5.0 6.0 7.0 1.0 2.0 3.0 -5.0 7.5
1 a2 p01 score was 10 2020-06-10 10.0 6.0 7.0 1.0 2.0 3.0 NaN NaN
2 a3 p01 NaN 2020-05-01 NaN 6.0 NaN NaN 2.0 NaN NaN NaN
3 a4 p02 score was 4 2020-06-20 4.0 5.0 7.0 1.0 2.0 3.0 NaN 4.0
4 a5 p02 NaN 2019-10-10 NaN NaN 2.0 NaN NaN 3.0 NaN NaN
5 a6 p02 NaN 2020-04-01 NaN NaN 7.0 NaN NaN 3.0 NaN NaN
推荐阅读
- python - 在使用 filedialog 打开的文件夹中创建子文件夹
- angular - 如何仅检测角度形式的变化值?
- c++ - 将取消引用的智能指针的地址传递给需要原始指针的函数
- python - 使用 scikit-learn 预测单个值会导致 ValueError
- json - 如何使用 gin gonic 在 Go 中上传多部分文件和 json?
- javascript - 从 json 解析数组的 java 脚本
- c# - EF CORE datetime 保存时间为 0
- reactjs - 如何在渲染后更新状态
- git - Github 桌面/VsCcode 身份验证失败
- javascript - 是否有更简洁的方法来检查变量是否具有类型编号?