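python - Pandas: faster way to do a complex column transposition with the pivot function
Problem description
Simply put, I need to convert the input DataFrame below into the output shown below.
After several hours of combining answers from earlier Stack Overflow questions, I worked out how to transform the DataFrame, but because I used the pivot and apply methods, transforming a large DataFrame takes a very long time.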
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                   "day": pd.Timestamp('20190529'),
                   # one comma-separated string per row, matching the input shown below
                   "subject": ["math,english,economics", "math,economics", "philosophy,english,business", "physics,sociology", "Math"],
                   "score": pd.Categorical(["68,62,49", "58,87", "28,32,46", "72,66", "93"]),
                   "Department": pd.Categorical(["Economics", "Computer Science", "Sociology", "Business", "Math"])})
---Input DataFrame---
   id         day                      subject     score        Department
0   1  2019-05-29       math,english,economics  68,62,49         Economics
1   2  2019-05-29               math,economics     58,87  Computer Science
2   3  2019-05-29  philosophy,english,business  28,32,46         Sociology
3   4  2019-05-29            physics,sociology     72,66          Business
4   5  2019-05-29                         Math        93              Math
The output should look like this:
---Output DataFrame---
id         day        Department  Math  business  economics  english  math  philosophy  physics  sociology
 1  2019-05-29         Economics   NaN       NaN         49       62    68         NaN      NaN        NaN
 2  2019-05-29  Computer Science   NaN       NaN         87      NaN    58         NaN      NaN        NaN
 3  2019-05-29         Sociology   NaN        46        NaN       32   NaN          28      NaN        NaN
 4  2019-05-29          Business   NaN       NaN        NaN      NaN   NaN         NaN       72         66
 5  2019-05-29              Math    93       NaN        NaN      NaN   NaN         NaN      NaN        NaN
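My approach is:
1. Split the subject and score columns on ",".
2. Expand the lists in the subject and score columns so that each element becomes its own row, as a pandas.Series.
3. Join the two pandas.Series to make a new DataFrame.
4. Pivot the new DataFrame created in step 3.
5. Drop the subject and score columns from the original DataFrame.
6. Join the DataFrames made in steps 4 and 5.
My code is as follows: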
df["subject"] = df["subject"].str.split(",")
df["score"] = df["score"].str.split(",")
subject = df.apply(lambda x: pd.Series(x['subject']),axis=1).stack().reset_index(level=1, drop=True)
score = df.apply(lambda x: pd.Series(x['score']),axis=1).stack().reset_index(level=1, drop=True)
subject.name = 'subject'
score.name = 'score'
subject_score = pd.concat([subject, score],join='outer', axis=1)
pdf = df.drop('subject', axis=1).drop("score", axis=1).join(subject_score)
pivot = pdf.pivot(columns="subject",values="score")
concate_table = df.drop("subject",axis = 1).drop("score", axis=1)
output = concate_table.join(pivot)
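I only recently started learning pandas, and I am sure this is not the best way to transpose the columns.
I would appreciate any advice on how to optimize this code.
Thank you in advance.
Solution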
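I would define a custom function `stack_str` that uses `str.split` with `expand=True`, followed by `stack` and `reset_index`, to unpack a Series of strings into long form. Apply `stack_str` to the 2 string columns to build `df1` with 2 columns.
Next, `pivot` `df1` so that the `subject` values become the columns and the `score` values become the values. Finally, join the result back to `df` with the 2 string columns dropped.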
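def stack_str(x):
    # split each string into columns, stack into long form, keep only the original row label
    # (dropna() discards padding from shorter rows if stack() keeps missing values)
    s = x.str.split(',', expand=True).stack().dropna().reset_index(level=-1, drop=True)
    return s

df1 = df[['subject', 'score']].apply(stack_str)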
Out[984]:
      subject  score
0        math     68
0     english     62
0   economics     49
1        math     58
1   economics     87
2  philosophy     28
2     english     32
2    business     46
3     physics     72
3   sociology     66
4        Math     93
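To see why `reset_index(level=-1, drop=True)` is needed, the following minimal sketch walks through the intermediate results of the same chain on a hypothetical two-row Series (the data is made up for illustration and is not from the question). The `.dropna()` call is an added safeguard, on the assumption that some pandas versions keep the padding cells when stacking.

```python
import pandas as pd

# Hypothetical two-row example, not the data from the question.
s = pd.Series(["math,english", "physics"], name="subject")

# One column per list position; the shorter row is padded with a missing value.
wide = s.str.split(",", expand=True)

# Long form with a two-level index: (original row label, list position).
stacked = wide.stack().dropna()

# Dropping the position level leaves the original row label repeated once per item,
# which is what lets the later pivot and join align with the original DataFrame.
flat = stacked.reset_index(level=-1, drop=True)

print(wide)
print(stacked)
print(flat)
```

Because every item carries the label of the row it came from, `subject` and `score` processed this way line up item by item, provided each row holds the same number of subjects and scores.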
df2 = df.drop(['subject', 'score'], axis=1).join(df1.pivot(columns='subject', values='score'))
Out[986]:
   id         day        Department  Math  business  economics  english  math  \
0   1  2019-05-29         Economics   NaN       NaN         49       62    68
1   2  2019-05-29  Computer Science   NaN       NaN         87      NaN    58
2   3  2019-05-29         Sociology   NaN        46        NaN       32   NaN
3   4  2019-05-29          Business   NaN       NaN        NaN      NaN   NaN
4   5  2019-05-29              Math    93       NaN        NaN      NaN   NaN

   philosophy  physics  sociology
0         NaN      NaN        NaN
1         NaN      NaN        NaN
2          28      NaN        NaN
3         NaN       72         66
4         NaN      NaN        NaN
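As a possible alternative that avoids `apply` altogether, the sketch below uses `DataFrame.explode` on both string columns at once. This is not part of the answer above. It assumes pandas 1.3 or newer (multi-column `explode`), that every row has the same number of subjects and scores, and that `id` is unique. The names `long` and `wide` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "day": pd.Timestamp("20190529"),
    "subject": ["math,english,economics", "math,economics",
                "philosophy,english,business", "physics,sociology", "Math"],
    "score": ["68,62,49", "58,87", "28,32,46", "72,66", "93"],
    "Department": ["Economics", "Computer Science", "Sociology", "Business", "Math"],
})

# Split both columns into lists, then explode them together so the n-th subject
# stays paired with the n-th score (assumes equal list lengths within each row).
long = (
    df.assign(subject=df["subject"].str.split(","),
              score=df["score"].str.split(","))
      .explode(["subject", "score"])
)
long["score"] = long["score"].astype(int)

# One row per id, one column per subject; subjects a student lacks become NaN.
wide = long.pivot(index="id", columns="subject", values="score")

# Attach the subject columns to the original frame without the two string columns.
output = df.drop(columns=["subject", "score"]).join(wide, on="id")
print(output)
```

Unlike the `stack_str` version, this sketch converts the scores to integers before pivoting, so the subject columns hold numbers (floats once NaN is present) rather than strings. Whether it is actually faster on a given dataset would need to be measured.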