python - 鉴于要分发的值是非数字的,如何将数据帧从长更改为宽?
问题描述
您好,从 R 迁移到 python,我想弄清楚如何将 Long 格式更改为宽格式,因为要分发的值是字符串或非数字
import pandas as pd
import numpy as np
示例虚拟数据
dummy_data = {'process_id': [1,1,1,2,3,4,5,5,6],
'purchase_date': ['2020-01-01',
'2020-01-01',
'2020-01-01',
'2020-03-01',
'2020-03-02',
'2020-03-04',
'2020-07-04',
'2020-07-04',
'2020-01-20'],
'consumption_date': ['2021-02-01',
'2021-02-01',
'2021-02-01',
'2021-03-01',
'2021-04-02',
'2021-05-04',
'2021-09-04',
'2021-09-04',
'2021-08-04'],
'discount_code': ['10',
'12',
'11',
'13',
'90',
'81',
'82',
'11',
'10']
}
df = pd.DataFrame (dummy_data, columns = ['process_id','purchase_date','consumption_date','discount_code'])
虚拟数据输出示例:
process_id purchase_date consumption_date discount_code
0 1 2020-01-01 2021-02-01 10
1 1 2020-01-01 2021-02-01 12
2 1 2020-01-01 2021-02-01 11
3 2 2020-03-01 2021-03-01 13
4 3 2020-03-02 2021-04-02 90
5 4 2020-03-04 2021-05-04 81
6 5 2020-07-04 2021-09-04 82
7 5 2020-07-04 2021-09-04 11
8 6 2020-01-20 2021-08-04 10
所以每个 process_id 可能有 N 个 discount_code
我想要实现的目标:如果一个process_id有多个discount_code,我想创建N个discount_code_columns
示例预期数据
expected_data = {'process_id': [1,2,3,4,5,6],
'purchase_date': ['2020-01-01',
'2020-03-01',
'2020-03-02',
'2020-03-04',
'2020-07-04',
'2020-01-20'],
'consumption_date': ['2021-02-01',
'2021-03-01',
'2021-04-02',
'2021-05-04',
'2021-09-04',
'2021-08-04'],
'discount_code_1': ['10', '13', '90', '81', ' 82', '10'],
'discount_code_2': ['12', np.nan, np.nan, np.nan, '11', np.nan,],
'discount_code_3': ['11', np.nan, np.nan, np.nan, np.nan, np.nan],
}
df2 = pd.DataFrame (expected_data, columns = ['process_id','purchase_date','consumption_date','discount_code_1','discount_code_2','discount_code_3'])
示例预期数据输出
process_id purchase_date consumption_date discount_code_1 discount_code_2 discount_code_3
0 1 2020-01-01 2021-02-01 10 12 11
1 2 2020-03-01 2021-03-01 13 NaN NaN
2 3 2020-03-02 2021-04-02 90 NaN NaN
3 4 2020-03-04 2021-05-04 81 NaN NaN
4 5 2020-07-04 2021-09-04 82 11 NaN
5 6 2020-01-20 2021-08-04 10 NaN NaN
所以我尝试的是用户 pandas pivot
df1 = df.pivot_table(index=['process_id','purchase_date','consumption_date'],
columns='process_id', values=['discount_code'], aggfunc='first')
df1.columns = df1.columns.droplevel()
df1 = df1.reset_index()
df1.columns=df1.columns.tolist()
process_id purchase_date consumption_date 1 2 3 4 5 6
0 1 2020-01-01 2021-02-01 10 NaN NaN NaN NaN NaN
1 2 2020-03-01 2021-03-01 NaN 13 NaN NaN NaN NaN
2 3 2020-03-02 2021-04-02 NaN NaN 90 NaN NaN NaN
3 4 2020-03-04 2021-05-04 NaN NaN NaN 81 NaN NaN
4 5 2020-07-04 2021-09-04 NaN NaN NaN NaN 82 NaN
5 6 2020-01-20 2021-08-04 NaN NaN NaN NaN NaN 10
但这似乎并不正确。感谢任何帮助!
解决方案
让我们unstack
在创建后尝试重塑MultiIndex
以唯一标识行
c = ['process_id', 'purchase_date', 'consumption_date']
df1 = df.set_index([*c, df.groupby(c).cumcount().add(1).astype(str)]).unstack()
df1.columns = df1.columns.map('_'.join)
>>> df1.reset_index()
process_id purchase_date consumption_date discount_code_1 discount_code_2 discount_code_3
0 1 2020-01-01 2021-02-01 10 12 11
1 2 2020-03-01 2021-03-01 13 NaN NaN
2 3 2020-03-02 2021-04-02 90 NaN NaN
3 4 2020-03-04 2021-05-04 81 NaN NaN
4 5 2020-07-04 2021-09-04 82 11 NaN
5 6 2020-01-20 2021-08-04 10 NaN NaN
推荐阅读
- c++ - 具有转发引用的静态多态性
- android - 如何给路径填充?
- python - 如何从网站中提取冠状病毒病例?
- javascript - 如何将新对象添加到嵌套在 ReactJS 中的对象数组中的对象数组中?
- rest - werkzeug.exceptions.BadRequestKeyError:400 错误请求:浏览器(或代理)发送了此服务器无法理解的请求。键错误:'id'
- sqlite - 使用预填充的 SQLite DB 数据运送 Flutter 应用程序
- flutter - 如何在 SingleChildScrollView 中实现不可滚动的列表视图
- r - ggplot/qqplotr 返回数据格式错误
- apache-spark - 如何在 HDP 2.6.5 中使用 Hive 仓库连接器
- python - 如何根据熊猫的范围在列中获取布尔值