首页 > 解决方案 > 鉴于要分发的值是非数字的,如何将数据帧从长更改为宽?

问题描述

您好,从 R 迁移到 python,我想弄清楚如何将 Long 格式更改为宽格式,因为要分发的值是字符串或非数字

import pandas as pd
import numpy as np

示例虚拟数据

dummy_data = {'process_id':  [1,1,1,2,3,4,5,5,6],
    
    'purchase_date': ['2020-01-01', 
                      '2020-01-01',
                      '2020-01-01',
                      '2020-03-01',
                      '2020-03-02', 
                      '2020-03-04', 
                      '2020-07-04', 
                      '2020-07-04',
                      '2020-01-20'],
 'consumption_date': ['2021-02-01', 
                      '2021-02-01',
                      '2021-02-01',
                      '2021-03-01',
                      '2021-04-02', 
                      '2021-05-04', 
                      '2021-09-04', 
                      '2021-09-04',
                      '2021-08-04'],
    'discount_code': ['10',
                      '12', 
                      '11',
                      '13',
                      '90', 
                      '81', 
                      '82', 
                      '11',
                      '10']
    }

df = pd.DataFrame (dummy_data, columns = ['process_id','purchase_date','consumption_date','discount_code'])

虚拟数据输出示例:

   process_id purchase_date consumption_date discount_code

0           1    2020-01-01       2021-02-01            10
1           1    2020-01-01       2021-02-01            12
2           1    2020-01-01       2021-02-01            11
3           2    2020-03-01       2021-03-01            13
4           3    2020-03-02       2021-04-02            90
5           4    2020-03-04       2021-05-04            81
6           5    2020-07-04       2021-09-04            82
7           5    2020-07-04       2021-09-04            11
8           6    2020-01-20       2021-08-04            10

所以每个 process_id 可能有 N 个 discount_code

我想要实现的目标:如果一个process_id有多个discount_code,我想创建N个discount_code_columns

示例预期数据

expected_data = {'process_id':  [1,2,3,4,5,6],
    
    'purchase_date': ['2020-01-01', 
                      '2020-03-01',
                      '2020-03-02', 
                      '2020-03-04', 
                      '2020-07-04',
                      '2020-01-20'],
 'consumption_date': ['2021-02-01', 
                      '2021-03-01',
                      '2021-04-02', 
                      '2021-05-04', 
                      '2021-09-04',
                      '2021-08-04'],
    'discount_code_1': ['10', '13',   '90',   '81', '  82',  '10'],
    'discount_code_2': ['12', np.nan, np.nan, np.nan, '11', np.nan,],
    'discount_code_3': ['11', np.nan, np.nan, np.nan, np.nan, np.nan], 
    }
df2 = pd.DataFrame (expected_data, columns = ['process_id','purchase_date','consumption_date','discount_code_1','discount_code_2','discount_code_3'])

示例预期数据输出

   process_id purchase_date consumption_date discount_code_1 discount_code_2 discount_code_3
0           1    2020-01-01       2021-02-01              10              12              11
1           2    2020-03-01       2021-03-01              13             NaN             NaN
2           3    2020-03-02       2021-04-02              90             NaN             NaN
3           4    2020-03-04       2021-05-04              81             NaN             NaN
4           5    2020-07-04       2021-09-04              82              11             NaN
5           6    2020-01-20       2021-08-04              10             NaN             NaN

所以我尝试的是用户 pandas pivot

df1 = df.pivot_table(index=['process_id','purchase_date','consumption_date'], 
columns='process_id', values=['discount_code'], aggfunc='first')
df1.columns = df1.columns.droplevel()
df1 = df1.reset_index()
df1.columns=df1.columns.tolist()


   process_id purchase_date consumption_date    1    2    3    4    5    6
0           1    2020-01-01       2021-02-01   10  NaN  NaN  NaN  NaN  NaN
1           2    2020-03-01       2021-03-01  NaN   13  NaN  NaN  NaN  NaN
2           3    2020-03-02       2021-04-02  NaN  NaN   90  NaN  NaN  NaN
3           4    2020-03-04       2021-05-04  NaN  NaN  NaN   81  NaN  NaN
4           5    2020-07-04       2021-09-04  NaN  NaN  NaN  NaN   82  NaN
5           6    2020-01-20       2021-08-04  NaN  NaN  NaN  NaN  NaN   10

但这似乎并不正确。感谢任何帮助!

标签: pythonpandas

解决方案


让我们unstack在创建后尝试重塑MultiIndex以唯一标识行

c = ['process_id', 'purchase_date', 'consumption_date']
df1 = df.set_index([*c, df.groupby(c).cumcount().add(1).astype(str)]).unstack()
df1.columns = df1.columns.map('_'.join)

>>> df1.reset_index()

   process_id purchase_date consumption_date discount_code_1 discount_code_2 discount_code_3
0           1    2020-01-01       2021-02-01              10              12              11
1           2    2020-03-01       2021-03-01              13             NaN             NaN
2           3    2020-03-02       2021-04-02              90             NaN             NaN
3           4    2020-03-04       2021-05-04              81             NaN             NaN
4           5    2020-07-04       2021-09-04              82              11             NaN
5           6    2020-01-20       2021-08-04              10             NaN             NaN

推荐阅读