Problem description

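I have a DataFrame that needs to be reshaped: the per-year values of columns a, b, c, and d should be converted from rows to columns. df is the data before the transformation and df1 is the data after it. Which statements can be optimized? My code is as follows: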

import numpy as np
import pandas as pd

np.random.seed(2021)
df = pd.DataFrame()
years = np.arange(2020, 2016, -1)
df['year'] = years
df['a'] = 1
t1 = pd.DataFrame()
t1['code'] = np.arange(1, 1001)
t1['a'] = 1
df = pd.merge(df, t1, how='outer')
columns = ['a', 'b', 'c', 'd']
df[columns] = np.random.rand(len(df) * len(columns)).reshape(len(df), len(columns)) * 100
df = (df[['code', 'year'] + columns]
      .sort_values(by=['code', 'year'], ascending=[True, False])
      .reset_index(drop=True))
print('\nOriginal DataFrame df:')
print(df.head(10))

t1 = df.drop_duplicates('code')['code']
for i in years:
    t2 = df[df.year == i].rename(columns={'a': 'a' + str(i), 'b': 'b' + str(i), 'c': 'c' + str(i), 'd': 'd' + str(i)})
    t2 = t2.drop(['year'], axis=1)
    t1 = pd.merge(t1, t2)
columns2 = ['code', 'a2020', 'a2019', 'a2018', 'a2017', 'b2020', 'b2019', 'b2018', 'b2017', 'c2020', 'c2019', 'c2018',
            'c2017', 'd2020', 'd2019', 'd2018', 'd2017']
df1 = t1[columns2].sort_values(by='code').reset_index(drop=True)
print('\nDesired result DataFrame df1:')
print(df1.head(10))

Original DataFrame df

   code  year          a          b          c          d
0     1  2020  60.597828  73.336936  13.894716  31.267308
1     1  2019  13.789299  13.785302  42.329889  26.746788
2     1  2018  20.275643  17.800622  11.384761  48.751234
3     1  2017  84.178312  53.173128  55.714519  85.602826
4     2  2020  99.724328  12.816238  17.899311  75.292543
5     2  2019  84.299843  12.070816  84.099255  88.540943
6     2  2018  68.374138  54.314583  55.619371  78.855774
7     2  2017  92.724593  31.877149  17.871666   9.245337
8     3  2020  66.216051  78.431013   9.689440   5.857129
9     3  2019  30.026980  68.526491  76.161394  14.701888

Desired result DataFrame df1

   code      a2020      a2019  ...      d2019      d2018      d2017
0     1  60.597828  13.789299  ...  26.746788  48.751234  85.602826
1     2  99.724328  84.299843  ...  88.540943  78.855774   9.245337
2     3  66.216051  30.026980  ...  14.701888  80.382850  93.540599
3     4  96.239599  19.826018  ...  11.576210  62.692749  60.357877
4     5  61.652471  63.574622  ...  89.279561  47.126554  23.663626
5     6  45.214524  23.944341  ...  67.501414  87.742089  93.640579
6     7  58.370402  49.246463  ...  40.886723  49.001126  14.725770
7     8  73.222503  37.440411  ...  76.945162  53.845775  96.251071
8     9  82.801437   4.264280  ...  57.847433  45.442707  70.354386
9    10  30.106862  91.145614  ...  36.387508  39.346212  85.077131

Tags: python, pandas, dataframe

Solution


You can do this with pivot:

result = df.pivot(index="code", columns="year", values=["a", "b", "c", "d"])
# Flatten the MultiIndex columns and rename them if necessary:
result.columns = result.columns.to_flat_index().map(lambda x: f"{x[0]}{x[1]}")

          a2017      a2018      a2019      a2020  ...      d2017      d2018      d2019      d2020
code                                              ...                                            
1     54.280967  52.179967  50.494392   8.822997  ...  10.077397  47.295947  81.042424  81.797062
2     88.683948  28.406555  92.679674  62.223895  ...  29.423042  56.582891  20.835779  46.707585
3           NaN        NaN   9.467721  34.100070  ...        NaN        NaN  35.135736   4.749653
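
The pivot result above orders the years ascending and keeps code as the index, whereas the desired df1 lists the years descending within each letter and has code as an ordinary column. The following is a minimal sketch of one way to close that gap, not necessarily what the answerer had in mind. It uses a small stand-in DataFrame (3 codes, 4 years, random values) rather than the question's 1000-code data, and the names rng, value_cols, wide, and ordered are illustrative only.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's df (assumed data: 3 codes x 4 years).
rng = np.random.default_rng(2021)
years = [2020, 2019, 2018, 2017]
value_cols = ["a", "b", "c", "d"]
df = pd.DataFrame(
    [(code, year) for code in range(1, 4) for year in years],
    columns=["code", "year"],
)
df[value_cols] = rng.random((len(df), len(value_cols))) * 100

# Pivot: one row per code, one column per (value column, year) pair.
wide = df.pivot(index="code", columns="year", values=value_cols)

# Flatten the MultiIndex columns into names such as "a2020".
wide.columns = [f"{name}{year}" for name, year in wide.columns]

# Reorder to match the question's layout (years descending within each
# letter) and turn "code" back into a regular column.
ordered = [f"{name}{year}" for name in value_cols for year in years]
df1 = wide[ordered].reset_index()
print(df1.head(10))
```

Note that pivot fills any missing (code, year) combination with NaN, as in the row for code 3 above, and raises a ValueError if the same (code, year) pair occurs more than once.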
