Flattening a DataFrame

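Problem description

Below I have a data frame with several columns, and I want to keep all of those columns after flattening. The flattening should happen on name_id, because it represents longitudinal data. I eventually want to merge the result with other data frames, so groupby, while nice, does not look like a good fit for eventually applying machine-learning techniques. That said, I know there are some very smart people here who can offer advice, based on their own experience, on how to approach this. Any ideas are much appreciated!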

import pandas as pd

df = pd.DataFrame({'name_id':[1254, 1359, 1254, 1296, 1353, 2656, 1353],
                   'enrollment_term':['spring 2018', 'spring 2018', 'fall 2018', 'spring 2018', 'spring 2018', 'fall 2020', 'fall 2018'],
                   'gpa_term': [2.93, 3.67, 1.65, 4.00, 3.95, 2.92, 2.82],
                   'course':['math', 'geom', 'alg', 'history', 'art', 'geography', 'donkey ownership'],
                   'dorm_res':[1,1,1,0,0,1,1],
                   'home_work':[0.56, 0.89, 0.95, 0.7, 0.3, 0.64, 0.49]
                   })

df


Tags: python, pandas, pivot-table

Solution


It is hard to tell exactly what you want to do. Here are three ways to flatten a data frame.

x=df.values.flatten()
x


x=df.stack().values
x


import numpy as np
np.reshape(df.values, (1,df.shape[0]*df.shape[1]))

Result (shown for the np.reshape call; the first two return the same values as a 1-D array):

array([[1254, 'spring 2018', 2.93, 'math', 1, 0.56, 1359, 'spring 2018',
        3.67, 'geom', 1, 0.89, 1254, 'fall 2018', 1.65, 'alg', 1, 0.95,
        1296, 'spring 2018', 4.0, 'history', 0, 0.7, 1353, 'spring 2018',
        3.95, 'art', 0, 0.3, 2656, 'fall 2020', 2.92, 'geography', 1,
        0.64, 1353, 'fall 2018', 2.82, 'donkey ownership', 1, 0.49]],
      dtype=object)
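
As a quick check that the three calls above really produce the same values, here is a minimal sketch. It assumes the question's df and uses df.to_numpy(), which the pandas documentation recommends over .values; the variable names flat, stacked and row are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name_id': [1254, 1359, 1254, 1296, 1353, 2656, 1353],
                   'enrollment_term': ['spring 2018', 'spring 2018', 'fall 2018', 'spring 2018',
                                       'spring 2018', 'fall 2020', 'fall 2018'],
                   'gpa_term': [2.93, 3.67, 1.65, 4.00, 3.95, 2.92, 2.82],
                   'course': ['math', 'geom', 'alg', 'history', 'art', 'geography', 'donkey ownership'],
                   'dorm_res': [1, 1, 1, 0, 0, 1, 1],
                   'home_work': [0.56, 0.89, 0.95, 0.7, 0.3, 0.64, 0.49]})

# 1-D array, values read row by row
flat = df.to_numpy().ravel()

# stack() walks the frame row by row as well, so the order matches
stacked = df.stack().to_numpy()

# 2-D array with a single row; -1 lets NumPy infer the length
row = np.reshape(df.to_numpy(), (1, -1))

print(flat.shape, stacked.shape, row.shape)
```

All three lose the column labels, so they are only useful when a bare array of values is what you need.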

Or perhaps you want to use melt, like this:

pd.melt(df, id_vars=['name_id'])

Result:

    name_id         variable             value
0      1254  enrollment_term       spring 2018
1      1359  enrollment_term       spring 2018
2      1254  enrollment_term         fall 2018
3      1296  enrollment_term       spring 2018
4      1353  enrollment_term       spring 2018
5      2656  enrollment_term         fall 2020
6      1353  enrollment_term         fall 2018
7      1254         gpa_term              2.93
8      1359         gpa_term              3.67
9      1254         gpa_term              1.65
10     1296         gpa_term                 4
11     1353         gpa_term              3.95
12     2656         gpa_term              2.92
13     1353         gpa_term              2.82
14     1254           course              math
15     1359           course              geom
16     1254           course               alg
17     1296           course           history
18     1353           course               art
19     2656           course         geography
20     1353           course  donkey ownership
21     1254         dorm_res                 1
22     1359         dorm_res                 1
23     1254         dorm_res                 1
24     1296         dorm_res                 0
25     1353         dorm_res                 0
26     2656         dorm_res                 1
27     1353         dorm_res                 1
28     1254        home_work              0.56
29     1359        home_work              0.89
30     1254        home_work              0.95
31     1296        home_work               0.7
32     1353        home_work               0.3
33     2656        home_work              0.64
34     1353        home_work              0.49
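
Note that melting on name_id alone turns enrollment_term into just another variable, so a GPA row no longer says which term it belongs to. One possible variation, sketched here under the assumption that (name_id, enrollment_term) identifies an observation, keeps both as id columns. The column names 'feature' and 'value' are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name_id': [1254, 1359, 1254, 1296, 1353, 2656, 1353],
                   'enrollment_term': ['spring 2018', 'spring 2018', 'fall 2018', 'spring 2018',
                                       'spring 2018', 'fall 2020', 'fall 2018'],
                   'gpa_term': [2.93, 3.67, 1.65, 4.00, 3.95, 2.92, 2.82],
                   'course': ['math', 'geom', 'alg', 'history', 'art', 'geography', 'donkey ownership'],
                   'dorm_res': [1, 1, 1, 0, 0, 1, 1],
                   'home_work': [0.56, 0.89, 0.95, 0.7, 0.3, 0.64, 0.49]})

# Keep student and term as identifiers; melt the four remaining columns
long_df = pd.melt(df,
                  id_vars=['name_id', 'enrollment_term'],
                  var_name='feature',    # illustrative name
                  value_name='value')    # illustrative name

print(long_df.head())
```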

Or perhaps this:

x = df.pivot(index='name_id', columns='enrollment_term', values='gpa_term')
x

Result:

enrollment_term  fall 2018  fall 2020  spring 2018
name_id                                           
1254                  1.65        NaN         2.93
1296                   NaN        NaN         4.00
1353                  2.82        NaN         3.95
1359                   NaN        NaN         3.67
2656                   NaN       2.92          NaN

I don't see any longitudinal data here.
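
If the goal is one row per name_id with every column kept, which is one reading of the question, the pivot above can be extended to all value columns. The sketch below is only one possible approach. It assumes each (name_id, enrollment_term) pair occurs at most once, as it does in the sample data, and the "column_term" naming scheme is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({'name_id': [1254, 1359, 1254, 1296, 1353, 2656, 1353],
                   'enrollment_term': ['spring 2018', 'spring 2018', 'fall 2018', 'spring 2018',
                                       'spring 2018', 'fall 2020', 'fall 2018'],
                   'gpa_term': [2.93, 3.67, 1.65, 4.00, 3.95, 2.92, 2.82],
                   'course': ['math', 'geom', 'alg', 'history', 'art', 'geography', 'donkey ownership'],
                   'dorm_res': [1, 1, 1, 0, 0, 1, 1],
                   'home_work': [0.56, 0.89, 0.95, 0.7, 0.3, 0.64, 0.49]})

# Omitting `values` pivots every remaining column; the result has
# two-level columns of the form (original column, enrollment_term).
wide = df.pivot(index='name_id', columns='enrollment_term')

# Collapse the two levels into plain strings such as 'gpa_term_fall 2018'
wide.columns = [f"{col}_{term}" for col, term in wide.columns]

# Turn name_id back into an ordinary column so the frame is ready for merge()
wide = wide.reset_index()

print(wide.shape)
```

Terms a student did not attend show up as NaN, so some imputation or masking step would still be needed before feeding the result to a model. If duplicate (name_id, enrollment_term) pairs can occur, pivot raises a ValueError, and pivot_table with an explicit aggregation would be the alternative.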

