Pandas stack() strips the Categorical dtype from a MultiIndex

Problem description

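I have a pandas DataFrame that has a categorical MultiIndex for both its rows and its columns. However, the stack() operation, which moves index levels from the columns to the rows, strips the Categorical dtype: it builds the new index from strings.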

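Why does it do this? Is there a way to prevent it? I am looking for a solution that does not require manually resetting the categorical dtype after every operation.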

Note that others have found a similar problem with melt() stripping categorical dtypes [https://stackoverflow.com/questions/64900604/categorical-column-after-melt-in-pandas, https://stackoverflow.com/questions/63138258/why-is-pandas-melt-messing-with-my-dtypes]

Code

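The code below illustrates the problem. I print the dtype of every index level before and after the stack() operation. stack() seems to keep the Categorical dtype for the first level it stacks, but strips it from every further level.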

import pandas as pd
import numpy as np
# -----------------------------------
# dtype for each level of row index or column index
def get_Index_level_dtypes(df, axis):
    I = df.index if axis==0 else df.columns
    return [I.get_level_values(i).dtype for i in range(I.nlevels)]

# Print:
# (i)   name of DataFrame (or Series)
# (ii)  dtype of levels for axis = 0
# (iii) dtype of levels for axis = 1  (if not Series)
def print_Index_level_dtypes(df, S):
    print('-'*100,"\n", S, " = \n", df, "\n")  # print df name
    for i in range(2 if isinstance(df, pd.DataFrame) else 1):
        print("Level data type, axis = ",i,":")
        for q in get_Index_level_dtypes(df, axis=i):
            print(q)
# -----------------------------------
midx = pd.MultiIndex.from_arrays(
    [
    pd.Categorical(['A1','A1','A2','A2']),
    pd.Categorical(['B1','B2','B1','B2']),
    pd.Categorical(['C1','C1','C1','C1']),
    ])
np.random.seed(0)
df = pd.DataFrame(
    np.random.randn(2, 4), 
    columns=midx,
    index = pd.Categorical(['Row1','Row2'])
)
# --------------------------------------
print_Index_level_dtypes(df,"Orig")
#
# • Stack one level:
#    row index: Keeps categorical type
#    col index: strips categorical type (types are "object", ie string)
print_Index_level_dtypes(df.stack(level = [0]), "Stack_level_0")   # same behavior for all level=[i]
#
# • Stack multiple levels:
#    row index: Keeps 1st categorical type, strips rest
#    col index: strips categorical type (types are "object", ie string)
print_Index_level_dtypes(df.stack(level = [0, 1]),    "Stack_level_01")
print_Index_level_dtypes(df.stack(level = [0, 1, 2]), "Stack_level_012")

Output

---------------------------------------------------------------------------------------------------- 
 Orig  = 
             A1                  A2          
            B1        B2        B1        B2
            C1        C1        C1        C1
Row1  1.764052  0.400157  0.978738  2.240893
Row2  1.867558 -0.977278  0.950088 -0.151357 

Level data type, axis =  0 :
category
Level data type, axis =  1 :
category
category
category
---------------------------------------------------------------------------------------------------- 
 Stack_level_0  = 
                B1        B2
               C1        C1
Row1 A1  1.764052  0.400157
     A2  0.978738  2.240893
Row2 A1  1.867558 -0.977278
     A2  0.950088 -0.151357 

Level data type, axis =  0 :
category
category
Level data type, axis =  1 :
object
object
---------------------------------------------------------------------------------------------------- 
 Stack_level_01  = 
                   C1
Row1 A1 B1  1.764052
        B2  0.400157
     A2 B1  0.978738
        B2  2.240893
Row2 A1 B1  1.867558
        B2 -0.977278
     A2 B1  0.950088
        B2 -0.151357 

Level data type, axis =  0 :
category
category
object
Level data type, axis =  1 :
object
---------------------------------------------------------------------------------------------------- 
 Stack_level_012  = 
Row1  A1  B1  C1    1.764052
          B2  C1    0.400157
      A2  B1  C1    0.978738
          B2  C1    2.240893
Row2  A1  B1  C1    1.867558
          B2  C1   -0.977278
      A2  B1  C1    0.950088
          B2  C1   -0.151357
dtype: float64 

Level data type, axis =  0 :
category
category
object
object

Tags: python, pandas, stack, multi-index, categorical-data

Solution
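One possible workaround, offered as a sketch rather than a fix for the underlying behavior: wrap the re-cast in a small helper and call it on the result of stack(). The helper name restore_categorical_levels is hypothetical (it is not part of pandas), and it assumes that inferring the categories from the values present in each level is acceptable. It does not explain why stack() drops the dtype, and it still has to be called after the operation.

```python
import numpy as np
import pandas as pd


def restore_categorical_levels(obj, axis=0):
    """Return a copy of obj whose index (axis=0) or columns (axis=1)
    have every level re-cast to a categorical dtype.

    Hypothetical helper, not a pandas API. Categories are inferred from
    the values present in each level, so unused categories or a custom
    category order from the original index are NOT carried over.
    """
    idx = obj.index if axis == 0 else obj.columns
    if isinstance(idx, pd.MultiIndex):
        # Rebuild the MultiIndex level by level from Categorical arrays.
        new_idx = pd.MultiIndex.from_arrays(
            [pd.Categorical(idx.get_level_values(i)) for i in range(idx.nlevels)],
            names=idx.names,
        )
    else:
        new_idx = pd.CategoricalIndex(idx, name=idx.name)

    out = obj.copy()
    if axis == 0:
        out.index = new_idx
    else:
        out.columns = new_idx
    return out


# Same frame as in the question.
midx = pd.MultiIndex.from_arrays(
    [
        pd.Categorical(['A1', 'A1', 'A2', 'A2']),
        pd.Categorical(['B1', 'B2', 'B1', 'B2']),
        pd.Categorical(['C1', 'C1', 'C1', 'C1']),
    ]
)
np.random.seed(0)
df = pd.DataFrame(
    np.random.randn(2, 4),
    columns=midx,
    index=pd.Categorical(['Row1', 'Row2']),
)

stacked = df.stack(level=[0, 1, 2])             # some levels may come back as object
restored = restore_categorical_levels(stacked)  # every level is categorical again

# Print the dtype of each level after the re-cast.
for i in range(restored.index.nlevels):
    print(restored.index.get_level_values(i).dtype)
```

Chaining it as df.stack(level=[0, 1]).pipe(restore_categorical_levels) keeps the call in one expression. If the original categories matter (for example ordered categoricals or categories that do not occur in the data), the helper would need to receive them explicitly instead of inferring them.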

