If-else by column dtype in pandas

Problem description

Formatting pandas output

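I'm trying to get pandas output automatically in a form I can paste into a word processor with minimal formatting. I'm using descriptive statistics as a practice case, so I'm working with df[variable].describe(). My problem is that .describe() responds differently depending on the column's dtype (if I understand correctly).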

For a numeric column, describe() produces this output:

count    306.000000
mean      36.823529
std        6.308587
min       10.000000
25%       33.000000
50%       37.000000
75%       41.000000
max       50.000000
Name: gses_tot, dtype: float64

For a categorical column, however, it produces:

count        306
unique         3
top       Female
freq         166
Name: gender, dtype: object

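Because of this difference I need different code to capture the information I need, but I can't seem to get my code to handle the categorical variables.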

What I've tried

I've tried a few different versions of this:

for v in df.columns:
    if df[v].dtype.name == 'category': #i've also tried 'object' here
        c, u, t, f, = df[v].describe()
        print(f'******{str(v)}******')
        print(f'Largest category = {t}')
        print(f'Percentage = {(f/c)*100}%')        
    else:
        c, m, std, mi, tf, f, sf, ma, = df[v].describe()
        print(f'******{str(v)}******')
        print(f'M = {m}')
        print(f'SD = {std}')
        print(f'Range = {float(ma) - float(mi)}')
        print(f'\n')

The code in the else block works fine, but when I get to a categorical column I receive the following error:

******age****** # this is the output I want for a numerical column
M = 34.21568627450981
SD = 11.983015946197659
Range = 53.0


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-f077cc105185> in <module>
      6         print(f'Percentage = {(f/c)*100}')
      7     else:
----> 8         c, m, std, mi, tf, f, sf, ma, = df[v].describe()
      9         print(f'******{str(v)}******')
     10         print(f'M = {m}')

ValueError: not enough values to unpack (expected 8, got 4)

What I want to happen is this:

******age****** # this is the output I want for a numerical column
M = 34.21568627450981
SD = 11.983015946197659
Range = 53.0


******gender******
Largest category = female
Percentage = 52.2%


I believe that the issue is how I'm setting up the if statement with the dtype. I've rooted around to try to find out how to access the dtype properly, but I can't seem to make it work.

Advice would be much appreciated.

Tags: python, pandas

Solution


You can check which fields are present in the output of describe() and print the corresponding parts:

import pandas as pd

# One column of each kind: category, numeric, and object (strings).
df = pd.DataFrame({
    'categorical': pd.Categorical(['d', 'e', 'f']),
    'numeric': [1, 2, 3],
    'object': ['a', 'b', 'c'],
})

for v in df.columns:
    desc = df[v].describe()
    print(f'******{v}******')
    # describe() includes 'top' for object and categorical columns,
    # but not for numeric ones, so its presence selects the branch.
    if 'top' in desc:
        print(f'Largest category = {desc["top"]}')
        print(f'Percentage = {(desc["freq"] / desc["count"]) * 100:.1f}%')
    else:
        print(f'M = {desc["mean"]}')
        print(f'SD = {desc["std"]}')
        print(f'Range = {desc["max"] - desc["min"]}')
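
If you would rather branch on the dtype itself, as the question originally attempted, one possible sketch uses is_numeric_dtype and is_bool_dtype from pandas.api.types. A string column typically has dtype object rather than category, so a check against 'category' alone sends it to the numeric branch, which is consistent with the traceback pointing at the else block. The summarize helper and the sample data below are hypothetical, made up for illustration. Boolean columns are excluded from the numeric branch on the assumption that describe() summarizes them like categories.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def summarize(series):
    """Return summary lines for one column (hypothetical helper)."""
    desc = series.describe()
    lines = [f'******{series.name}******']
    # Treat only true numeric columns as numeric; booleans are assumed
    # to be described like categories (count/unique/top/freq).
    if is_numeric_dtype(series) and not is_bool_dtype(series):
        lines.append(f'M = {desc["mean"]}')
        lines.append(f'SD = {desc["std"]}')
        lines.append(f'Range = {desc["max"] - desc["min"]}')
    else:
        # object and category columns report count/unique/top/freq
        lines.append(f'Largest category = {desc["top"]}')
        lines.append(f'Percentage = {desc["freq"] / desc["count"] * 100:.1f}%')
    return lines


# Illustrative data only, not the asker's dataset.
df = pd.DataFrame({
    'gender': ['Female', 'Female', 'Male'],
    'age': [20, 30, 40],
})

for col in df.columns:
    print('\n'.join(summarize(df[col])))
    print()
```

Returning the lines instead of printing them directly makes it easier to write the result to a file for the word processor. Datetime columns are not handled in this sketch and would need their own branch.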
