How to interpret pandas value_counts() output?

Question

I have a DataFrame (df):

        date                O_3     NO_2        SO_2        PM10        PM25        CO      Label
    0   2001-01-01 01:00:00 7.86    67.120003   26.459999   32.349998   12.505127   0.45    2.0
    1   2001-01-01 02:00:00 7.21    70.620003   20.879999   40.709999   12.505127   0.48    2.0
    2   2001-01-01 03:00:00 7.11    72.629997   21.580000   50.209999   12.505127   0.41    2.0
    3   2001-01-01 04:00:00 7.14    75.029999   19.270000   54.880001   12.505127   0.51    2.0
    4   2001-01-01 05:00:00 8.46    66.589996   13.640000   42.340000   12.505127   0.19    2.0
    ... ... ... ... ... ... ... ... ...
139603  2018-04-30 20:00:00 63.00   58.000000   4.000000    2.000000    2.000000    0.30    1.0
139604  2018-04-30 21:00:00 49.00   65.000000   4.000000    5.000000    4.000000    0.30    2.0
139605  2018-04-30 22:00:00 49.00   58.000000   4.000000    5.000000    3.000000    0.30    2.0
139606  2018-04-30 23:00:00 48.00   52.000000   4.000000    7.000000    7.000000    0.30    2.0
139607  2018-05-01 00:00:00 52.00   43.000000   4.000000    6.000000    4.000000    0.30    1.0

I wanted to see how the 'Label' values are distributed, so I ran:

# Variability of 'Labels' values
reshape_df['Label'].value_counts()

I get:

2.0    80435
1.0    39393
3.0    15045
4.0     3295
5.0     1440
Name: Label, dtype: int64

I then added a new column showing, for each row, the name of the column that holds the maximum value:

# Create column with max pollutant name
reshape_df['Max_pollutant'] = reshape_df.eq(reshape_df.max(1), axis=0).dot(reshape_df.columns)

I get:

date                        O_3     NO_2        SO_2        PM10        PM25        CO      Label       Max_pollutant
0       2001-01-01 01:00:00 7.86    67.120003   26.459999   32.349998   12.505127   0.45    2.0         NO_2
1       2001-01-01 02:00:00 7.21    70.620003   20.879999   40.709999   12.505127   0.48    2.0         NO_2
2       2001-01-01 03:00:00 7.11    72.629997   21.580000   50.209999   12.505127   0.41    2.0         NO_2
3       2001-01-01 04:00:00 7.14    75.029999   19.270000   54.880001   12.505127   0.51    2.0         NO_2
4       2001-01-01 05:00:00 8.46    66.589996   13.640000   42.340000   12.505127   0.19    2.0         NO_2
... ... ... ... ... ... ... ... ... ...
139603  2018-04-30 20:00:00 63.00   58.000000   4.000000    2.000000    2.000000    0.30    1.0         O_3
139604  2018-04-30 21:00:00 49.00   65.000000   4.000000    5.000000    4.000000    0.30    2.0         NO_2
139605  2018-04-30 22:00:00 49.00   58.000000   4.000000    5.000000    3.000000    0.30    2.0         NO_2
139606  2018-04-30 23:00:00 48.00   52.000000   4.000000    7.000000    7.000000    0.30    2.0         NO_2
139607  2018-05-01 00:00:00 52.00   43.000000   4.000000    6.000000    4.000000    0.30    1.0         O_3

If I check the distribution of 'Max_pollutant':

# Variability of 'Max_pollutant' names
reshape_df['Max_pollutant'].value_counts()

I get the following output:

NO_2           91155
O_3            43166
PM10            4760
O_3NO_2          417
NO_2PM10          48
SO_2              23
O_3PM10           22
PM25              15
O_3NO_2PM10        2
Name: Max_pollutant, dtype: int64

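I don't quite understand the values in which two or more pollutants appear. For example, 'O_3NO_2' = 417: does this mean that O_3 had the same maximum value as NO_2 in those rows?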

How can I print those rows, specifically to see the readings for each pollutant?

Tags: python, pandas, dataframe, data-science

Answer


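Yes, those "strange" values appear when two or more columns share the same maximum value in a row. eq() marks every column that equals the row maximum, and dot() concatenates the names of all marked columns, so a tie between O_3 and NO_2 produces 'O_3NO_2'.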
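To make the mechanism concrete, here is a minimal sketch on a toy frame. The name toy and its values are made up for illustration and are not taken from the original data; only the eq/max/dot expression mirrors the question.

```python
import pandas as pd

# Toy data (hypothetical values, NOT from the original dataset)
toy = pd.DataFrame({
    "O_3":  [7.0, 65.0, 5.0],
    "NO_2": [67.0, 65.0, 3.0],
    "PM10": [32.0, 5.0, 5.0],
})

# True wherever a cell equals the maximum of its row
is_max = toy.eq(toy.max(axis=1), axis=0)

# Boolean-times-string dot product: True keeps the column name,
# False contributes an empty string, and the pieces are concatenated
max_pollutant = is_max.dot(toy.columns)

print(max_pollutant.tolist())
# ['NO_2', 'O_3NO_2', 'O_3PM10']
```

Row 0 has a single maximum, so it gets one name; rows 1 and 2 each contain a tie, so both column names are joined together.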

For example, you can print them with the following command:

reshape_df.loc[reshape_df['Max_pollutant']=='O_3NO_2']
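If you would rather see every tied row at once instead of filtering one combination at a time, a sketch along these lines may help. It assumes a small stand-in frame named toy with made-up values; on the real data you would apply the same idea to the pollutant columns of reshape_df. It also shows idxmax, which is a different technique from the one in the question: it returns only the first column that reaches the maximum, so ties are not visible in its output.

```python
import pandas as pd

# Toy stand-in for the pollutant columns (hypothetical values)
toy = pd.DataFrame({
    "O_3":  [7.0, 65.0, 5.0],
    "NO_2": [67.0, 65.0, 3.0],
    "PM10": [32.0, 5.0, 5.0],
})

# True wherever a cell equals the maximum of its row
is_max = toy.eq(toy.max(axis=1), axis=0)

# Keep only the rows in which more than one column holds the maximum
tied_rows = toy[is_max.sum(axis=1) > 1]
print(tied_rows)

# Alternative: idxmax reports just the first column with the maximum
first_max = toy.idxmax(axis=1)
print(first_max.tolist())
```

Which of the two is preferable depends on whether ties should be reported or silently resolved to a single pollutant.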

