New DataFrame as the result of value_counts on multiple columns of an old DataFrame

Problem description

I have a DataFrame whose data looks like this.

   pack  pear  carrot  grape  berry  apple  sum 3rd pack  box
0     1   1.0     3.0    1.0    2.0    1.0    8      NaN  1.0
1     2   1.0     1.0    3.0    0.0    3.0    8      NaN  1.0
2     3   1.0     1.0    3.0    1.0    3.0    9     True  1.0
3     4   1.0     1.0    2.0    3.0    1.0    8      NaN  2.0
4     5   2.0     3.0    4.0    0.0    0.0    9      NaN  2.0

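The 5 columns named after fruits only contain values between 0 and 6, and I want a DataFrame that shows the distribution of those individual values. It should look like this: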

         pear    carrot     grape     berry     apple
0.0  0.163636  0.145455  0.090909  0.218182  0.236364
1.0  0.472727  0.200000  0.218182  0.418182  0.454545
2.0  0.218182  0.309091  0.290909  0.254545  0.145455
3.0  0.090909  0.272727  0.236364  0.109091  0.127273
4.0  0.036364  0.072727  0.090909       NaN  0.036364
5.0  0.018182       NaN  0.036364       NaN       NaN
6.0       NaN       NaN  0.036364       NaN       NaN 

I did this by creating a Series for each fruit and then concatenating the Series together:

pears_per_pack = df['pear'].value_counts(normalize=True)
carrots_per_pack = df['carrot'].value_counts(normalize=True)
grapes_per_pack = df['grape'].value_counts(normalize=True)
berrys_per_pack = df['berry'].value_counts(normalize=True)
apples_per_pack = df['apple'].value_counts(normalize=True)

df_lst = [pears_per_pack,carrots_per_pack,grapes_per_pack,berrys_per_pack,apples_per_pack]
fruit_df = pd.concat(df_lst, axis = 1)

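With only 5 columns this is fairly easy to do by hand, but the whole thing is more or less a learning opportunity for me, and I think it must violate the DRY principle. So I am asking whether there is a better way to do this, one that would still work if I needed to do it for many more columns.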

Tags: python-3.x, pandas

Solution


You can call pd.Series.value_counts on each column with apply, then reindex the result over 0..6 to make sure every value in that range appears:

>>> cols = ["pear", "carrot", "grape", "berry", "apple"]

>>> val_counts = df[cols].apply(pd.Series.value_counts, normalize=True)
>>> val_counts

     pear  carrot  grape  berry  apple
0.0   NaN     NaN    NaN    0.4    0.2
1.0   0.8     0.6    0.2    0.2    0.4
2.0   0.2     NaN    0.2    0.2    NaN
3.0   NaN     0.4    0.4    0.2    0.4
4.0   NaN     NaN    0.2    NaN    NaN

>>> result = val_counts.reindex(pd.RangeIndex(start=0, stop=6+1))
>>> result

   pear  carrot  grape  berry  apple
0   NaN     NaN    NaN    0.4    0.2
1   0.8     0.6    0.2    0.2    0.4
2   0.2     NaN    0.2    0.2    NaN
3   NaN     0.4    0.4    0.2    0.4
4   NaN     NaN    0.2    NaN    NaN
5   NaN     NaN    NaN    NaN    NaN
6   NaN     NaN    NaN    NaN    NaN
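
For anyone who wants to run the steps above outside the REPL, here is a minimal self-contained sketch. The DataFrame is rebuilt from the five sample rows in the question (only the relevant columns), and value_distribution is a hypothetical helper name introduced here for illustration, not part of pandas.

```python
import pandas as pd

# Sample data copied from the five rows shown in the question
# (only the "pack" column and the five fruit columns are included).
df = pd.DataFrame({
    "pack": [1, 2, 3, 4, 5],
    "pear": [1.0, 1.0, 1.0, 1.0, 2.0],
    "carrot": [3.0, 1.0, 1.0, 1.0, 3.0],
    "grape": [1.0, 3.0, 3.0, 2.0, 4.0],
    "berry": [2.0, 0.0, 1.0, 3.0, 0.0],
    "apple": [1.0, 3.0, 3.0, 1.0, 0.0],
})


def value_distribution(frame, cols, low=0, high=6):
    """Hypothetical helper: relative frequency of each value per column.

    Rows are the integers low..high (inclusive); a value that never
    occurs in a column shows up as NaN in that column.
    """
    counts = frame[cols].apply(pd.Series.value_counts, normalize=True)
    return counts.reindex(pd.RangeIndex(start=low, stop=high + 1))


cols = ["pear", "carrot", "grape", "berry", "apple"]
result = value_distribution(df, cols)
print(result)
```

If zeros are preferred over NaN for values that never occur, calling result.fillna(0) on the returned frame is one option; the question's expected output keeps NaN, so the sketch leaves it as is.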

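cols is the list of columns to apply this to. It is written out by hand here, but it can be selected automatically depending on the situation, for example the columns from position 1 up to position 6, or the columns whose names start with a certain prefix.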
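
As a rough illustration of selecting the columns automatically, the sketch below shows three ways to build the column list on the sample data. The "fruit_" prefix is an assumption made up for this example; the original column names share no prefix, so the frame is renamed first just to demonstrate that variant.

```python
import pandas as pd

# Sample data from the question, with two non-fruit columns kept
# so the selection actually has something to exclude.
df = pd.DataFrame({
    "pack": [1, 2, 3, 4, 5],
    "pear": [1.0, 1.0, 1.0, 1.0, 2.0],
    "carrot": [3.0, 1.0, 1.0, 1.0, 3.0],
    "grape": [1.0, 3.0, 3.0, 2.0, 4.0],
    "berry": [2.0, 0.0, 1.0, 3.0, 0.0],
    "apple": [1.0, 3.0, 3.0, 1.0, 0.0],
    "sum": [8, 8, 9, 8, 9],
    "box": [1.0, 1.0, 1.0, 2.0, 2.0],
})

# By position: columns 1 through 5 (the slice end, 6, is exclusive).
cols_by_position = list(df.columns[1:6])

# By label: every column from "pear" through "apple"
# (label-based slices include both endpoints).
cols_by_label = list(df.loc[:, "pear":"apple"].columns)

# By name prefix: assumes a hypothetical naming scheme in which the
# fruit columns share a "fruit_" prefix (not the case in the original data).
renamed = df.rename(columns={c: f"fruit_{c}" for c in cols_by_position})
cols_by_prefix = [c for c in renamed.columns if c.startswith("fruit_")]

# Any of these lists can then be used in place of the hand-written one.
val_counts = df[cols_by_position].apply(pd.Series.value_counts, normalize=True)
print(val_counts)
```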

