Python - Pivot and create histograms from Pandas column, with missing values

Problem description

Given the following DataFrame:

   name  value  count  total_count
0     A      0      1           20
1     A      1      2           20
2     A      2      2           20
3     A      3      2           20
4     A      4      3           20
5     A      5      3           20
6     A      6      2           20
7     A      7      2           20
8     A      8      2           20
9     A      9      1           20
----------------------------------
10    B      0     10           75
11    B      5     30           75
12    B      6     20           75
13    B      8     10           75
14    B      9      5           75

I would like to pivot the data so that there is one row per name, with columns built from the value and count columns aggregated into bins.

Explanation: there are 10 possible values (0-9), but not every value is present in each group. In the example above, group B is missing the values 1, 2, 3, 4 and 7. I would like to create a histogram with 5 bins, ignore the missing values, and calculate the fraction of total_count that falls in each bin. The result would look like this:

  name       0-1  2-3  4-5       6-7   8-9
0    A  0.150000  0.2  0.3  0.200000  0.15
1    B  0.133333  0.0  0.4  0.266667  0.20

For example, for bin 0-1 of group A, the calculation is the sum of count for the values 0 and 1 (1+2) divided by the total_count of group A:

  name       0-1
0    A       (1+2)/20 = 0.15

I have looked into the hist method and this StackOverflow question, but I am still struggling to figure out the right approach.

Tags: python, pandas, pivot-table, histogram

Solution


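Use pd.cut to bin your values, then use df.groupby().count().unstack() to get the DataFrame you are looking for. In the group by step you can use any aggregation function (.sum(), .count(), etc.) to get the result you need. If you are looking for an example, the code below works.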

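import pandas as pd
import numpy as np

# Toy data: ten rows alternating between two groups
df = pd.DataFrame(
    data={'name': ['Group A', 'Group B'] * 5,
          'number': np.arange(0, 10),
          'value': np.arange(30, 40)})
# Edges -1, 1, 3, 5, 7, 9 give five right-closed bins: 0-1, 2-3, 4-5, 6-7, 8-9.
# (Edges starting at 0 would leave the value 0 outside every bin.)
df['number_bin'] = pd.cut(df['number'], bins=np.arange(-1, 10, 2),
                          labels=['0-1', '2-3', '4-5', '6-7', '8-9'])
# Option 1: sum of value per group and bin (bins become columns)
sums = df.groupby(['number_bin', 'name'], observed=False)['value'].sum().unstack(0)
# Option 2: number of rows per group and bin
counts = df.groupby(['number_bin', 'name'], observed=False)['value'].count().unstack(0)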

Missing values in the original data do not affect the result.
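
Applied to the data from the question, the same idea might look like the sketch below. The bin edges, the labels, and the use of first() to pick each group's total_count are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "name": ["A"] * 10 + ["B"] * 5,
    "value": list(range(10)) + [0, 5, 6, 8, 9],
    "count": [1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 10, 30, 20, 10, 5],
    "total_count": [20] * 10 + [75] * 5,
})

labels = ["0-1", "2-3", "4-5", "6-7", "8-9"]
# Right-closed bins (-1, 1], (1, 3], ..., (7, 9] cover the values 0 through 9
df["bin"] = pd.cut(df["value"], bins=[-1, 1, 3, 5, 7, 9], labels=labels)

# Sum count per (name, bin); observed=False keeps bins that are empty
# for a group, and their sum is 0
binned = (
    df.groupby(["name", "bin"], observed=False)["count"]
      .sum()
      .unstack("bin")
)
# Plain string column labels make reset_index() straightforward
binned.columns = binned.columns.astype(str)

# Divide each row by that group's total_count (constant within a group)
totals = df.groupby("name")["total_count"].first()
result = binned.div(totals, axis=0).reset_index()
result.columns.name = None
print(result)
```

If total_count were not available, dividing by binned.sum(axis=1) instead would normalise each row by the counts actually present in that group.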

