python - Pivot and create histograms from Pandas column, with missing values
Problem description
I have the following DataFrame:
name value count total_count
0 A 0 1 20
1 A 1 2 20
2 A 2 2 20
3 A 3 2 20
4 A 4 3 20
5 A 5 3 20
6 A 6 2 20
7 A 7 2 20
8 A 8 2 20
9 A 9 1 20
----------------------------------
10 B 0 10 75
11 B 5 30 75
12 B 6 20 75
13 B 8 10 75
14 B 9 5 75
I would like to pivot the data, grouping the rows by name, and then create columns from the value and count columns aggregated into bins.
Explanation: There are 10 possible values (0-9), but not every value is present in each group. In the example above, group B is missing the values 1, 2, 3, 4 and 7. I would like to create a histogram with 5 bins, ignoring missing values, and calculate each bin's percentage of the count. The result should look like this:
name 0-1 2-3 4-5 6-7 8-9
0 A 0.150000 0.2 0.3 0.2 0.150000
1 B 0.133333 0.0 0.4 0.4 0.066667
For example, for bin 0-1 of group A, the calculation is the sum of count for the values 0 and 1 (1+2) divided by the total_count of group A:
name 0-1
0 A (1+2)/20 = 0.15
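I looked into the hist method and this StackOverflow question, but I am still struggling to figure out the right approach.
Solution
Use pd.cut to bin your feature, then use df.groupby().count() and the .unstack() method to get the DataFrame you are looking for. During the group by you can use any aggregate function (.sum(), .count(), etc.) to get the result you need. If you are looking for an example, the code below works.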
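import pandas as pd
import numpy as np

df = pd.DataFrame(
    data={'name': ['Group A', 'Group B'] * 5,
          'number': np.arange(0, 10),
          'value': np.arange(30, 40)})

# Bin 'number' into intervals; include_lowest=True keeps 0 in the first bin
df['number_bin'] = pd.cut(df['number'], bins=np.arange(0, 10), include_lowest=True)

# Option 1: sum of 'value' per name and bin
df.groupby(['number_bin', 'name'], observed=False)['value'].sum().unstack(0)

# Option 2: number of rows per name and bin
df.groupby(['number_bin', 'name'], observed=False)['value'].count().unstack(0)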
Null values in the original data will not affect the result.
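The same cut / groupby / unstack idea can be applied to the data from the question. This is only a sketch under a few assumptions: the bin edges, the bin labels, and the `bin`, `counts`, `totals` and `result` names are chosen here for illustration, and total_count is assumed to be constant within each name.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "name": ["A"] * 10 + ["B"] * 5,
    "value": list(range(10)) + [0, 5, 6, 8, 9],
    "count": [1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 10, 30, 20, 10, 5],
    "total_count": [20] * 10 + [75] * 5,
})

# Five bins of width 2; right=False makes them [0, 2), [2, 4), ..., [8, 10).
edges = [0, 2, 4, 6, 8, 10]
labels = ["0-1", "2-3", "4-5", "6-7", "8-9"]  # assumed labels
df["bin"] = pd.cut(df["value"], bins=edges, right=False, labels=labels)

# Sum the counts per (name, bin); observed=False keeps empty bins as 0.
counts = (
    df.groupby(["name", "bin"], observed=False)["count"].sum()
    .unstack("bin", fill_value=0)
)
counts.columns = counts.columns.astype(str)  # plain string column labels

# Divide each row by its group's total_count (assumed constant per name).
totals = df.groupby("name")["total_count"].first()
result = counts.div(totals, axis=0)
print(result.round(6))
```

For group A this reproduces the row in the question (0.15, 0.2, 0.3, 0.2, 0.15). For group B, the data as listed gives 20/75 for bin 6-7 and 15/75 for bin 8-9, which differs from the expected table in the question, so either the listed rows or the expected values there may contain a typo.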