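python - Is there a way to group by in pandas and then count unique values only where another column has a specified value?
Problem description
I have a pandas DataFrame with many columns. For simplicity, assume the columns are 'country', 'time_bucket', 'category', and 'id'. 'category' can be either 'staff' or 'student'.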
import pandas as pd

data = {'country': ['A', 'A', 'A', 'B', 'B'],
        'time_bucket': ['8', '8', '8', '8', '9'],
        'category': ['staff', 'staff', 'student', 'student', 'staff'],
        'id': ['101', '172', '122', '142', '132'],
        }
df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])
df
  country time_bucket category   id
0       A           8    staff  101
1       A           8    staff  172
2       A           8  student  122
3       B           8  student  142
4       B           9    staff  132
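I want to find the total number of staff and the total number of students for a country within a given time bucket, and add them as new columns.
I can get the total number of people for a country within a given time bucket: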
df['persons_count'] = df.groupby(['time_bucket','country'])['id'].transform('nunique')
  country time_bucket category   id  persons_count
0       A           8    staff  101              3
1       A           8    staff  172              3
2       A           8  student  122              3
3       B           8  student  142              1
4       B           9    staff  132              1
However, I can't figure out how to take 'category' into account in my code.
I want something like this:
  country time_bucket category   id  staff_count  student_count
0       A           8    staff  101            2              1
1       A           8    staff  172            2              1
2       A           8  student  122            2              1
3       B           8  student  142            0              1
4       B           9    staff  132            1              0
Any suggestions would be greatly appreciated!
Adding a new example that shows why the count must be over unique 'id' values:
import pandas as pd

data = {'country': ['A', 'A', 'A', 'A', 'B', 'B'],
        'time_bucket': ['8', '8', '8', '8', '8', '9'],
        'category': ['staff', 'staff', 'student', 'student', 'student', 'staff'],
        'id': ['101', '172', '122', '122', '142', '132'],
        }
df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])
df
  country time_bucket category   id
0       A           8    staff  101
1       A           8    staff  172
2       A           8  student  122
3       A           8  student  122
4       B           8  student  142
5       B           9    staff  132
I want something like this:
  country time_bucket category   id  staff_count  student_count
0       A           8    staff  101            2              1
1       A           8    staff  172            2              1
2       A           8  student  122            2              1
3       A           8  student  122            2              1
4       B           8  student  142            0              1
5       B           9    staff  132            1              0
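Solution
We can use a groupby operation with apply. apply takes a function as its argument, and that function receives the sub-DataFrame for each group. With the data you provided, grouped by [country, time_bucket], it receives 3 rows for [A, 8], 1 row for [B, 8], and 1 row for [B, 9].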
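As a rough illustration of the grouping described above, iterating over the GroupBy object shows which rows land in each group. This is only a sketch on the question's first sample data; the name group_sizes is introduced here for illustration and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'time_bucket': ['8', '8', '8', '8', '9'],
    'category': ['staff', 'staff', 'student', 'student', 'staff'],
    'id': ['101', '172', '122', '142', '132'],
})

# Iterating over a GroupBy yields (key, sub-DataFrame) pairs, one per
# distinct (country, time_bucket) combination.
group_sizes = {}  # hypothetical helper dict, for illustration only
for key, group in df.groupby(['country', 'time_bucket']):
    group_sizes[key] = len(group)
    print(key, len(group))
```

The keys are tuples such as ('A', '8'), and each group holds the rows that a function passed to apply would be called on.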
To get the output you asked for:
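import pandas as pd
from collections import Counter

data = {'country': ['A', 'A', 'A', 'B', 'B'],
        'time_bucket': ['8', '8', '8', '8', '9'],
        'category': ['staff', 'staff', 'student', 'student', 'staff'],
        'id': ['101', '172', '122', '142', '132'],
        }
df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])

def category_counter(group):
    # group is the sub-DataFrame for one (country, time_bucket) pair
    group = group.copy()  # work on a copy rather than a slice of df
    # Counts rows per category; a repeated id is counted each time it appears
    counter = Counter(group.category.tolist())
    for k in ['staff', 'student']:
        # Counter returns 0 for a category missing from this group
        group[k + '_count'] = counter[k]
    return group

# group_keys=False keeps the original row index on pandas 2.0 and later
df.groupby(['country', 'time_bucket'], group_keys=False).apply(category_counter)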
Output:
  country time_bucket category   id  staff_count  student_count
0       A           8    staff  101            2              1
1       A           8    staff  172            2              1
2       A           8  student  122            2              1
3       B           8  student  142            0              1
4       B           9    staff  132            1              0
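Note that the function above counts rows, so an 'id' that appears twice in a group (as in the question's second example) would be counted twice. One possible adjustment, sketched here under the assumption that a duplicate means the same ('category', 'id') pair within a group, is to drop duplicates before counting. The function name unique_category_counter and the final join step are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'time_bucket': ['8', '8', '8', '8', '8', '9'],
    'category': ['staff', 'staff', 'student', 'student', 'student', 'staff'],
    'id': ['101', '172', '122', '122', '142', '132'],
})

def unique_category_counter(group):
    # Keep one row per (category, id) pair so a repeated id counts once.
    unique_rows = group.drop_duplicates(subset=['category', 'id'])
    counts = unique_rows['category'].value_counts()
    # Build one row of counts per original row of the group.
    return pd.DataFrame(
        {
            'staff_count': counts.get('staff', 0),
            'student_count': counts.get('student', 0),
        },
        index=group.index,
    )

per_row_counts = (
    df.groupby(['country', 'time_bucket'], group_keys=False)
      .apply(unique_category_counter)
)

# Align the counts with the original rows by index.
result = df.join(per_row_counts)
print(result)
```

Returning only the count columns and joining them back on the index keeps the sketch independent of how a given pandas version treats the grouping columns inside apply; some versions may emit a deprecation warning here.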
An alternative that does not return duplicated data (one row per group):
import pandas as pd
from collections import Counter

data = {'country': ['A', 'A', 'A', 'B', 'B'],
        'time_bucket': ['8', '8', '8', '8', '9'],
        'category': ['staff', 'staff', 'student', 'student', 'staff'],
        'id': ['101', '172', '122', '142', '132'],
        }
df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])

def category_counter(group):
    # Count rows per category in this (country, time_bucket) group
    counter = Counter(group.category.tolist())
    return_data = {}
    for k in ['staff', 'student']:
        return_data[k + '_count'] = counter[k]
    # Returning a Series yields one output row per group
    return pd.Series(return_data)

df.groupby(['country', 'time_bucket']).apply(category_counter)
Output:
                     staff_count  student_count
country time_bucket
A       8                      2              1
B       8                      0              1
        9                      1              0
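For larger frames, calling a Python function per group can be slow. A possible alternative, sketched here with built-in aggregations rather than apply, counts unique 'id' values per category and then merges the per-group table back onto the original rows. The variable names keys, counts, and result are illustrative, and the sketch assumes the only categories of interest are 'staff' and 'student'.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'time_bucket': ['8', '8', '8', '8', '8', '9'],
    'category': ['staff', 'staff', 'student', 'student', 'student', 'staff'],
    'id': ['101', '172', '122', '122', '142', '132'],
})

keys = ['country', 'time_bucket']

# Unique ids per (country, time_bucket, category), pivoted so that each
# category becomes its own column; absent combinations are filled with 0.
counts = (
    df.groupby(keys + ['category'])['id']
      .nunique()
      .unstack('category', fill_value=0)
      .reindex(columns=['staff', 'student'], fill_value=0)
      .add_suffix('_count')
)

# Attach the per-group counts to every original row.
result = df.merge(counts.reset_index(), on=keys, how='left')
print(counts)
print(result)
```

Here counts plays the role of the one-row-per-group table, and result corresponds to the per-row layout requested in the question. The reindex step is there so that both count columns exist even if one category never occurs in the data.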