Is there a way to group by in pandas and then count unique values where another column has a specified value?

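Problem description

I have a pandas DataFrame with many columns. For simplicity, assume the columns are 'country', 'time_bucket', 'category', and 'id'. The 'category' can be 'staff' or 'student'.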

import pandas as pd

data = {'country': ['A', 'A', 'A', 'B', 'B'],
        'time_bucket': ['8', '8', '8', '8', '9'],
        'category': ['staff', 'staff', 'student', 'student', 'staff'],
        'id': ['101', '172', '122', '142', '132'],
        }

df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])
df


  country time_bucket category   id
0       A           8    staff  101
1       A           8    staff  172
2       A           8  student  122
3       B           8  student  142
4       B           9    staff  132

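I want to find the total number of staff and the total number of students for a country within a particular time bucket, and add them as new columns.

I can get the total number of persons for a country within a particular time bucket: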

df['persons_count'] = df.groupby(['time_bucket','country'])['id'].transform('nunique')

  country time_bucket category   id  persons_count
0       A           8    staff  101              3
1       A           8    staff  172              3
2       A           8  student  122              3
3       B           8  student  142              1
4       B           9    staff  132              1

However, I can't figure out how to take 'category' into account and add it to my code.

I want something like this:

  country time_bucket category   id  staff_count  student_count
0       A           8    staff  101            2              1
1       A           8    staff  172            2              1
2       A           8  student  122            2              1
3       B           8  student  142            0              1
4       B           9    staff  132            1              0

Any suggestions would be appreciated!


Adding a new example that shows why a count of unique 'id' values is needed:

import pandas as pd

data = {'country': ['A', 'A', 'A', 'A', 'B', 'B'],
        'time_bucket': ['8', '8', '8', '8', '8', '9'],
        'category': ['staff', 'staff', 'student', 'student', 'student', 'staff'],
        'id': ['101', '172', '122', '122', '142', '132'],
        }

df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])
df

  country time_bucket category   id
0       A           8    staff  101
1       A           8    staff  172
2       A           8  student  122
3       A           8  student  122
4       B           8  student  142
5       B           9    staff  132

I want something like this:

  country time_bucket category   id  staff_count  student_count
0       A           8    staff  101            2              1
1       A           8    staff  172            2              1
2       A           8  student  122            2              1
3       A           8  student  122            2              1
4       B           8  student  142            0              1
5       B           9    staff  132            1              0

Tags: python, pandas, dataframe, pandas-groupby, unique

Solution


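We can use a groupby operation with apply. apply accepts a function as its argument, and that function receives the sub-DataFrame of each group. Using the data you provided and grouping by [country, time_bucket], it will receive 3 rows for [A, 8], 1 row for [B, 8], and 1 row for [B, 9].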
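To make the grouping concrete, here is a minimal sketch that iterates over the same groups and records how many rows each one holds. The name group_sizes is introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'time_bucket': ['8', '8', '8', '8', '9'],
    'category': ['staff', 'staff', 'student', 'student', 'staff'],
    'id': ['101', '172', '122', '142', '132'],
})

# Iterating over a GroupBy yields (key, sub-DataFrame) pairs, one per group;
# apply calls its function once for each of these sub-DataFrames.
group_sizes = {}  # hypothetical helper dict, used only for this illustration
for key, group in df.groupby(['country', 'time_bucket']):
    group_sizes[key] = len(group)
    print(key, len(group))
```

Each key is a (country, time_bucket) tuple, and the recorded sizes are the row counts listed above.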

To get the output you asked for:

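import pandas as pd
from collections import Counter

data = {'country':  ['A', 'A', 'A', 'B', 'B'],
        'time_bucket': ['8', '8', '8', '8', '9'],
        'category': ['staff', 'staff', 'student', 'student', 'staff'],
        'id': ['101', '172', '122', '142', '132'],
        }

df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])


def category_counter(group):
    # group is the sub-DataFrame for one (country, time_bucket) pair.
    # Counter counts rows per category, so a repeated id is counted once per row.
    counter = Counter(group.category.tolist())
    group = group.copy()  # avoid modifying the frame that apply passed in
    for k in ['staff', 'student']:
        # a Counter returns 0 for a category that is absent from the group
        group[k+'_count'] = counter[k]
    return group


# group_keys=False keeps the original row index in the result
df.groupby(['country', 'time_bucket'], group_keys=False).apply(category_counter)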

Output:

  country time_bucket category   id  staff_count  student_count
0       A           8    staff  101            2              1
1       A           8    staff  172            2              1
2       A           8  student  122            2              1
3       B           8  student  142            0              1
4       B           9    staff  132            1              0
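The Counter-based function counts rows, so in the updated example from the question the duplicated student id '122' would be counted twice. A minimal sketch of one way to count unique ids per category instead, assuming the same column names: mask the ids of other categories with where, then reuse the transform('nunique') call from the question. This is a different technique from the apply approach, not a variant of it.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'time_bucket': ['8', '8', '8', '8', '8', '9'],
    'category': ['staff', 'staff', 'student', 'student', 'student', 'staff'],
    'id': ['101', '172', '122', '122', '142', '132'],
})

keys = [df['country'], df['time_bucket']]
for cat in ['staff', 'student']:
    # Keep the id only on rows of this category; other rows become missing
    # values, which nunique ignores by default.
    ids = df['id'].where(df['category'].eq(cat))
    # transform broadcasts each group's unique count back onto every row.
    df[cat + '_count'] = ids.groupby(keys).transform('nunique')

print(df)
```

Because the counts come from transform, the result keeps one row per original row, as in the desired output of the question.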

An alternative with apply that returns one row per group instead of repeating the counts on every row:

import pandas as pd
from collections import Counter

data = {'country':  ['A', 'A', 'A', 'B', 'B'],
        'time_bucket': ['8', '8', '8', '8', '9'],
        'category': ['staff', 'staff', 'student', 'student', 'staff'],
        'id': ['101', '172', '122', '142', '132'],
        }

df = pd.DataFrame(data, columns=['country', 'time_bucket', 'category', 'id'])


def category_counter(group):
    # group is the sub-DataFrame for one (country, time_bucket) pair
    counter = Counter(group.category.tolist())
    return_data = {}
    for k in ['staff', 'student']:
        # a Counter returns 0 for a category that is absent from the group
        return_data[k+'_count'] = counter[k]

    # returning a Series yields one summary row per group
    return pd.Series(return_data)


df.groupby(['country', 'time_bucket']).apply(category_counter)

Output:

                     staff_count  student_count
country time_bucket
A       8                      2              1
B       8                      0              1
        9                      1              0
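A summary table of this shape can also be built without apply. The following is a sketch under the assumption that unique ids should be counted, using the updated example from the question: group by category as well, call nunique, and pivot the category level into columns. The names summary and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'time_bucket': ['8', '8', '8', '8', '8', '9'],
    'category': ['staff', 'staff', 'student', 'student', 'student', 'staff'],
    'id': ['101', '172', '122', '122', '142', '132'],
})

summary = (
    df.groupby(['country', 'time_bucket', 'category'])['id']
      .nunique()               # unique ids per (country, time_bucket, category)
      .unstack(fill_value=0)   # one column per category, 0 where none exist
      .add_suffix('_count')    # staff -> staff_count, student -> student_count
)
print(summary)

# Optionally attach the counts to every original row.
result = df.join(summary, on=['country', 'time_bucket'])
print(result)
```

Joining on the two grouping columns matches them against the two index levels of summary, so each row picks up the counts of its own group.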
