How to normalize within a group based on a value from a different column

Question

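Apologies for the inelegant title; I couldn't think of another way to phrase it. This is a problem I've run into in several different forms, but I haven't found a satisfying answer.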

Example: suppose I've been tracking how many cups of tea and coffee I drink over a week:

In [17]: import random
    ...: import pandas as pd
    ...: test = pd.DataFrame({
    ...:     'drink' : ['tea'] * 5 +  ['coffee'] * 5,
    ...:     'day' : ['monday', 'tuesday', 'wednesday', 'thursday', 'friday'] * 2,
    ...:     'cups' : [random.randrange(1, 10)  for _ in range(10)]
    ...: })
    ...: test
    ...: 
    ...: 
Out[17]: 
    drink        day  cups
0     tea     monday     1
1     tea    tuesday     3
2     tea  wednesday     1
3     tea   thursday     7
4     tea     friday     1
5  coffee     monday     8
6  coffee    tuesday     1
7  coffee  wednesday     2
8  coffee   thursday     1
9  coffee     friday     1

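To compare the amounts, I'd like to normalize them. I can easily normalize by dividing by the total for each day; this is pretty much the standard example of normalization in pandas: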

In [18]: test['day_norm'] = test.groupby('day')['cups'].transform(lambda x: x / x.sum())
In [19]: test
Out[19]: 
    drink        day  cups  day_norm
0     tea     monday     1  0.111111
1     tea    tuesday     3  0.750000
2     tea  wednesday     1  0.333333
3     tea   thursday     7  0.875000
4     tea     friday     1  0.500000
5  coffee     monday     8  0.888889
6  coffee    tuesday     1  0.250000
7  coffee  wednesday     2  0.666667
8  coffee   thursday     1  0.125000
9  coffee     friday     1  0.500000
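
As a side note, the same per-day normalization can be written without a lambda. The sketch below hard-codes the cup counts from the table above (instead of drawing random ones) so that it is reproducible; `day_total` is just an illustrative name.

```python
import pandas as pd

# Fixed values copied from the table above, so the result is reproducible.
test = pd.DataFrame({
    'drink': ['tea'] * 5 + ['coffee'] * 5,
    'day': ['monday', 'tuesday', 'wednesday', 'thursday', 'friday'] * 2,
    'cups': [1, 3, 1, 7, 1, 8, 1, 2, 1, 1],
})

# transform('sum') broadcasts each day's total back onto the original rows,
# so the division is a plain element-wise operation on aligned indexes.
day_total = test.groupby('day')['cups'].transform('sum')
test['day_norm'] = test['cups'] / day_total
print(test)
```

Because `transform` returns a Series with the same index as `test`, no index handling is needed afterwards.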


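But suppose I want to see how the values change over the week by dividing each group by its Monday value, i.e. I want Monday to be 1 and every other day to be expressed relative to it. I've managed to come up with two different approaches, both of which seem convoluted.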

One: I can write a function that filters each group's DataFrame to find the Monday value, then divides the series by it:

In [20]: def normalize(df):
    ...:     monday_cups = df[df['day'] == 'monday']['cups'].mean()
    ...:     return df['cups'] / monday_cups
    ...: 
    ...: test['normalized cups'] = test.groupby('drink').apply(normalize).reset_index(level=0, drop=True)
    ...: test
    ...: 
    ...: 
Out[20]: 
    drink        day  cups  day_norm  normalized cups
0     tea     monday     1  0.111111            1.000
1     tea    tuesday     3  0.750000            3.000
2     tea  wednesday     1  0.333333            1.000
3     tea   thursday     7  0.875000            7.000
4     tea     friday     1  0.500000            1.000
5  coffee     monday     8  0.888889            1.000
6  coffee    tuesday     1  0.250000            0.125
7  coffee  wednesday     2  0.666667            0.250
8  coffee   thursday     1  0.125000            0.125
9  coffee     friday     1  0.500000            0.125

But this involves a lot of fiddling with the index to get it to match the original DataFrame's index.

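Two: I can reshape the data into a wide-format table (the values shown below come from a different random draw than the tables above):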

In [14]: summary = test.drop(columns=['normalized cups']).groupby(['drink', 'day'])['cups'].mean().unstack()

In [15]: summary
Out[15]: 
day     friday  monday  thursday  tuesday  wednesday
drink                                               
coffee       8       7         7        8          4
tea          9       9         4        8          4

The division then becomes much simpler, but I have to spend some effort getting the result back into the original format:

In [16]: summary.apply(lambda x: x / summary['monday']).stack().to_frame('normalized_cups').reset_index()
Out[16]: 
    drink        day  normalized_cups
0  coffee     friday         1.142857
1  coffee     monday         1.000000
2  coffee   thursday         1.000000
3  coffee    tuesday         1.142857
4  coffee  wednesday         0.571429
5     tea     friday         1.000000
6     tea     monday         1.000000
7     tea   thursday         0.444444
8     tea    tuesday         0.888889
9     tea  wednesday         0.444444

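Is there a more elegant way to do this? I have a vague idea about sorting the DataFrame so that Monday comes first and then doing something involving groupby and first, but I can't work it out!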
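
For illustration, the sort-then-first idea mentioned above could look roughly like the sketch below. It is not taken from the original post: the `is_monday` helper column and the `ordered` / `monday_cups` names are made up for this example, the cup counts are copied from the first table, and it assumes every drink has exactly one Monday row.

```python
import pandas as pd

# Fixed values copied from the first table in the question.
test = pd.DataFrame({
    'drink': ['tea'] * 5 + ['coffee'] * 5,
    'day': ['monday', 'tuesday', 'wednesday', 'thursday', 'friday'] * 2,
    'cups': [1, 3, 1, 7, 1, 8, 1, 2, 1, 1],
})

# Hypothetical helper column: True on Monday rows, False elsewhere.
# Sorting on it in descending order puts the Monday row first in every group.
ordered = (
    test.assign(is_monday=test['day'].eq('monday'))
        .sort_values('is_monday', ascending=False, kind='stable')
)

# transform('first') repeats each drink's first (Monday) value on all of
# that drink's rows and keeps the original index labels.
monday_cups = ordered.groupby('drink')['cups'].transform('first')

# Division aligns on index labels, so the sorted order does not matter here.
test['normalized_cups'] = test['cups'] / monday_cups
print(test)
```

One caveat of this sketch: if a drink has no Monday row, `first` silently picks some other day's value instead of raising an error.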

Tags: pandas, pandas-groupby

Answer


This is what I would do:

t2=test.loc[test.day=='monday',['drink','cups']].groupby('drink').cups.mean()
t2
Out[1282]:
drink
coffee    8
tea       1
Name: cups, dtype: int64
test['normalized_cups']=test.cups/t2.reindex(test.drink).values
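
To make the lookup step easier to follow, here is a self-contained sketch of the same idea with fixed cup counts copied from the question's first table. It swaps `reindex(...).values` for `Series.map`, which is a variation rather than the answer's exact code; the name `monday` is illustrative.

```python
import pandas as pd

# Fixed values copied from the first table in the question.
test = pd.DataFrame({
    'drink': ['tea'] * 5 + ['coffee'] * 5,
    'day': ['monday', 'tuesday', 'wednesday', 'thursday', 'friday'] * 2,
    'cups': [1, 3, 1, 7, 1, 8, 1, 2, 1, 1],
})

# Per-drink Monday value, as a Series indexed by drink name.
monday = test.loc[test['day'] == 'monday'].groupby('drink')['cups'].mean()

# map() looks up each row's drink in that Series and keeps the row index
# of `test`, so the division lines up without reindex(...).values.
test['normalized_cups'] = test['cups'] / test['drink'].map(monday)
print(test)
```

Both variants rely on the same two steps: reduce the Monday rows to one value per drink, then broadcast that value back to every row of the matching drink.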
