Groupby and reverse-calculate one column based on pct_change in Python

Problem description

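I have a dataframe df1 with four columns, shown below. Assume the date range for every city runs from 2019-01-01 to 2019-07-01. I want to group by city and calculate price backwards from its value on 2019-07-01 and the pct_change column: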

   city        date  price  pct_change
0    bj  2019-01-01    NaN         NaN
1    bj  2019-02-01    NaN       -0.03
2    bj  2019-03-01    NaN        0.16
3    bj  2019-04-01    NaN        0.07
4    bj  2019-05-01    NaN        0.19
5    bj  2019-06-01    NaN       -0.05
6    bj  2019-07-01    6.0       -0.02
7    gz  2019-01-01    NaN         NaN
8    gz  2019-02-01    NaN        0.03
9    gz  2019-03-01    NaN        0.00
10   gz  2019-04-01    NaN        0.03
11   gz  2019-05-01    NaN        0.00
12   gz  2019-06-01    NaN        0.06
13   gz  2019-07-01    NaN        0.07
14   gz  2019-08-01    8.9       -0.02
15   sh  2019-02-01    NaN        0.04
16   sh  2019-03-01    NaN       -0.04
17   sh  2019-04-01    NaN       -0.04
18   sh  2019-05-01    NaN       -0.04
19   sh  2019-06-01    NaN       -0.04
20   sh  2019-07-01    NaN       -0.01
21   sh  2019-08-01    7.5       -0.01
22   sz  2019-02-01    NaN       -0.03
23   sz  2019-03-01    NaN        0.10
24   sz  2019-04-01    NaN       -0.04
25   sz  2019-05-01    NaN       -0.16
26   sz  2019-06-01    NaN        0.12
27   sz  2019-07-01    7.0        0.00

For example, in Excel I can reverse-calculate the price in the fifth row as 6.0/(1+(-0.02)) = 6.12, the price in the fourth row as 6.12/(1+(-0.05)) = 6.44, and so on.
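
The Excel recurrence above amounts to price[t-1] = price[t] / (1 + pct_change[t]). A minimal sketch of it in plain Python for the bj rows, assuming the values are already ordered by date; the function name back_fill_prices is hypothetical and not part of the question:

```python
def back_fill_prices(last_price, pct_changes):
    """Walk backwards from the last known price.

    pct_changes[i] is the change from period i-1 to period i. The first
    entry is ignored because no price before the first period is needed.
    """
    prices = [last_price]
    # Undo each change, newest first, skipping the first period's change.
    for change in reversed(pct_changes[1:]):
        prices.append(prices[-1] / (1 + change))
    # Return the prices in chronological order.
    return prices[::-1]


# bj rows from df1: pct_change for 2019-01-01 through 2019-07-01.
bj_changes = [None, -0.03, 0.16, 0.07, 0.19, -0.05, -0.02]
bj_prices = back_fill_prices(6.0, bj_changes)
print([round(p, 2) for p in bj_prices])
```

The last two values before 6.0 come out as 6.44 and 6.12, matching the Excel arithmetic.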

Is it possible to get an expected result like df2 in Python (it does not have to be exactly the same)?

   city        date  price  pct_change
0    bj  2019-01-01   4.49       -0.03
1    bj  2019-02-01   4.34        0.16
2    bj  2019-03-01   5.04        0.07
3    bj  2019-04-01   5.39        0.19
4    bj  2019-05-01   6.43       -0.05
5    bj  2019-06-01   6.11       -0.02
6    bj  2019-07-01   6.00        0.05
7    gz  2019-01-01   7.58        0.03
8    gz  2019-02-01   7.79        0.00
9    gz  2019-03-01   7.80        0.03
10   gz  2019-04-01   8.04        0.00
11   gz  2019-05-01   8.04        0.06
12   gz  2019-06-01   8.52        0.07
13   gz  2019-07-01   9.10       -0.02
14   gz  2019-08-01   8.90        0.00
15   sh  2019-01-01   8.81        0.04
16   sh  2019-02-01   9.16        0.02
17   sh  2019-03-01   8.79       -0.04
18   sh  2019-04-01   8.43       -0.12
19   sh  2019-05-01   8.06       -0.04
20   sh  2019-06-01   7.70        0.07
21   sh  2019-07-01   7.60       -0.01
22   sh  2019-08-01   7.50        0.06
23   sz  2019-01-01   7.30       -0.03
24   sz  2019-02-01   7.10        0.10
25   sz  2019-03-01   7.80       -0.04
26   sz  2019-04-01   7.45       -0.16
27   sz  2019-05-01   6.28        0.12
28   sz  2019-06-01   7.02        0.00
29   sz  2019-07-01   7.00       -0.04

Note: if I have df3 as follows:

   city        date  price
0    bj  2019-01-01   4.49
1    bj  2019-02-01   4.34
2    bj  2019-03-01   5.04
3    bj  2019-04-01   5.39
4    bj  2019-05-01   6.43
5    bj  2019-06-01   6.11
6    bj  2019-07-01   6.00
7    gz  2019-01-01   7.58
8    gz  2019-02-01   7.79
9    gz  2019-03-01   7.80
10   gz  2019-04-01   8.04
11   gz  2019-05-01   8.04
12   gz  2019-06-01   8.52
13   gz  2019-07-01   9.10
14   gz  2019-08-01   8.90
15   sh  2019-01-01   8.81
16   sh  2019-02-01   9.16
17   sh  2019-03-01   8.79
18   sh  2019-04-01   8.43
19   sh  2019-05-01   8.06
20   sh  2019-06-01   7.70
21   sh  2019-07-01   7.60
22   sh  2019-08-01   7.50
23   sz  2019-01-01   7.30
24   sz  2019-02-01   7.10
25   sz  2019-03-01   7.80
26   sz  2019-04-01   7.45
27   sz  2019-05-01   6.28
28   sz  2019-06-01   7.02
29   sz  2019-07-01   7.00

I can get df2 with the following code:

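import pandas as pd

# data is assumed to hold df3; add NaN rows for missing (city, date) pairs
data = data.set_index(['city', 'date']).unstack().stack(dropna=False).reset_index()
data['date'] = pd.to_datetime(data['date']).dt.date
data = data.sort_values(by=["city", "date"])
# percentage change relative to the previous date within each city
data['pct_change'] = data.groupby('city').price.pct_change()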

Thanks for your help.

Tags: python, pandas, numpy

Solution


You can use cumprod within a groupby, but you need to reverse the data twice with [::-1], for example:

df1['estimate_price'] = (
    df1.fillna({'price': 1})  # neutral factor for the later multiplication
       .groupby('city')
       .apply(lambda x: (x['price'] / (1 + x['pct_change'].shift(-1).fillna(0)))[::-1].cumprod()[::-1])
       .reset_index(level=0, drop=True)  # drop the city level to use index alignment
)
print(df1)
   city        date  price  pct_change  estimate_price
0    bj  2019-01-01    NaN         NaN        4.498224
1    bj  2019-02-01    NaN       -0.03        4.363278
2    bj  2019-03-01    NaN        0.16        5.061402
3    bj  2019-04-01    NaN        0.07        5.415700
4    bj  2019-05-01    NaN        0.19        6.444683
5    bj  2019-06-01    NaN       -0.05        6.122449
6    bj  2019-07-01    6.0       -0.02        6.000000
7    gz  2019-01-01    NaN         NaN        7.547443
8    gz  2019-02-01    NaN        0.03        7.773866
9    gz  2019-03-01    NaN        0.00        7.773866
10   gz  2019-04-01    NaN        0.03        8.007082
11   gz  2019-05-01    NaN        0.00        8.007082
12   gz  2019-06-01    NaN        0.06        8.487507
13   gz  2019-07-01    NaN        0.07        9.081633
14   gz  2019-08-01    8.9       -0.02        8.900000
15   sh  2019-02-01    NaN        0.04        9.009609
16   sh  2019-03-01    NaN       -0.04        8.649225
17   sh  2019-04-01    NaN       -0.04        8.303256
18   sh  2019-05-01    NaN       -0.04        7.971125
19   sh  2019-06-01    NaN       -0.04        7.652280
20   sh  2019-07-01    NaN       -0.01        7.575758
21   sh  2019-08-01    7.5       -0.01        7.500000
22   sz  2019-02-01    NaN       -0.03        7.045905
23   sz  2019-03-01    NaN        0.10        7.750496
24   sz  2019-04-01    NaN       -0.04        7.440476
25   sz  2019-05-01    NaN       -0.16        6.250000
26   sz  2019-06-01    NaN        0.12        7.000000
27   sz  2019-07-01    7.0        0.00        7.000000
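
To make the reversed cumprod easier to follow, here is a self-contained sketch on a small stand-in for df1. It is an illustration rather than the answer's exact code: it assumes rows are sorted by date within each city and that only the last row of each city has a known price. The helper name reverse_prices is hypothetical, and the groups are looped over explicitly instead of using groupby.apply:

```python
import numpy as np
import pandas as pd

# Small stand-in for df1: one known price per city, on its last date.
df = pd.DataFrame({
    "city": ["bj"] * 3 + ["sz"] * 3,
    "date": ["2019-05-01", "2019-06-01", "2019-07-01"] * 2,
    "price": [np.nan, np.nan, 6.0, np.nan, np.nan, 7.0],
    "pct_change": [0.19, -0.05, -0.02, -0.16, 0.12, 0.00],
})


def reverse_prices(group):
    # Divisor that takes the next row's price back to this row; 1 on the last row.
    divisor = 1 + group["pct_change"].shift(-1).fillna(0)
    # Known price on the last row, neutral factor 1 everywhere else.
    factor = group["price"].fillna(1) / divisor
    # Cumulative product from the last row upwards, then restore the order.
    return factor[::-1].cumprod()[::-1]


parts = [reverse_prices(group) for _, group in df.groupby("city")]
# Each part keeps its original row labels, so assignment aligns on the index.
df["estimate_price"] = pd.concat(parts)
print(df.round(4))
```

Each row's factor is 1 / (1 + next row's pct_change), except the last row, which carries the known price. The cumulative product taken from the bottom up therefore reproduces the same chain of divisions as the Excel calculation.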
