首页 > 解决方案 > Python:重采样和前向填充到最近一个月

问题描述

根据以下数据集,我有一个大型数据集,其中包含每个 property_id 的每个月产生的成本的成本。

property_id period      amount 
1           2016-07-01  105908.20
1           2016-08-01  0.00
2           2016-08-01  114759.40
3           2014-05-01  -934.00
3           2014-06-01  -845.95
3           2017-12-01  92175.77
4           2015-09-01  -1859.75
4           2015-12-01  1859.75
4           2017-12-01  130105.00
5           2014-07-01  -6929.58

我想创建一个按property_id 分组的累积总和,并从该property_id 的第一个月到最近的完整月份,每个月将其结转。

我已经尝试了以下方法,其中我使用 property_id 重新采样并尝试转发填充,但它给出了一个错误

cost = cost.groupby['property_id'].apply(lambda x: x.set_index('period').resample('M').fillna(method='pad'))

TypeError:“方法”对象不可下标

下面的示例输出:

> property_id   period      amount 
>     1         2016-07-01  105908.20
>     1         2016-08-01  105908.20
>     1         2016-09-01  105908.20
>     1         2016-10-01  105908.20
>     ...           
>     1         2019-07-01  105908.20
>     2         2016-08-01  114759.40
>     2         2016-09-01  114759.40
>     2         2016-10-01  114759.40
>     ...
>     2         2019-07-01  114759.40
>     3         2014-05-01  -934.00
>     3         2014-06-01  -1779.95
>     3         2014-07-01  -1779.95
>     3         2014-08-01  -1779.95
>     ...
>     3         2017-12-01  90395.82
>     3         2018-01-01  90395.82
>     3         2018-02-01  90395.82
>     3         2018-03-01  90395.82
>     ...
>     3         2019-07-01  90395.82
>     4         2015-09-01  -1859.75
>     4         2015-10-01  -1859.75
>     4         2015-11-01  -1859.75
>     4         2015-12-01  0
>     4         2016-01-01  0
>     ...
>     4         2017-11-01  0
>     4         2017-12-01  130105.00
>     4         2018-01-01  130105.00
>     ...
>     4         2019-07-01  130105.00
>     5         2014-07-01  -6929.58
>     5         2014-08-01  -6929.58
>     5         2014-09-01  -6929.58
>     ...
>     5         2019-07-01  -6929.58

任何帮助都会很棒。

谢谢!

标签: pythonpandasresampling

解决方案


先创建DatetimeIndex,然后使用groupbywith resample

df['period'] = pd.to_datetime(df['period'])

df1 = df.set_index('period').groupby('property_id').resample('M').pad()
#alternative
#df1 = df.set_index('period').groupby('property_id').resample('M').ffill()
print (df1)

                        property_id     amount
property_id period                            
1           2016-07-31            1  105908.20
            2016-08-31            1       0.00
2           2016-08-31            2  114759.40
3           2014-05-31            3    -934.00
            2014-06-30            3    -845.95
                            ...        ...
4           2017-09-30            4    1859.75
            2017-10-31            4    1859.75
            2017-11-30            4    1859.75
            2017-12-31            4  130105.00
5           2014-07-31            5   -6929.58

[76 rows x 2 columns]

编辑:想法是通过按最后一个值过滤property_id并按条件分配月份来创建新的DataFrame,然后附加到原始文件并使用上面的解决方案:

df['period'] = pd.to_datetime(df['period'])
df = df.sort_values(['property_id','period'])

last = pd.to_datetime('now').floor('d') 
nextday = (last + pd.Timedelta(1, 'd')).day
orig_month = last.to_period('m').to_timestamp()
before_month = (last.to_period('m') - 1).to_timestamp()

last = orig_month if nextday == 1 else before_month
print (last)
2019-07-01 00:00:00

df1 = df.drop_duplicates('property_id', keep='last').assign(period=last)
print (df1)
   property_id     period     amount
1            1 2019-07-01       0.00
2            2 2019-07-01  114759.40
5            3 2019-07-01   92175.77
8            4 2019-07-01  130105.00
9            5 2019-07-01   -6929.58

df = pd.concat([df, df1])
df1 = (df.set_index('period')
        .groupby('property_id')['amount']
        .resample('MS')
        .asfreq(fill_value=0)
        .groupby(level=0)
        .cumsum())
print (df1)
property_id  period    
1            2016-07-01    105908.20
             2016-08-01    105908.20
             2016-09-01    105908.20
             2016-10-01    105908.20
             2016-11-01    105908.20

5            2019-03-01   -394986.06
             2019-04-01   -401915.64
             2019-05-01   -408845.22
             2019-06-01   -415774.80
             2019-07-01   -422704.38
Name: amount, Length: 244, dtype: float64

推荐阅读