首页 > 解决方案 > 如何在 Pandas 中使用 Groupby() 回填和转发对应于每个月的开始和结束的填充值

问题描述

假设我有一个数据框,其中包含如下所示的列Date$

>>> import pandas as pd
>>> data = [['2021-01-02', 1.0], ['2021-01-05', 2.0], ['2021-02-05', 3.0]]
>>> df = pd.DataFrame(data, columns=['Date','$'])

由于原始数据有日期空白,我使用此 StackOverflow 帖子中回答的建议填补了日期空白,如下所示:

>>> df.Date = pd.to_datetime(df.Date)
>>> mn = df.Date.min()
>>> mx = df.Date.max()
>>>
>>> dr = pd.date_range(
...     mn - pd.tseries.offsets.MonthBegin(),
...     mx + pd.tseries.offsets.MonthEnd(),
...     name="Date",
... )
>>>
>>> df = df.set_index("Date").reindex(dr).ffill().bfill().reset_index()
>>> print(df)
         Date    $
0  2021-01-01  1.0
1  2021-01-02  1.0
2  2021-01-03  1.0
3  2021-01-04  1.0
4  2021-01-05  2.0
5  2021-01-06  2.0
6  2021-01-07  2.0
7  2021-01-08  2.0
8  2021-01-09  2.0
9  2021-01-10  2.0
10 2021-01-11  2.0
11 2021-01-12  2.0
12 2021-01-13  2.0
13 2021-01-14  2.0
14 2021-01-15  2.0
15 2021-01-16  2.0
16 2021-01-17  2.0
17 2021-01-18  2.0
18 2021-01-19  2.0
19 2021-01-20  2.0
20 2021-01-21  2.0
21 2021-01-22  2.0
22 2021-01-23  2.0
23 2021-01-24  2.0
24 2021-01-25  2.0
25 2021-01-26  2.0
26 2021-01-27  2.0
27 2021-01-28  2.0
28 2021-01-29  2.0
29 2021-01-30  2.0
30 2021-01-31  2.0
31 2021-02-01  2.0 # <== here, the $ value should be 3.0 and onward
32 2021-02-02  2.0
33 2021-02-03  2.0
34 2021-02-04  2.0
35 2021-02-05  3.0
36 2021-02-06  3.0
37 2021-02-07  3.0
38 2021-02-08  3.0
39 2021-02-09  3.0
40 2021-02-10  3.0
41 2021-02-11  3.0
42 2021-02-12  3.0
43 2021-02-13  3.0
44 2021-02-14  3.0
45 2021-02-15  3.0
46 2021-02-16  3.0
47 2021-02-17  3.0
48 2021-02-18  3.0
49 2021-02-19  3.0
50 2021-02-20  3.0
51 2021-02-21  3.0
52 2021-02-22  3.0
53 2021-02-23  3.0
54 2021-02-24  3.0
55 2021-02-25  3.0
56 2021-02-26  3.0
57 2021-02-27  3.0
58 2021-02-28  3.0

使用这种方法,前向填充,ffill,而不检查月份的边界,将$值复制得有点过远,如您在上面看到的那样。我环顾 StackOverflow,发现(例如,这篇文章)有一种方法可以groupby()用来了解每个月的边界。这将我引向以下这一点:

>>> start_date_of_each_month = (df.set_index('Date').index.to_series().groupby(pd.Grouper(freq='M')).min())
>>> start_date_of_each_month
Date
2021-01-31   2021-01-01
2021-02-28   2021-02-01
Freq: M, Name: Date, dtype: datetime64[ns]

问:我如何利用它来更正$上述情况,以便每个月的值包含在该特定月份内?特别是,我如何将其转换df为这样的?

         Date    $
0  2021-01-01  1.0
1  2021-01-02  1.0
2  2021-01-03  1.0
3  2021-01-04  1.0
4  2021-01-05  2.0
5  2021-01-06  2.0
6  2021-01-07  2.0
7  2021-01-08  2.0
8  2021-01-09  2.0
9  2021-01-10  2.0
10 2021-01-11  2.0
11 2021-01-12  2.0
12 2021-01-13  2.0
13 2021-01-14  2.0
14 2021-01-15  2.0
15 2021-01-16  2.0
16 2021-01-17  2.0
17 2021-01-18  2.0
18 2021-01-19  2.0
19 2021-01-20  2.0
20 2021-01-21  2.0
21 2021-01-22  2.0
22 2021-01-23  2.0
23 2021-01-24  2.0
24 2021-01-25  2.0
25 2021-01-26  2.0
26 2021-01-27  2.0
27 2021-01-28  2.0
28 2021-01-29  2.0
29 2021-01-30  2.0
30 2021-01-31  2.0
31 2021-02-01  3.0 # <== here, the $ value should be 3.0
32 2021-02-02  3.0
33 2021-02-03  3.0
34 2021-02-04  3.0
35 2021-02-05  3.0
36 2021-02-06  3.0
37 2021-02-07  3.0
38 2021-02-08  3.0
39 2021-02-09  3.0
40 2021-02-10  3.0
41 2021-02-11  3.0
42 2021-02-12  3.0
43 2021-02-13  3.0
44 2021-02-14  3.0
45 2021-02-15  3.0
46 2021-02-16  3.0
47 2021-02-17  3.0
48 2021-02-18  3.0
49 2021-02-19  3.0
50 2021-02-20  3.0
51 2021-02-21  3.0
52 2021-02-22  3.0
53 2021-02-23  3.0
54 2021-02-24  3.0
55 2021-02-25  3.0
56 2021-02-26  3.0
57 2021-02-27  3.0
58 2021-02-28  3.0

提前感谢您的回答/建议!

标签: pythonpandas

解决方案


一些选项:

  1. 石斑鱼
import pandas as pd

data = [['2021-01-02', 1.0], ['2021-01-05', 2.0], ['2021-02-05', 3.0]]
df = pd.DataFrame(data, columns=['Date', '$'])

df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()

dr = pd.date_range(
    mn - pd.tseries.offsets.MonthBegin(),
    mx + pd.tseries.offsets.MonthEnd(),
    name="Date",
)

df = df.set_index("Date").reindex(dr).reset_index()

df['$'] = df.groupby(pd.Grouper(key='Date', freq='1M'))['$'].ffill().bfill()

print(df)
  1. dt.strftime %m:
import pandas as pd

data = [['2021-01-02', 1.0], ['2021-01-05', 2.0], ['2021-02-05', 3.0]]
df = pd.DataFrame(data, columns=['Date', '$'])

df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()

dr = pd.date_range(
    mn - pd.tseries.offsets.MonthBegin(),
    mx + pd.tseries.offsets.MonthEnd(),
    name="Date",
)

df = df.set_index("Date").reindex(dr).reset_index()

df['$'] = df.groupby(df['Date'].dt.strftime('%m'))['$'].ffill().bfill()

print(df)
  1. 重采样
import pandas as pd

data = [['2021-01-02', 1.0], ['2021-01-05', 2.0], ['2021-02-05', 3.0]]
df = pd.DataFrame(data, columns=['Date', '$'])

df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()

dr = pd.date_range(
    mn - pd.tseries.offsets.MonthBegin(),
    mx + pd.tseries.offsets.MonthEnd(),
    name="Date",
)

df = df.set_index("Date").reindex(dr)

df['$'] = df.resample('M')['$'].ffill().bfill()

df = df.reset_index()

print(df)

所有产品:

         Date    $
0  2021-01-01  1.0
1  2021-01-02  1.0
2  2021-01-03  1.0
3  2021-01-04  1.0
4  2021-01-05  2.0
5  2021-01-06  2.0
6  2021-01-07  2.0
7  2021-01-08  2.0
      ...     
30 2021-01-31  2.0
31 2021-02-01  3.0
32 2021-02-02  3.0
33 2021-02-03  3.0
34 2021-02-04  3.0
35 2021-02-05  3.0
36 2021-02-06  3.0
37 2021-02-07  3.0
38 2021-02-08  3.0
      ...
57 2021-02-27  3.0
58 2021-02-28  3.0

根据您希望如何处理没有值的月份,您可以apply使用 ffill 和 bfill 代替:

import pandas as pd

data = [['2020-12-02', 1.0], ['2021-02-05', 2.0], ['2021-03-05', 3.0]]
df = pd.DataFrame(data, columns=['Date', '$'])

df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()

dr = pd.date_range(
    mn - pd.tseries.offsets.MonthBegin(),
    mx + pd.tseries.offsets.MonthEnd(),
    name="Date",
)

df = df.set_index("Date").reindex(dr)

df['$'] = df.resample('M')['$'].apply(lambda s: s.ffill().bfill())

df = df.reset_index()

print(df.to_string())

一月没有值,因此它保持为 NaN ,而不是以前的方法,它会用最后一个已知值填充一月1.0

          Date    $
0   2020-12-01  1.0
       ...
30  2020-12-31  1.0
31  2021-01-01  NaN
       ...
61  2021-01-31  NaN
62  2021-02-01  2.0
       ...
89  2021-02-28  2.0
90  2021-03-01  3.0
91  2021-03-02  3.0
       ...
120 2021-03-31  3.0

推荐阅读