Add a row for each missing period and compute the average of the last 3 months for that period

Problem description

I am trying to write code that adds missing periods to a DataFrame and computes an average for each of them. Consider the following example:

Invoice Date    Amount
9   01/2020 227500
4   02/2020 56000
0   03/2020 22000
1   05/2020 25000
5   06/2020 75000
2   07/2020 27000
6   08/2020 48000
3   09/2020 35000
7   10/2020 115000
8   12/2020 85000

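In the DataFrame above, the record for "11/2020" is missing. I am trying to add a record for the 11/2020 period and compute the average of the last three months, i.e. if 11/2020 is missing, take the amounts for 12/2020, 10/2020 and 9/2020, compute their mean, and add/append it to the DataFrame.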

Expected output:
Invoice Date     Amount
10      01/2020  227500.00
4       02/2020   56000.00
0       03/2020   22000.00
5       04/2020   75000.00
1       05/2020   25000.00
6       06/2020   48000.00
2       07/2020   27000.00
7       08/2020  115000.00
3       09/2020   35000.00
8       10/2020   77000.00
11      11/2020   65666.67
9       12/2020   85000.00

Note that I can get the result above with the following code:

import pandas as pd

FundAdmin = {
    'Invoice Date': ['03/2020', '05/2020', '07/2020', '09/2020', '02/2020', '04/2020', '06/2020', '08/2020', '10/2020', '12/2020',
        '01/2020'
    ],
    'Amount': [22000, 25000, 27000, 35000, 56000, 75000, 48000, 115000, 77000, 85000, 227500]
}

expected_dates = ['01/2020', '02/2020', '03/2020', '04/2020', '05/2020', '06/2020', '07/2020', '08/2020', '09/2020', '10/2020', '11/2020',
    '12/2020'
]

df = pd.DataFrame(FundAdmin, columns = ['Invoice Date', 'Amount'])
current_dates = df['Invoice Date']
missing_dates = list(set(expected_dates) - set(current_dates))
sorted_df = df.sort_values(by = 'Invoice Date')
for i in missing_dates:
    Top_3_Rows = sorted_df.tail(3)
    # print(Top_3_Rows)
    Top_3_Rows_Amount = round(Top_3_Rows.mean(), 2)
    CalcDF = {
        'Invoice Date': i,
        'Amount': float(Top_3_Rows_Amount)
    }
    FullDF = df.append(CalcDF, ignore_index = True)
print(FullDF)

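However, my code cannot handle records that are missing from the middle of the DataFrame. It adds the missing periods to the DataFrame, but it fails to pick up the values of the preceding 3 months and assigns the same average to every missing period. For example, if the record for 4/2020 is missing, the code should add a new record for 4/2020 and assign it the mean of 1/2020, 2/2020 and 3/2020. Instead, it assigns the average computed for the other missing period. See below: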

Expected Output (if both 11/2020 and 4/2020 are missing):

    Invoice Date     Amount
    10      01/2020  227500.00
    4       02/2020   56000.00
    0       03/2020   22000.00
    5       04/2020  101833.33  <---- New record inserted for 4/2020, computed as the mean of 3/2020, 2/2020 and 1/2020
    1       05/2020   25000.00
    6       06/2020   48000.00
    2       07/2020   27000.00
    7       08/2020  115000.00
    3       09/2020   35000.00
    8       10/2020   77000.00
    11      11/2020   65666.67  <---- New record inserted for 11/2020, computed as the mean of 12/2020, 10/2020 and 9/2020
    9       12/2020   85000.00

My Output (if both 11/2020 and 4/2020 are missing):

Invoice Date     Amount
10      01/2020  227500.00
4       02/2020   56000.00
0       03/2020   22000.00
5       04/2020   65666.67 <--- Value same as 11/2020
1       05/2020   25000.00
6       06/2020   48000.00
2       07/2020   27000.00
7       08/2020  115000.00
3       09/2020   35000.00
8       10/2020   77000.00
11      11/2020   65666.67 <--- This works fine.
9       12/2020   85000.00

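From what I can tell, when the missing period falls in the middle of the DataFrame, my code cannot pick up the 3 preceding records because I am using the tail() method: it always fetches the records for 9/2020, 10/2020 and 12/2020, computes their mean, and assigns that same value to 4/2020. I am a complete beginner in Python, and any help with this problem would be much appreciated.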
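The behaviour described above can be reproduced with a short sketch. It reuses the amounts from the question's code with 04/2020 and 11/2020 left out; the names `means` and `window` are illustrative and not part of the original code. Because `tail(3)` never looks at the loop variable, both missing periods receive the mean of the same three rows.

```python
import pandas as pd

# Data from the question's code, with 04/2020 and 11/2020 left out
df = pd.DataFrame({
    "Invoice Date": ["03/2020", "05/2020", "07/2020", "09/2020", "02/2020",
                     "06/2020", "08/2020", "10/2020", "12/2020", "01/2020"],
    "Amount": [22000, 25000, 27000, 35000, 56000,
               48000, 115000, 77000, 85000, 227500],
})

# Sorting the "MM/YYYY" strings is chronological here only because
# every row belongs to the same year.
sorted_df = df.sort_values(by="Invoice Date")

means = {}  # illustrative name: missing period -> value the loop would assign
for missing in ["04/2020", "11/2020"]:
    # tail(3) does not depend on `missing`, so the window is identical each time
    window = sorted_df.tail(3)
    means[missing] = round(float(window["Amount"].mean()), 2)
    print(missing, window["Invoice Date"].tolist(), means[missing])
```

Both iterations print the window 09/2020, 10/2020, 12/2020 and the value 65666.67, which matches the "My Output" table above.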

Tags: python, pandas

Solution


Does this work for you?

import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from random import randint

df_len = 100

df = pd.DataFrame({
    'Invoice': [randint(1, 10) for _ in range(df_len)],
    'Dates' : [(datetime.today() - pd.DateOffset(months=mnths_ago)).date() 
               for mnths_ago in range(df_len)],
    'Amount': [randint(1, 100000) for _ in range(df_len)],
})

# Drop 10 random rows
drop_indices = np.random.choice(df.index, 10, replace=False)
df = df.drop(drop_indices)

df

    Invoice       Dates  Amount
0         1  2020-05-19   23797
1         6  2020-04-19   54101
2        10  2020-03-19   91522
3         5  2020-02-19   48762
4         1  2020-01-19   54497
..      ...         ...     ...
93        1  2012-08-19   56834
94       10  2012-07-19   21382
95        2  2012-06-19   33056
96        1  2012-05-19   93336
98        7  2012-03-19   12406

from dateutil import relativedelta

def get_prev_mean(date):
    # Mean of the three most recent existing rows dated before `date`
    earlier = df[df.Dates < date].sort_values(by='Dates')
    return earlier.tail(3)['Amount'].mean()


# Number of whole months between the earliest and the latest date
r = relativedelta.relativedelta(df.Dates.max(), df.Dates.min())
n_months = r.years * 12 + r.months
# Assumes every date falls on the same day of the month
all_months = [(df.Dates.max() - pd.DateOffset(months=mnths_ago)).date() for mnths_ago in range(n_months + 1)]
# Keep only the months that are NOT already present in df
missing_months = [mnth for mnth in all_months if mnth not in list(df.Dates)]
dct = {mnth: get_prev_mean(mnth) for mnth in missing_months}
to_merge = pd.DataFrame({'Dates': list(dct.keys()), 'Amount': list(dct.values())})
out = pd.concat([df, to_merge], sort=False).sort_values(by='Dates').reset_index(drop=True)
out

out then holds one row per month between the earliest and the latest date. The rows added for missing months have NaN in the Invoice column, and their Amount is the mean of the three nearest earlier rows that existed in df.
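Applied to the data in the question, the same idea can be sketched with monthly periods instead of day-level dates. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes "last 3 months" means the three observed months immediately before the gap, and the names `amounts`, `full_range`, `filled` and `result` are illustrative. Under that assumption 11/2020 is averaged from 08/2020, 09/2020 and 10/2020, which gives 75666.67 rather than the 65666.67 shown in the question, where 12/2020 was included.

```python
import pandas as pd

# Data from the question's code, with 04/2020 and 11/2020 left out
df = pd.DataFrame({
    "Invoice Date": ["03/2020", "05/2020", "07/2020", "09/2020", "02/2020",
                     "06/2020", "08/2020", "10/2020", "12/2020", "01/2020"],
    "Amount": [22000, 25000, 27000, 35000, 56000,
               48000, 115000, 77000, 85000, 227500],
})

# Parse "MM/YYYY" into monthly periods so that ordering is chronological
df["Period"] = pd.to_datetime(df["Invoice Date"], format="%m/%Y").dt.to_period("M")
amounts = df.set_index("Period")["Amount"].astype(float).sort_index()

# Every month between the first and the last observed period
full_range = pd.period_range(amounts.index.min(), amounts.index.max(), freq="M")
filled = amounts.reindex(full_range)  # missing months appear as NaN

for period in full_range.difference(amounts.index):
    # Assumption: average the (up to) three observed months before the gap
    previous = amounts[amounts.index < period].tail(3)
    filled.loc[period] = round(float(previous.mean()), 2)

# Back to a two-column frame with the original "MM/YYYY" labels
result = filled.rename_axis("Invoice Date").reset_index()
result["Invoice Date"] = result["Invoice Date"].dt.strftime("%m/%Y")
print(result)
```

Each average is taken from `amounts`, which holds only the observed rows, so a filled value is never fed into a later average. If filled months should count towards later gaps, reading from `filled` inside the loop would be one way to do that.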
