Cumulative sum of units shipped within 3 days of the first shipment in pandas

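Problem description

This is a bit hard to explain, but I will do my best, so please bear with me.

I have a pandas DataFrame with ID, shipping date, and units columns. I want to total the units shipped within a 3-day window, and the windows must not overlap. For example, my DataFrame looks like this.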

ID          Shipping Date Units Expected output
153131151007    20180801    1   1
153131151007    20180828    1   2
153131151007    20180829    1   0
153131151007    20180904    1   1
153131151007    20181226    2   4
153131151007    20181227    1   0
153131151007    20181228    1   0
153131151007    20190110    1   1
153131151007    20190115    2   3
153131151007    20190116    1   0
153131151011*   20180510    1   2
153131151011*   20180511    1   0
153131151011*   20180513    1   2
153131151011*   20180515    1   0
153131151011*   20180813    1   1
153131151011*   20180822    1   2
153131151011*   20180824    1   0
153131151011*   20190103    1   1

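The code should check each date and see whether there are further shipments within the next 3 days (counting the date itself). If there are, it should sum them into the row for the current date, and make sure that rows already counted are not included in the total computed for the next date.

So for the first ID, for shipping date 20181226, it checks 1226, 1227, and 1228, adds them up, shows the result at 1226, and shows 0 in the next 2 cells.

Similarly, for the second ID, 20180510 is the first shipping date in the series. It checks 0510, 0511, and 0512, sums them into 0510, and zeroes out the rest. That is why 0511 does not take 0513 into account: 0513 belongs to a separate shipment group.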
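The rule described above can be sketched in plain Python before bringing pandas into it. This is only an illustration of the grouping logic as I read it from the expected output, not code from the question: the function name `window_totals` and the `window_days` parameter are hypothetical, and the sample rows are the first five shipments of the second ID.

```python
from datetime import date, timedelta

def window_totals(shipments, window_days=3):
    """shipments: list of (date, units) tuples sorted by date.

    Returns one total per shipment: the window's sum on the row that
    opens the window, and 0 on every other row inside that window.
    """
    totals = [0] * len(shipments)
    anchor = None  # position of the shipment that opened the current window
    for i, (day, units) in enumerate(shipments):
        # Open a new window when there is none yet, or when this date falls
        # outside anchor_date .. anchor_date + (window_days - 1).
        if anchor is None or day - shipments[anchor][0] >= timedelta(days=window_days):
            anchor = i
        totals[anchor] += units
    return totals

# First five rows of ID 153131151011* from the table above.
shipments = [
    (date(2018, 5, 10), 1),
    (date(2018, 5, 11), 1),
    (date(2018, 5, 13), 1),
    (date(2018, 5, 15), 1),
    (date(2018, 8, 13), 1),
]
print(window_totals(shipments))  # [2, 0, 2, 0, 1]
```

Note that 0513 opens its own window because it is 3 days after 0510, which is outside the 0510 to 0512 range.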

data = pd.DataFrame({'ID':['153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*'],
'Date':[20180801,20180828,20180829,20180904,20181226,20181227,20181228,20190110,20190115,20190116,20180510,20180511,20180513,20180515,20180813,20180822,20180824,20190103],
'Units':[1,1,1,1,2,1,1,1,2,1,1,1,1,1,1,1,1,1]})

Tags: python, pandas, cumulative-sum, rolling-computation

Solution


This works, but the result is in wide format (one column per ID):

import pandas as pd
import numpy as np
from dateutil.parser import parse
from datetime import timedelta

data = pd.DataFrame({'ID':['153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*'],
'Date':[20180801,20180828,20180829,20180904,20181226,20181227,20181228,20190110,20190115,20190116,20180510,20180511,20180513,20180515,20180813,20180822,20180824,20190103],
'Units':[1,1,1,1,2,1,1,1,2,1,1,1,1,1,1,1,1,1]})

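def keep_first(ser):
    # Keep the forward-rolling sum only on the first date of each 3-day window; zero out the rest.
    ixs = []  # dates that open a window
    ts = ser.dropna().index[0]  # start at the earliest date
    while ts <= ser.dropna().index.max():
        if ts in ser.dropna().index:
            ixs.append(ts)  # this date opens a window
            ts+=timedelta(3)  # jump past the 3 days it covers
        else:
            ts+=timedelta(1)  # no row for this date; try the next day
    return np.where(ser.index.isin(ixs), ser, 0)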

data['Date'] = data['Date'].map(lambda x: parse(str(x))) # parse dates

units = data.groupby(['ID', 'Date']).sum().unstack(0).resample('D').sum() # create resampled units df

units = units.sort_index(ascending=False).rolling(3, min_periods=1).sum().sort_index() # calculate forward-rolling sum

grouped_ix = data.groupby(['ID', 'Date']).sum().unstack(0).index # get indices for actual data

units.loc[grouped_ix].apply(keep_first) # get sums for actual data indices, keep only first
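
If a long-format result is preferred (one value per original row, like the "Expected output" column in the question), a per-ID loop is one possible alternative. This is a minimal sketch and not the approach used above: the helper `windowed_units`, its `window_days` parameter, and the `Windowed` column name are all hypothetical, and it assumes the DataFrame has a unique index. It walks each ID's rows in date order and opens a new window whenever a date is 3 or more days after the date that opened the current window.

```python
import pandas as pd

data = pd.DataFrame({
    'ID': ['153131151007'] * 10 + ['153131151011*'] * 8,
    'Date': [20180801, 20180828, 20180829, 20180904, 20181226, 20181227,
             20181228, 20190110, 20190115, 20190116, 20180510, 20180511,
             20180513, 20180515, 20180813, 20180822, 20180824, 20190103],
    'Units': [1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1],
})
data['Date'] = pd.to_datetime(data['Date'].astype(str), format='%Y%m%d')

def windowed_units(group, window_days=3):
    """Return a Series aligned to group's index: each window's total on
    its first row, 0 on the remaining rows of that window."""
    group = group.sort_values('Date')
    out = pd.Series(0, index=group.index)
    anchor_label, anchor_date = None, None
    for label, day, n in zip(group.index, group['Date'], group['Units']):
        # Start a new window if none is open or this date is outside it.
        if anchor_date is None or (day - anchor_date).days >= window_days:
            anchor_label, anchor_date = label, day
        out.loc[anchor_label] += n
    return out

# Process each ID separately, then align back to the original rows by index.
pieces = [windowed_units(g) for _, g in data.groupby('ID')]
data['Windowed'] = pd.concat(pieces)
print(data)
```

Because each ID is processed on its own, the dates of one ID cannot shift the window boundaries of another. The explicit Python loop trades speed for clarity, so it may be slow on very large frames.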
