How do I fill the gaps in a date column?

Problem description

I have a DataFrame df that contains daily sales figures for each product "id" and product quality "qual", for example:

date       | id | qual | col1 | col2 | col3
2020-10-01 | a  | B    | 1    | 2    | 3
2020-10-02 | a  | B    | 1    | 2    | 3
2020-10-05 | a  | B    | 13   | 8    | 7
2020-10-02 | a  | C    | 3    | 6    | 9
2020-10-05 | a  | C    | 26   | 16   | 14
2020-10-02 | b  | B    | 100  | 200  | 300
2020-10-04 | b  | B    | 49   | 19   | 9

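Product = column "id"

There are no rows for days with zero sales, so those dates are missing. I would like to add the missing days for each product "id".

The end date for every product should be either a date that I can set through a variable, such as

end_date = "2020-10-10"

or the latest date in the whole DataFrame (across all product "id" values), such as

end_date = "2020-10-05"

The result should look like this (for the variable end_date = "2020-10-10"):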

date       | id | qual | col1 | col2 | col3
2020-10-01 | a  | B    | 1    | 2    | 3
2020-10-02 | a  | B    | 1    | 2    | 3
2020-10-03 | a  | B    | 0    | 0    | 0
2020-10-04 | a  | B    | 0    | 0    | 0
2020-10-05 | a  | B    | 13   | 8    | 7
2020-10-06 | a  | B    | 0    | 0    | 0
2020-10-07 | a  | B    | 0    | 0    | 0
2020-10-08 | a  | B    | 0    | 0    | 0
2020-10-09 | a  | B    | 0    | 0    | 0
2020-10-10 | a  | B    | 0    | 0    | 0
2020-10-01 | a  | C    | 0    | 0    | 0
2020-10-02 | a  | C    | 3    | 6    | 9
2020-10-03 | a  | C    | 0    | 0    | 0
2020-10-04 | a  | C    | 0    | 0    | 0
2020-10-05 | a  | C    | 26   | 16   | 14
2020-10-06 | a  | C    | 0    | 0    | 0
2020-10-07 | a  | C    | 0    | 0    | 0
2020-10-08 | a  | C    | 0    | 0    | 0
2020-10-09 | a  | C    | 0    | 0    | 0
2020-10-10 | a  | C    | 0    | 0    | 0
2020-10-01 | b  | B    | 0    | 0    | 0
2020-10-02 | b  | B    | 100  | 200  | 300
2020-10-03 | b  | B    | 0    | 0    | 0
2020-10-04 | b  | B    | 49   | 19   | 9
2020-10-05 | b  | B    | 0    | 0    | 0
2020-10-06 | b  | B    | 0    | 0    | 0
2020-10-07 | b  | B    | 0    | 0    | 0
2020-10-08 | b  | B    | 0    | 0    | 0
2020-10-09 | b  | B    | 0    | 0    | 0
2020-10-10 | b  | B    | 0    | 0    | 0

Or like this (for the latest date in the date column, "2020-10-05"):

date       | id | qual | col1 | col2 | col3
2020-10-01 | a  | B    | 1    | 2    | 3
2020-10-02 | a  | B    | 1    | 2    | 3
2020-10-03 | a  | B    | 0    | 0    | 0
2020-10-04 | a  | B    | 0    | 0    | 0
2020-10-05 | a  | B    | 13   | 8    | 7
2020-10-01 | a  | C    | 0    | 0    | 0
2020-10-02 | a  | C    | 3    | 6    | 9
2020-10-03 | a  | C    | 0    | 0    | 0
2020-10-04 | a  | C    | 0    | 0    | 0
2020-10-05 | a  | C    | 26   | 16   | 14
2020-10-01 | b  | B    | 0    | 0    | 0
2020-10-02 | b  | B    | 100  | 200  | 300
2020-10-03 | b  | B    | 0    | 0    | 0
2020-10-04 | b  | B    | 49   | 19   | 9
2020-10-05 | b  | B    | 0    | 0    | 0

How can I do this in pandas for each variant?

Tags: python, pandas, date

Solution


Use DataFrame.reindex with date_range in a lambda function for each group, for example:

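import pandas as pd

# Parse the date strings so they can serve as a DatetimeIndex.
df['date'] = pd.to_datetime(df['date'])

end_date = "2020-10-10"
df = (df.set_index('date')
        .groupby(['id','qual'])
        # Reindex each group from its own first date up to end_date; new rows get 0.
        .apply(lambda x: x.reindex(pd.date_range(x.index.min(), end_date), fill_value=0))
        # The group keys are already in the index, so drop the zero-filled copies.
        # errors='ignore' covers pandas versions that exclude the grouping columns in apply.
        .drop(['id','qual'], axis=1, errors='ignore')
        .rename_axis(['id','qual','date'])
        .reset_index())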
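
Two details of the question are worth spelling out. The snippet above starts each group at its own first date (x.index.min()), while the expected tables in the question start every group at 2020-10-01, the earliest date overall. The second variant (end date = latest date in the data) only needs end_date = df['date'].max(). The following is a minimal sketch covering both points, not the answer's original method: fill_missing_days is a hypothetical helper name, the value columns are assumed to be col1 to col3, and a plain loop replaces groupby + apply to stay independent of how a given pandas version treats the grouping columns.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "date": ["2020-10-01", "2020-10-02", "2020-10-05", "2020-10-02",
             "2020-10-05", "2020-10-02", "2020-10-04"],
    "id":   ["a", "a", "a", "a", "a", "b", "b"],
    "qual": ["B", "B", "B", "C", "C", "B", "B"],
    "col1": [1, 1, 13, 3, 26, 100, 49],
    "col2": [2, 2, 8, 6, 16, 200, 19],
    "col3": [3, 3, 7, 9, 14, 300, 9],
})
df["date"] = pd.to_datetime(df["date"])


def fill_missing_days(frame, end_date=None):
    """Hypothetical helper: one row per day for every (id, qual) group.

    All groups start at the earliest date in the frame. If end_date is None,
    the latest date in the frame is used instead. Missing days are filled with 0.
    """
    value_cols = ["col1", "col2", "col3"]  # assumed value columns
    start = frame["date"].min()
    end = frame["date"].max() if end_date is None else pd.Timestamp(end_date)
    days = pd.date_range(start, end)

    pieces = []
    for (id_, qual), group in frame.groupby(["id", "qual"]):
        filled = (group.set_index("date")[value_cols]
                       .reindex(days, fill_value=0)
                       .rename_axis("date"))
        # Put the group keys back as ordinary columns.
        filled.insert(0, "qual", qual)
        filled.insert(0, "id", id_)
        pieces.append(filled.reset_index())
    return pd.concat(pieces, ignore_index=True)


out_fixed = fill_missing_days(df, end_date="2020-10-10")  # end date from a variable
out_latest = fill_missing_days(df)                        # end date = latest date in df
```

With the sample data, out_fixed has 30 rows (3 groups, 10 days each) and out_latest has 15 rows, matching the two expected tables in the question.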

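If you get the following error, the data needs to be preprocessed before the groupby + apply solution:

"ValueError: cannot reindex from a duplicate axis"

This means that some id (within the same qual) has duplicate dates.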

One possible solution is to remove the duplicates first:

df = df.drop_duplicates(['date','qual','id'])

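or to aggregate them, for example by sum:

# as_index=False keeps date, qual and id as columns for the later set_index('date')
df = df.groupby(['date','qual','id'], as_index=False).sum()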
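
To check whether this preprocessing is needed at all, and to see how the two options differ, here is a small sketch. The frame dup is made-up illustration data, not taken from the question.

```python
import pandas as pd

# Made-up data with one duplicated (date, qual, id) combination.
dup = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-01", "2020-10-01", "2020-10-03"]),
    "id":   ["a", "a", "a"],
    "qual": ["B", "B", "B"],
    "col1": [1, 4, 2],
})

keys = ["date", "qual", "id"]

# True if any key combination occurs more than once.
has_duplicates = bool(dup.duplicated(keys).any())

# Option 1: keep only the first row of each combination.
deduped = dup.drop_duplicates(keys)

# Option 2: add duplicated rows together, keeping the keys as regular columns.
summed = dup.groupby(keys, as_index=False).sum()
```

Dropping duplicates discards the second sale on 2020-10-01, while summing combines both, so the choice depends on whether duplicated rows are repeated records or separate sales.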
