How to fill my pandas dataframe with non-null values from previous days

Problem description

I want to fill the null values on the current date with the value recorded at the same hour on a previous day.

Consider my dataframe as follows:

    creTimestamp         CPULoad  instnceId
0   2021-01-22 18:48:00  22.0     instanceA
1   2021-01-23 20:25:00  23.0     instanceA
2   2021-01-22 18:42:00  22.0     instanceA
3   2021-01-22 15:24:00  23.0     instanceB
4   2021-01-24 20:25:00  NaN      instanceA
5   2021-01-22 08:53:00  22.0     instanceA
6   2021-01-23 19:43:00  23.0     instanceB
7   2021-01-23 15:24:00  NaN      instanceA
8   2021-01-24 18:48:00  NaN      instanceA
9   2021-01-24 01:51:00  NaN      instanceB
10  2021-01-24 15:24:00  NaN      instanceA

Since 2021-01-24 15:24:00 is NaN, it should look at 2021-01-23 15:24:00; that is also null, so it should go back one more day to 2021-01-22 15:24:00 and use that value to fill both the 23rd and the 24th.

The resulting dataframe should look like this:

    creTimestamp         CPULoad  instnceId
0   2021-01-22 18:48:00  22.0     instanceA
1   2021-01-23 20:25:00  23.0     instanceA
2   2021-01-22 18:42:00  22.0     instanceA
3   2021-01-22 15:24:00  23.0     instanceB
4   2021-01-24 20:25:00  23.0     instanceA
5   2021-01-22 08:53:00  22.0     instanceA
6   2021-01-23 19:43:00  23.0     instanceB
7   2021-01-23 15:24:00  23.0     instanceA
8   2021-01-24 18:48:00  22.0     instanceA
9   2021-01-24 01:51:00  NaN      instanceB
10  2021-01-24 15:24:00  23.0     instanceA

For timestamps with no earlier match, the null value should be left as it is.

Please note: I cannot use ffill() or linear interpolation, because that breaks my plots by picking values arbitrarily. I need the value from the same hour and minute on a previous day, and I need to look back at most 7 days for it.

Please help; I have been stuck on this for a long time.

Thanks

Tags: python-3.x, pandas, dataframe

Solution


The approach here is to join the DF back onto itself to pick up the earlier values. Two such examples are provided:

  1. literally the previous day's value at the same timestamp
  2. the same time of day where the value is not NaN

The working columns have been left in place for transparency.

import io
import pandas as pd
df = pd.read_csv(io.StringIO("""    creTimestamp    CPULoad instnceId
0   2021-01-22 18:48:00 22.0    instanceA
1   2021-01-23 20:25:00 23.0    instanceA
2   2021-01-22 18:42:00 22.0    instanceA
3   2021-01-22 15:24:00 23.0    instanceB
4   2021-01-24 20:25:00 NaN instanceA
5   2021-01-22 08:53:00 22.0    instanceA
6   2021-01-23 19:43:00 23.0    instanceB
7   2021-01-23 15:24:00 NaN instanceA
8   2021-01-24 18:48:00 NaN instanceA
9   2021-01-24 01:51:00 NaN instanceB
10  2021-01-24 15:24:00 NaN instanceA
"""), sep="\t", index_col=0)

df.creTimestamp = pd.to_datetime(df.creTimestamp)
# literally take previous day value
df2 = (df
 .assign(yesterday=lambda dfa: dfa.creTimestamp-pd.Timedelta(days=1))
 .merge(df.rename(columns={"creTimestamp":"yesterday"}).loc[:,["yesterday","CPULoad"]]
        , on="yesterday", suffixes=("", "_pre"), how="left")
 .assign(CPULoad=lambda dfa: dfa.CPULoad.fillna(dfa.CPULoad_pre))
)
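
The block above only looks back one day, while the question asks to consider up to 7 days at the same hour and minute. One way to extend the previous-day merge is to repeat it for offsets of 1 to 7 days, letting nearer days take precedence. This is a sketch, not part of the original answer; filled and lookup are illustrative names, and it reuses df from above:

# sketch: look back up to 7 days at the same hour/minute
# nearer days are tried first, so they take precedence over older values
filled = df.copy()
for days_back in range(1, 8):
    # shift earlier non-null readings forward by `days_back` days so they
    # line up with the timestamps they should fill
    lookup = (df.dropna(subset=["CPULoad"])
                .assign(creTimestamp=lambda dfa: dfa.creTimestamp + pd.Timedelta(days=days_back))
                .loc[:, ["creTimestamp", "CPULoad"]]
                .drop_duplicates(subset=["creTimestamp"], keep="last"))
    filled = (filled
              .merge(lookup, on="creTimestamp", suffixes=("", "_prev"), how="left")
              .assign(CPULoad=lambda dfa: dfa.CPULoad.fillna(dfa.CPULoad_prev))
              .drop(columns=["CPULoad_prev"]))

On the sample data this reproduces the expected output from the question; for the full dataframe you may also want instnceId in the join key, as in the update further below.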

# take timestamp forward,  beware if DF has multiple values for same timestamp
df2 = (df
 .assign(timestamp=lambda dfa: dfa.creTimestamp.dt.time)
 .merge(df.assign(timestamp=lambda dfa: dfa.creTimestamp.dt.time)
        .loc[:,["timestamp","CPULoad"]]
        .dropna()
        , on="timestamp", suffixes=("", "_pre"), how="left")
 .assign(CPULoad=lambda dfa: dfa.CPULoad.fillna(dfa.CPULoad_pre))
)

Output

       creTimestamp  CPULoad instnceId timestamp  CPULoad_pre
2021-01-22 18:48:00     22.0 instanceA  18:48:00         22.0
2021-01-23 20:25:00     23.0 instanceA  20:25:00         23.0
2021-01-22 18:42:00     22.0 instanceA  18:42:00         22.0
2021-01-22 15:24:00     23.0 instanceB  15:24:00         23.0
2021-01-24 20:25:00     23.0 instanceA  20:25:00         23.0
2021-01-22 08:53:00     22.0 instanceA  08:53:00         22.0
2021-01-23 19:43:00     23.0 instanceB  19:43:00         23.0
2021-01-23 15:24:00     23.0 instanceA  15:24:00         23.0
2021-01-24 18:48:00     22.0 instanceA  18:48:00         22.0
2021-01-24 01:51:00      NaN instanceB  01:51:00          NaN
2021-01-24 15:24:00     23.0 instanceA  15:24:00         23.0
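
Any row still showing NaN after this merge simply has no other reading at the same time of day (here, 01:51:00 for instanceB). A quick way to list the rows that were left unfilled, assuming df2 from the block above:

# rows the merge could not fill - no matching time of day with a non-null value
print(df2[df2.CPULoad.isna()])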

Update

  • In the full dataframe (not this sample) there can be multiple rows with the same timestamp but different values
  • drop_duplicates() is used to make the timestamps unique, so merge() returns the same number of rows as the original DF
  • This means a NaN is filled with the last observed value for that timestamp
  • instnceId has been added as an extra join key
# take timestamp forward,  beware if DF has multiple values for same timestamp
# taking last observed value to prevent merge generating duplicates
# also include instnceId in join key...
df2 = (df
 .assign(timestamp=lambda dfa: dfa.creTimestamp.dt.time)
 .merge(df.assign(timestamp=lambda dfa: dfa.creTimestamp.dt.time)
        .loc[:,["instnceId", "timestamp","CPULoad"]]
        .dropna()
        .drop_duplicates(subset=["instnceId","timestamp"], keep="last")
        , on=["instnceId","timestamp"], suffixes=("", "_pre"), how="left")
 .assign(CPULoad=lambda dfa: dfa.CPULoad.fillna(dfa.CPULoad_pre))
 .drop(columns=["timestamp","CPULoad_pre"])
)
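
Because duplicate join keys in a merge multiply rows, a quick sanity check that the de-duplicated lookup kept the row count unchanged can be worth adding (a suggestion, not part of the original answer):

# the de-duplicated lookup should leave the row count unchanged
assert len(df2) == len(df), "merge generated duplicate rows - check the join keys"
print(df2)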
