Find the longest stretch of null values and generate a flag

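Problem description

I have a DataFrame with a datetime column and two other columns. For a particular date, I have to find the longest stretch of null values in column 'X' and replace it with zero in both columns for that date. In addition, I have to create a third column named 'flag' whose value is 1 for every row where zeros were imputed in the other two columns, and 0 otherwise. In the example below, on 1 January the longest stretch of nulls is 3 rows long, so I have to replace it with zeros. Likewise, I have to repeat the process for 2 January.

Below is my sample data: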

Datetime            X    Y
01-01-2018 00:00    1   1
01-01-2018 00:05    nan 2
01-01-2018 00:10    2   nan
01-01-2018 00:15    3   4
01-01-2018 00:20    2   2
01-01-2018 00:25    nan 1
01-01-2018 00:30    nan nan
01-01-2018 00:35    nan nan
01-01-2018 00:40    4   4
02-01-2018 00:00    nan nan
02-01-2018 00:05    2   3
02-01-2018 00:10    2   2
02-01-2018 00:15    2   5
02-01-2018 00:20    2   2
02-01-2018 00:25    nan nan
02-01-2018 00:30    nan 1
02-01-2018 00:35    3   nan
02-01-2018 00:40    nan nan
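
For anyone who wants to reproduce this, the sample above can be loaded roughly as follows. This is only a sketch: the variable name raw and the month-first date format are assumptions, chosen so that 02-01-2018 becomes 2018-02-01 as in the answer's printed output.

```python
import io

import pandas as pd

# Sample data from the question, rewritten as CSV (the name `raw` is hypothetical).
raw = """Datetime,X,Y
01-01-2018 00:00,1,1
01-01-2018 00:05,nan,2
01-01-2018 00:10,2,nan
01-01-2018 00:15,3,4
01-01-2018 00:20,2,2
01-01-2018 00:25,nan,1
01-01-2018 00:30,nan,nan
01-01-2018 00:35,nan,nan
01-01-2018 00:40,4,4
02-01-2018 00:00,nan,nan
02-01-2018 00:05,2,3
02-01-2018 00:10,2,2
02-01-2018 00:15,2,5
02-01-2018 00:20,2,2
02-01-2018 00:25,nan,nan
02-01-2018 00:30,nan,1
02-01-2018 00:35,3,nan
02-01-2018 00:40,nan,nan
"""

# "nan" is read as a missing value by read_csv's default NA handling.
df = pd.read_csv(io.StringIO(raw))
# Assumption: dates are month-first, matching the answer's output.
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%m-%d-%Y %H:%M")
df = df.set_index("Datetime")
```

The answer's code relies on df having a DatetimeIndex, which is why the Datetime column is moved into the index here.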

Below is the result I expect:

Datetime           X    Y   Flag
01-01-2018 00:00    1   1   0
01-01-2018 00:05    nan 2   0
01-01-2018 00:10    2   nan 0
01-01-2018 00:15    3   4   0
01-01-2018 00:20    2   2   0
01-01-2018 00:25    0   0   1
01-01-2018 00:30    0   0   1
01-01-2018 00:35    0   0   1
01-01-2018 00:40    4   4   0
02-01-2018 00:00    nan nan 0
02-01-2018 00:05    2   3   0
02-01-2018 00:10    2   2   0
02-01-2018 00:15    2   5   0
02-01-2018 00:20    2   2   0
02-01-2018 00:25    nan nan 0
02-01-2018 00:30    nan 1   0
02-01-2018 00:35    3   nan 0
02-01-2018 00:40    nan nan 0

This question is an extension of a previous question. Here is the link: Python - Find maximum null values in stretch and replace with 0

Tags: python, pandas, missing-data

Solution


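First, label the consecutive groups in each column with unique values, restarting the comparison at each date and keeping labels only where the value is NaN. Multiplying the Y labels by len(df2) keeps them from colliding with the X labels: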

df1 = df.isna()
df2 = df1.ne(df1.groupby(df1.index.date).shift()).cumsum().where(df1)
df2['Y'] *= len(df2)
print (df2)
                        X      Y
Datetime                        
2018-01-01 00:00:00   NaN    NaN
2018-01-01 00:05:00   2.0    NaN
2018-01-01 00:10:00   NaN   36.0
2018-01-01 00:15:00   NaN    NaN
2018-01-01 00:20:00   NaN    NaN
2018-01-01 00:25:00   4.0    NaN
2018-01-01 00:30:00   4.0   72.0
2018-01-01 00:35:00   4.0   72.0
2018-01-01 00:40:00   NaN    NaN
2018-02-01 00:00:00   6.0  108.0
2018-02-01 00:05:00   NaN    NaN
2018-02-01 00:10:00   NaN    NaN
2018-02-01 00:15:00   NaN    NaN
2018-02-01 00:20:00   NaN    NaN
2018-02-01 00:25:00   8.0  144.0
2018-02-01 00:30:00   8.0    NaN
2018-02-01 00:35:00   NaN  180.0
2018-02-01 00:40:00  10.0  180.0
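
The labelling step may be easier to follow on a single small Series without the per-date grouping. The following is a minimal sketch with made-up values; the names s, is_null, run_id and null_runs are illustrative and not part of the answer.

```python
import pandas as pd

# Toy series: a null run of length 2, then a null run of length 1.
s = pd.Series([1.0, None, None, 3.0, None])

is_null = s.isna()
# A new run starts wherever the null state differs from the previous row,
# so the cumulative sum of those change points gives every run its own id.
run_id = is_null.ne(is_null.shift()).cumsum()
# Keep the ids only on null rows; non-null rows become NaN.
null_runs = run_id.where(is_null)

print(null_runs)
```

Rows belonging to the same stretch of nulls end up sharing one id, which is what makes counting them possible in the next step.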

Then take the group with the largest number of rows, here group 4:

a = df2.stack().value_counts().index[0]
print (a)
4.0
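
To see why index[0] picks the longest stretch, it can help to look at the counts themselves. This is a small sketch on a hand-made label frame (the values are illustrative, not taken from the data above); value_counts sorts by frequency in descending order, so the first index entry is the most frequent label. If two stretches tie for the longest, only one of them is picked.

```python
import numpy as np
import pandas as pd

# Hand-made run labels: label 4.0 covers three rows, 24.0 covers two.
labels = pd.DataFrame({
    "X": [np.nan, 2.0, np.nan, 4.0, 4.0, 4.0],
    "Y": [np.nan, np.nan, 12.0, np.nan, 24.0, 24.0],
})

# Flatten both columns into one Series and count how often each label occurs.
# value_counts ignores NaN by default.
counts = labels.stack().value_counts()
longest = counts.index[0]

print(counts)
print(longest)
```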

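Build a mask of the matching rows and set those rows to 0. For the Flag column, convert the mask to integers so that True/False map to 1/0: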

mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)

print (df)
                       X    Y  Flag
Datetime                           
2018-01-01 00:00:00  1.0  1.0     0
2018-01-01 00:05:00  NaN  2.0     0
2018-01-01 00:10:00  2.0  NaN     0
2018-01-01 00:15:00  3.0  4.0     0
2018-01-01 00:20:00  2.0  2.0     0
2018-01-01 00:25:00  0.0  0.0     1
2018-01-01 00:30:00  0.0  0.0     1
2018-01-01 00:35:00  0.0  0.0     1
2018-01-01 00:40:00  4.0  4.0     0
2018-02-01 00:00:00  NaN  NaN     0
2018-02-01 00:05:00  2.0  3.0     0
2018-02-01 00:10:00  2.0  2.0     0
2018-02-01 00:15:00  2.0  5.0     0
2018-02-01 00:20:00  2.0  2.0     0
2018-02-01 00:25:00  NaN  NaN     0
2018-02-01 00:30:00  NaN  1.0     0
2018-02-01 00:35:00  3.0  NaN     0
2018-02-01 00:40:00  NaN  NaN     0
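
Note that the code above picks a single longest stretch across the whole frame and looks at both columns. If the longest stretch in column X is wanted separately for every date, one possible variation is sketched below. It is not part of the answer; the toy data and the names day, run_len and longest_per_day are assumptions, and when two stretches tie within a day both are flagged.

```python
import numpy as np
import pandas as pd

# Toy data: two days, four rows each (values are illustrative only).
idx = pd.to_datetime([
    "2018-01-01 00:00", "2018-01-01 00:05", "2018-01-01 00:10", "2018-01-01 00:15",
    "2018-01-02 00:00", "2018-01-02 00:05", "2018-01-02 00:10", "2018-01-02 00:15",
])
df = pd.DataFrame({
    "X": [np.nan, 1, np.nan, np.nan, np.nan, np.nan, 5, np.nan],
    "Y": [1, np.nan, 2, np.nan, np.nan, 3, 4, np.nan],
}, index=idx)

day = df.index.floor("D")
is_null = df["X"].isna()

# Label null runs in X, restarting the comparison at every new day.
run_id = is_null.ne(is_null.groupby(day).shift()).cumsum().where(is_null)
# Length of the run each row belongs to (NaN for non-null rows).
run_len = run_id.map(run_id.value_counts())
# Longest run length within each day, broadcast back to the rows.
longest_per_day = run_len.groupby(day).transform("max")

mask = is_null & run_len.eq(longest_per_day)
df.loc[mask, ["X", "Y"]] = 0
df["Flag"] = mask.astype(int)

print(df)
```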

Edit:

Added a new condition so that only dates from a given list are considered:

dates = df.index.floor('d')

filtered = ['2018-01-01','2019-01-01']
m = dates.isin(filtered)
df1 = df.isna() & m[:, None]

df2 = df1.ne(df1.groupby(dates).shift()).cumsum().where(df1)
df2['Y'] *= len(df2)

print (df2)
                       X     Y
Datetime                      
2018-01-01 00:00:00  NaN   NaN
2018-01-01 00:05:00  2.0   NaN
2018-01-01 00:10:00  NaN  36.0
2018-01-01 00:15:00  NaN   NaN
2018-01-01 00:20:00  NaN   NaN
2018-01-01 00:25:00  4.0   NaN
2018-01-01 00:30:00  4.0  72.0
2018-01-01 00:35:00  4.0  72.0
2018-01-01 00:40:00  NaN   NaN
2018-02-01 00:00:00  NaN   NaN
2018-02-01 00:05:00  NaN   NaN
2018-02-01 00:10:00  NaN   NaN
2018-02-01 00:15:00  NaN   NaN
2018-02-01 00:20:00  NaN   NaN
2018-02-01 00:25:00  NaN   NaN
2018-02-01 00:30:00  NaN   NaN
2018-02-01 00:35:00  NaN   NaN
2018-02-01 00:40:00  NaN   NaN

a = df2.stack().value_counts().index[0]
# alternative that also works when the filtered rows contain no NaNs (avoids IndexError: index 0 is out of bounds)
#a = next(iter(df2.stack().value_counts().index), -1)

mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)

print (df)
                       X    Y  Flag
Datetime                           
2018-01-01 00:00:00  1.0  1.0     0
2018-01-01 00:05:00  NaN  2.0     0
2018-01-01 00:10:00  2.0  NaN     0
2018-01-01 00:15:00  3.0  4.0     0
2018-01-01 00:20:00  2.0  2.0     0
2018-01-01 00:25:00  0.0  0.0     1
2018-01-01 00:30:00  0.0  0.0     1
2018-01-01 00:35:00  0.0  0.0     1
2018-01-01 00:40:00  4.0  4.0     0
2018-02-01 00:00:00  NaN  NaN     0
2018-02-01 00:05:00  2.0  3.0     0
2018-02-01 00:10:00  2.0  2.0     0
2018-02-01 00:15:00  2.0  5.0     0
2018-02-01 00:20:00  2.0  2.0     0
2018-02-01 00:25:00  NaN  NaN     0
2018-02-01 00:30:00  NaN  1.0     0
2018-02-01 00:35:00  3.0  NaN     0
2018-02-01 00:40:00  NaN  NaN     0
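
The commented-out fallback with next(iter(...), -1) can be checked on a label frame that contains no labels at all, which is what happens when none of the filtered dates has a NaN. A minimal sketch, with an all-NaN frame standing in for df2:

```python
import numpy as np
import pandas as pd

# Stand-in for df2 when the filtered rows contain no NaNs: no run labels at all.
labels = pd.DataFrame({"X": [np.nan, np.nan], "Y": [np.nan, np.nan]})

counts = labels.stack().value_counts()
# counts is empty, so counts.index[0] would raise IndexError;
# next() with a default returns -1 instead, a value no label can take.
a = next(iter(counts.index), -1)

mask = labels.eq(a).any(axis=1)
print(a, mask.tolist())
```

Because run labels are always positive, -1 never matches, so the mask is all False and no rows are overwritten.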
