Generating a list of missing time intervals

Problem description

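I have a pandas Series in which some data is missing, represented by NaNs, and I would like a rough idea of how long the data is typically missing for and how many times this happens.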

An example:

10:01    1.23
10:02    2.23
10:03    nan
10:04    nan
10:05    nan
10:06    6.23
10:07    nan
10:08    nan
10:09    9.23

The expected output would then be a list like this:

missing = [[10:03,10:05], [10:07,10:08]]
N_missing = 2

Tags: python, pandas

Solution


Use:

# move the time index into a regular column: A holds the times, B the values
df = df.reset_index()
df.columns = ['A','B']

# boolean mask of the non-missing values, kept in a variable for reuse
m = df['B'].notnull()
# the cumulative sum of the mask stays constant across consecutive NaNs,
# so every run of NaNs gets its own group id
df.index = m.cumsum()

# keep only the NaN rows, take the first and last time of each group, convert to a nested list
missing = df[~m.values].groupby(level=0)['A'].agg(['first','last']).values.tolist()
print(missing)
[['10:03', '10:05'], ['10:07', '10:08']]

# the number of missing intervals is the length of the nested list
N_missing = len(missing)
print(N_missing)
2

Details

print(df[~m.values])
       A   B
B           
2  10:03 NaN
2  10:04 NaN
2  10:05 NaN
3  10:07 NaN
3  10:08 NaN
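
To make the grouping step concrete, here is a minimal self-contained sketch that rebuilds the example data from the question and prints the group ids assigned to the NaN rows. The names `s`, `group_id` and `labels` are chosen for illustration only and do not appear in the answer above.

```python
import numpy as np
import pandas as pd

# Example data from the question, with the times as string index labels.
s = pd.Series(
    [1.23, 2.23, np.nan, np.nan, np.nan, 6.23, np.nan, np.nan, 9.23],
    index=["10:01", "10:02", "10:03", "10:04", "10:05",
           "10:06", "10:07", "10:08", "10:09"],
)

m = s.notnull()          # True where a value is present
group_id = m.cumsum()    # increases only on non-missing rows
labels = group_id[~m]    # group ids of the NaN rows only

print(labels.tolist())   # [2, 2, 2, 3, 3]
```

Because the cumulative sum only increases on non-missing rows, all NaNs in one consecutive run share the same id, which is what makes the later groupby work.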

A similar solution for a Series:

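# boolean mask of the non-missing values and its cumulative sum (one id per run of NaNs)
m = s.notnull()
cum = m.cumsum()
# take the index labels of the NaN rows, group them by run id, keep the first and last label
missing = s[~m.values].index.to_series().groupby(cum).agg(['first','last']).values.tolist()
print(missing)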
[['10:03', '10:05'], ['10:07', '10:08']]

N_missing = len(missing)
print(N_missing)
2
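
The question also asks how long the data is typically missing. One possible extension, sketched here under the assumption that the Series is regularly sampled so that counting rows is a fair measure of gap length, is to aggregate a `count` alongside `first` and `last`. The helper name `missing_runs` is hypothetical and not part of the answer above.

```python
import numpy as np
import pandas as pd


def missing_runs(s: pd.Series) -> pd.DataFrame:
    """Return one row per run of consecutive NaNs: first label, last label, row count."""
    m = s.notna()
    group_id = m.cumsum()                          # constant within a run of NaNs
    nan_mask = ~m.to_numpy()
    labels = s.index.to_series()[nan_mask]         # index labels of the NaN rows
    runs = labels.groupby(group_id[nan_mask]).agg(["first", "last", "count"])
    return runs.reset_index(drop=True)


# Example data from the question.
s = pd.Series(
    [1.23, 2.23, np.nan, np.nan, np.nan, 6.23, np.nan, np.nan, 9.23],
    index=["10:01", "10:02", "10:03", "10:04", "10:05",
           "10:06", "10:07", "10:08", "10:09"],
)

runs = missing_runs(s)
missing = runs[["first", "last"]].values.tolist()
N_missing = len(missing)
gap_lengths = runs["count"].tolist()

print(missing)       # [['10:03', '10:05'], ['10:07', '10:08']]
print(N_missing)     # 2
print(gap_lengths)   # [3, 2]
```

With the lengths in a column, summary statistics such as `runs["count"].mean()` or `runs["count"].max()` give a rough picture of how long gaps usually last.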
