Filling missing values based on the values in preceding and following rows

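Problem description

I have a dataset like the one below: for each website, I have monthly view counts over two years (2001-2002). However, because of how the data was collected, a website only has a record for a given month if it had more than 0 views. So I am trying to fill in the view counts for the months where that was not the case, i.e. months when the website was online but had no views.

Unfortunately, I have no information on when each website was first published, so I assume it was launched in the month where its first non-null value appears. I also assume that if there is a run of consecutive np.nan months at the end of 2002, the website had been taken down.

So, at the moment, Views is np.nan both for months with zero views and for months when the website was not online at all.

I want to make sure that months with zero views have 0 in the Views column, so that a dataframe like the one below,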

Website,Month,Year,Views
1,January,2001,
1,February,2001,
1,March,2001,3.0
1,April,2001,4.0
1,May,2001,23.0
1,June,2001,
1,July,2001,5.0
1,August,2001,4.0
1,September,2001,3.0
1,October,2001,3.0
1,November,2001,3.0
1,December,2001,35.0
1,January,2002,6.0
1,February,2002,
1,March,2002,3.0
1,April,2002,
1,May,2002,
1,June,2002,3.0
1,July,2002,3.0
1,August,2002,2.0
1,September,2002,
1,October,2002,
1,November,2002,
1,December,2002,
2,January,2001,3.0
2,February,2001,1.0
2,March,2001,2.0
2,April,2001,2.0
2,May,2001,22.0
2,June,2001,
2,July,2001,4.0
2,August,2001,3.0
2,September,2001,3.0
2,October,2001,4.0
2,November,2001,
2,December,2001,1.0
2,January,2002,
2,February,2002,4.0
2,March,2002,2.0
2,April,2002,5.0
2,May,2002,2.0
2,June,2002,
2,July,2002,2.0
2,August,2002,3.0
2,September,2002,
2,October,2002,
2,November,2002,2.0
2,December,2002,5.0

looks like this:

Website,Month,Year,Views
1,January,2001,
1,February,2001,
1,March,2001,3.0
1,April,2001,4.0
1,May,2001,23.0
1,June,2001,0.0
1,July,2001,5.0
1,August,2001,4.0
1,September,2001,3.0
1,October,2001,3.0
1,November,2001,3.0
1,December,2001,35.0
1,January,2002,6.0
1,February,2002,0.0
1,March,2002,3.0
1,April,2002,0.0
1,May,2002,0.0
1,June,2002,3.0
1,July,2002,3.0
1,August,2002,2.0
1,September,2002,
1,October,2002,
1,November,2002,
1,December,2002,
2,January,2001,3.0
2,February,2001,1.0
2,March,2001,2.0
2,April,2001,2.0
2,May,2001,22.0
2,June,2001,0.0
2,July,2001,4.0
2,August,2001,3.0
2,September,2001,3.0
2,October,2001,4.0
2,November,2001,0.0
2,December,2001,1.0
2,January,2002,0.0
2,February,2002,4.0
2,March,2002,2.0
2,April,2002,5.0
2,May,2002,2.0
2,June,2002,0.0
2,July,2002,2.0
2,August,2002,3.0
2,September,2002,0.0
2,October,2002,0.0
2,November,2002,2.0
2,December,2002,5.0

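In other words, if all previous months for a website are np.nan and the current value is np.nan, it should stay that way. Likewise, if all following months are np.nan, the value should stay np.nan. But if at least one earlier month and at least one later month is not np.nan, the value should be changed to 0.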
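
To make the rule concrete, here is a minimal sketch on a single site's series. The values are made up for illustration, and the ffill/bfill test is just one possible way to express "a real value exists before and after"; it is not necessarily the fastest option.

```python
import numpy as np
import pandas as pd

# Hypothetical views for one site, in chronological order.
views = pd.Series([np.nan, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan])

# A gap is "interior" when a real value exists somewhere before it
# (forward fill finds one) and somewhere after it (backward fill finds one).
interior = views.isna() & views.ffill().notna() & views.bfill().notna()

# Replace only the interior gaps with 0; leading and trailing NaN stay NaN.
filled = views.mask(interior, 0.0)
```

Only the np.nan between 3.0 and 5.0 becomes 0 here. With several sites in one frame, the fills would have to be done per site (for example via groupby), so that values from one site do not leak into the next.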

The tricky part is that my dataset has roughly 4,000,000 rows, so I need a reasonably efficient way to do this.

Does anyone have any suggestions?

Tags: python, pandas, large-data

Solution


Here is my approach:

import numpy as np

# s counts the non-null views seen so far within each website
s = df['Views'].notnull().groupby(df['Website']).cumsum()

# fill the nulls only where s > 0, i.e. after the first non-null value
df['Views'] = np.where(df['Views'].isna() & s.gt(0), 0, df['Views'])

# equivalent
# df.loc[df['Views'].isna() & s.gt(0), 'Views'] = 0
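
Note that the condition s > 0 only looks backward, so it would also zero out a trailing run of np.nan, which the question wants to keep. Below is a sketch of a two-sided variant that adds the same running count computed from the end of each site. The helper name fill_interior_gaps and the toy frame are made up for illustration, and the sketch assumes rows are already in chronological order within each website and that the index is unique.

```python
import numpy as np
import pandas as pd

def fill_interior_gaps(df, group_col="Website", value_col="Views"):
    """Set NaN to 0 only between a group's first and last non-null value.

    Assumes rows are in chronological order within each group and that
    the index is unique.
    """
    observed = df[value_col].notna()
    counts = observed.astype("int64")
    # Non-null values seen so far, scanning forward within each site.
    seen_before = counts.groupby(df[group_col]).cumsum()
    # Same count scanning backward: reverse the rows, cumsum, restore order.
    seen_after = counts[::-1].groupby(df[group_col][::-1]).cumsum()[::-1]
    interior = ~observed & seen_before.gt(0) & seen_after.gt(0)
    out = df.copy()
    out.loc[interior, value_col] = 0
    return out

# Hypothetical toy data: two sites, already in chronological order.
toy = pd.DataFrame({
    "Website": [1, 1, 1, 1, 1, 2, 2, 2],
    "Views": [np.nan, 3.0, np.nan, 4.0, np.nan, 1.0, np.nan, 2.0],
})
result = fill_interior_gaps(toy)
```

Both counts are vectorized groupby operations, so this should stay in the same cost range as the one-sided version, though I have not timed it on 4,000,000 rows.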
