Efficiently removing rows from a Pandas DataFrame with groupby and a time condition?

Question

I have a DataFrame with tens of millions of rows:

| userId | pageId | bannerId | timestap            |
|--------+--------+----------+---------------------|
| A      | P1     | B1       | 2020-10-10 01:00:00 |
| A      | P1     | B1       | 2020-10-10 01:00:10 |
| B      | P1     | B1       | 2020-10-10 01:00:00 |
| B      | P2     | B2       | 2020-10-10 02:00:00 |

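What I want to do is remove every row that has the same `userId, pageId, bannerId` combination as an earlier row and whose timestamp falls within `n` minutes of the previous occurrence of that same `userId, pageId, bannerId` combination.

What I'm doing now: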

# Get every `userId, pageId, bannerId` combination (with its row count).
# Not all of them will have repeated within the `n`-minute
# threshold I'm interested in.
groups = df.groupby(['userId', 'pageId', 'bannerId']).userId.count()

# Iterate through each group and manually check whether the repetition was
# within `n` minutes. Keep track of all row indices to be removed.
to_remove = []
for user_id, page_id, banner_id in groups.index:
    sub = df.loc[
        (df.userId == user_id) &
        (df.pageId == page_id) &
        (df.bannerId == banner_id)
    ].sort_values('timestamp')

    # Now that each occurrence is listed chronologically,
    # check the time difference from the previous occurrence.
    sub = sub.loc[
        ((sub.timestamp.shift(1) - sub.timestamp) / pd.Timedelta(minutes=1)).abs() <= n
    ]

    if sub.shape[0] > 0:
        to_remove += sub.index.tolist()

This does what I want. The only problem is that, with the amount of data I have, it takes hours to complete.

Tags: pandas, dataframe, pandas-groupby

Answer

To get a more instructive result, I used a longer source DataFrame:

  userId pageId bannerId            timestap
0      A     P1       B1 2020-10-10 01:00:00
1      A     P1       B1 2020-10-10 01:04:10
2      A     P1       B1 2020-10-10 01:05:00
3      A     P1       B1 2020-10-10 01:08:20
4      A     P1       B1 2020-10-10 01:09:30
5      A     P1       B1 2020-10-10 01:11:00
6      B     P1       B1 2020-10-10 01:00:00
7      B     P2       B2 2020-10-10 02:00:00

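Note: the `timestap` column is of datetime type.

Start by defining a "filtering" function for a group of timestamp values (for one particular combination of `userId`, `pageId` and `bannerId`):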
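The answer does not show how the sample was built. The following is a minimal sketch of one way to construct it, assuming the values listed above. `pd.to_datetime` is used so that the `timestap` column has a datetime dtype rather than holding plain strings.

```python
import pandas as pd

# Sample data copied from the listing above; the column name `timestap`
# is kept exactly as it is spelled there.
df = pd.DataFrame({
    'userId':   ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'pageId':   ['P1', 'P1', 'P1', 'P1', 'P1', 'P1', 'P1', 'P2'],
    'bannerId': ['B1', 'B1', 'B1', 'B1', 'B1', 'B1', 'B1', 'B2'],
    'timestap': pd.to_datetime([
        '2020-10-10 01:00:00', '2020-10-10 01:04:10',
        '2020-10-10 01:05:00', '2020-10-10 01:08:20',
        '2020-10-10 01:09:30', '2020-10-10 01:11:00',
        '2020-10-10 01:00:00', '2020-10-10 02:00:00',
    ]),
})

print(df.dtypes)
```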

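import numpy as np
import pandas as pd

def myFilter(grp, nMin):
    # Timestamp of the most recently kept row; NaN until the first row is kept.
    prevTs = np.nan
    grp = grp.sort_values()
    res = []
    for ts in grp:
        # Keep a row only if it is at least nMin minutes after the last kept row.
        if pd.isna(prevTs) or (ts - prevTs) / pd.Timedelta(1, 'm') >= nMin:
            prevTs = ts
            res.append(ts)
    return res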

Then set the time threshold (in minutes):

nMin = 5

The last step is to generate the result:

result = df.groupby(['userId', 'pageId', 'bannerId'])\
    .timestap.apply(myFilter, nMin).explode().reset_index()

For my data sample, the result is:

  userId pageId bannerId            timestap
0      A     P1       B1 2020-10-10 01:00:00
1      A     P1       B1 2020-10-10 01:05:00
2      A     P1       B1 2020-10-10 01:11:00
3      B     P1       B1 2020-10-10 01:00:00
4      B     P2       B2 2020-10-10 02:00:00
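This result is rebuilt from the group keys and the kept timestamps, so the original row labels (and any columns other than these four) are not carried over. If the original rows are needed, as in the question's `to_remove` list, one possible variation is to collect the index labels of the rows to keep instead. The following is only a sketch under that assumption. The names `keys`, `threshold` and `keep_idx` are illustrative, and the explicit loop is not claimed to be faster than the `apply` version.

```python
import pandas as pd

df = pd.DataFrame({
    'userId':   ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'pageId':   ['P1', 'P1', 'P1', 'P1', 'P1', 'P1', 'P1', 'P2'],
    'bannerId': ['B1', 'B1', 'B1', 'B1', 'B1', 'B1', 'B1', 'B2'],
    'timestap': pd.to_datetime([
        '2020-10-10 01:00:00', '2020-10-10 01:04:10',
        '2020-10-10 01:05:00', '2020-10-10 01:08:20',
        '2020-10-10 01:09:30', '2020-10-10 01:11:00',
        '2020-10-10 01:00:00', '2020-10-10 02:00:00',
    ]),
})

keys = ['userId', 'pageId', 'bannerId']   # illustrative name
threshold = pd.Timedelta(minutes=5)       # same limit as nMin = 5

keep_idx = []
for _, ts in df.groupby(keys)['timestap']:
    anchor = None  # timestamp of the last kept row in this group
    for idx, t in ts.sort_values().items():
        # Keep a row only if it is at least `threshold` after the last kept row.
        if anchor is None or t - anchor >= threshold:
            anchor = t
            keep_idx.append(idx)

# Select the kept rows by their original labels, in original order.
result = df.loc[sorted(keep_idx)]
print(result)
```

Rows to delete can then be obtained as `df.index.difference(result.index)`.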

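Note that an "ordinary" diff is not enough. For example, starting from the row with timestamp 01:05:00, the following two rows (01:08:20 and 01:09:30) should both be removed, because they fall within the 5-minute limit from 01:05:00.

So looking only at the previous row is not enough. Starting from some row, you should "mark for deletion" all following rows until you find a row whose timestamp is at least the threshold away from the "start row". Only then does that row become the start row for analysing the following rows (within the current group).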
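To make the difference concrete, here is a small sketch that applies both rules to the six timestamps of group A/P1/B1 from the sample. The names `naive_keep` and `anchor_keep` are illustrative only. The first rule compares each row with its immediate predecessor, and the second compares it with the last kept row.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    '2020-10-10 01:00:00', '2020-10-10 01:04:10', '2020-10-10 01:05:00',
    '2020-10-10 01:08:20', '2020-10-10 01:09:30', '2020-10-10 01:11:00',
]))
threshold = pd.Timedelta(minutes=5)

# Rule 1 ("ordinary" diff): keep a row only if the gap to the
# immediately preceding row is at least the threshold.
gaps = ts.diff()
naive_keep = ts[gaps.isna() | (gaps >= threshold)]

# Rule 2 (start-row based): keep a row only if the gap to the
# last *kept* row is at least the threshold.
anchor_keep = []
anchor = None
for t in ts:
    if anchor is None or t - anchor >= threshold:
        anchor = t
        anchor_keep.append(t)

print(len(naive_keep), len(anchor_keep))
```

Every consecutive gap in this group is shorter than five minutes, so the first rule keeps only the 01:00:00 row. The second rule keeps 01:00:00, 01:05:00 and 01:11:00, matching the result shown earlier.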

