Get consecutive occurrences based on groups

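Problem description

I am trying to find a way to get groups of consecutive events, grouped by host and ordered by time. Ideally, I need only the groups that meet a certain size threshold and where isCorrect == false.

Example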

Time    |   Host    |   isCorrect   |
-------------------------------------
10:01   |   HostA   |   true        |
10:02   |   HostB   |   true        |
10:03   |   HostA   |   false       |
10:15   |   HostA   |   false       |
10:16   |   HostA   |   false       |
10:18   |   HostB   |   false       |
10:20   |   HostA   |   true        |
10:21   |   HostA   |   true        |
10:22   |   HostB   |   false       |
10:23   |   HostB   |   false       |

Threshold: >= 3

This would result in 2 groups:

Time    |   Host    |   isCorrect   |   Group
---------------------------------------------
10:03   |   HostA   |   false       |   1
10:15   |   HostA   |   false       |   1
10:16   |   HostA   |   false       |   1

10:18   |   HostB   |   false       |   2
10:22   |   HostB   |   false       |   2
10:23   |   HostB   |   false       |   2

I was reading https://towardsdatascience.com/pandas-dataframe-group-by-consecutive-certain-values-a6ed8e5d8cc but could not find a way to group by host first.

Tags: python, pandas

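Solution

First filter the False values by inverting the mask with ~ and sort the values (if necessary), then keep only the hosts that meet the threshold, and finally create the Group column with factorize: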

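import pandas as pd

# Keep only the False rows, ordered by host and then by time.
df = df[~df['isCorrect']].sort_values(['Host', 'Time'])
# True for rows whose host has at least 3 False rows in total.
mask = df['Host'].map(df['Host'].value_counts()) >= 3

df = df[mask].copy()
# Number the remaining hosts 1, 2, ... in order of first appearance.
df['Group'] = pd.factorize(df['Host'])[0] + 1
print(df)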

    Time   Host  isCorrect  Group
2  10:03  HostA      False      1
3  10:15  HostA      False      1
4  10:16  HostA      False      1
5  10:18  HostB      False      2
8  10:22  HostB      False      2
9  10:23  HostB      False      2
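
To make the intermediate steps of the snippet above easier to follow, here is a minimal, self-contained sketch that rebuilds the sample table from the question and prints what `map(value_counts())` and `factorize` produce. The names `fails`, `per_row_count`, and `kept` are only illustrative and are not part of the original answer. Note that this variant counts all False rows per host, whether or not they are consecutive.

```python
import pandas as pd

# Sample data copied from the question (Time kept as zero-padded strings).
df = pd.DataFrame({
    "Time": ["10:01", "10:02", "10:03", "10:15", "10:16",
             "10:18", "10:20", "10:21", "10:22", "10:23"],
    "Host": ["HostA", "HostB", "HostA", "HostA", "HostA",
             "HostB", "HostA", "HostA", "HostB", "HostB"],
    "isCorrect": [True, True, False, False, False,
                  False, True, True, False, False],
})

# Keep only the False rows, ordered by host and then by time.
fails = df[~df["isCorrect"]].sort_values(["Host", "Time"])

# Number of False rows per host, broadcast back onto every row of that host.
per_row_count = fails["Host"].map(fails["Host"].value_counts())
print(per_row_count.tolist())

# Apply the threshold from the question.
kept = fails[per_row_count >= 3].copy()

# factorize assigns 0, 1, ... to distinct hosts in order of first appearance.
codes, uniques = pd.factorize(kept["Host"])
kept["Group"] = codes + 1
print(list(uniques))
print(kept)
```

Because both hosts have exactly three False rows in the sample, every False row passes the threshold here.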

If grouping by consecutive Falses:

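# Start again from the original, time-ordered df.
# The running count of True rows gives each run of Falses its own label.
m = ~df['isCorrect']
df['Group'] = df['isCorrect'].cumsum()[m]

df = df[m].sort_values(['Host', 'Time'])

# Keep only the (run, host) pairs that contain at least 3 rows.
mask = df.groupby(['Group', 'Host'])['Group'].transform('size') >= 3

df = df[mask].copy()
df['Group'] = pd.factorize(df['Host'])[0] + 1
print(df)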
    Time   Host  isCorrect  Group
2  10:03  HostA      False      1
3  10:15  HostA      False      1
4  10:16  HostA      False      1
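
In the snippet above the run labels come from a cumulative sum over the whole time-ordered table, so a True row from one host also ends a run for every other host. That is why HostB's three False rows do not qualify. If the intent is instead to count runs within each host separately, which is what the expected output in the question suggests, one possible variation is sketched below. This is an assumption about the desired semantics, not part of the original answer; the `run` column and the `THRESHOLD` constant are illustrative names.

```python
import pandas as pd

# Sample data copied from the question (Time kept as zero-padded strings,
# which sort correctly as text).
df = pd.DataFrame({
    "Time": ["10:01", "10:02", "10:03", "10:15", "10:16",
             "10:18", "10:20", "10:21", "10:22", "10:23"],
    "Host": ["HostA", "HostB", "HostA", "HostA", "HostA",
             "HostB", "HostA", "HostA", "HostB", "HostB"],
    "isCorrect": [True, True, False, False, False,
                  False, True, True, False, False],
})

THRESHOLD = 3  # minimum run length, taken from the question

df = df.sort_values(["Host", "Time"])

# Per-host running count of True rows: False rows that sit between two
# True rows of the same host end up sharing one run id.
run_id = df["isCorrect"].astype(int).groupby(df["Host"]).cumsum()

# Keep only the False rows and attach their run id (aligned on the index).
fails = df[~df["isCorrect"]].assign(run=run_id)

# Size of each (host, run) pair, broadcast back onto its rows.
size = fails.groupby(["Host", "run"])["run"].transform("size")
result = fails[size >= THRESHOLD].copy()

# Number the surviving runs 1, 2, ... ordered by host and then run id.
result["Group"] = result.groupby(["Host", "run"]).ngroup() + 1
result = result.drop(columns="run")
print(result)
```

Using `ngroup` on the (host, run) pair rather than `factorize` on the host alone keeps two separate qualifying runs of the same host in different groups.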
