Pandas DataFrame: group by a column, sort by datetime, and truncate each group on a condition

Question

I have a Pandas DataFrame similar to the following:

import pandas as pd

df = pd.DataFrame([['a', '2018-09-30 00:03:00', 'that is a glove'],
                   ['b', '2018-09-30 00:04:00', 'this is a glove'],
                   ['b', '2018-09-30 00:09:00', 'she has ball'],
                   ['a', '2018-09-30 00:05:00', 'they have a ball'],
                   ['a', '2018-09-30 00:01:00', 'she has a shoe'],
                   ['c', '2018-09-30 00:04:00', 'I have a baseball'],
                   ['a', '2018-09-30 00:02:00', 'this is a hat'],
                   ['a', '2018-09-30 00:06:00', 'he has no helmet'],
                   ['b', '2018-09-30 00:11:00', 'he has no shoe'],
                   ['c', '2018-09-30 00:02:00', 'we have a hat'],
                   ['a', '2018-09-30 00:04:00', 'we have a baseball'],
                   ['c', '2018-09-30 00:06:00', 'they have no glove'],
                   ], 
                  columns=['id', 'time', 'equipment'])


   id                 time           equipment
0   a  2018-09-30 00:03:00     that is a glove
1   b  2018-09-30 00:04:00     this is a glove
2   b  2018-09-30 00:09:00        she has ball
3   a  2018-09-30 00:05:00    they have a ball
4   a  2018-09-30 00:01:00      she has a shoe
5   c  2018-09-30 00:04:00   I have a baseball
6   a  2018-09-30 00:02:00       this is a hat
7   a  2018-09-30 00:06:00    he has no helmet
8   b  2018-09-30 00:11:00      he has no shoe
9   c  2018-09-30 00:02:00       we have a hat
10  a  2018-09-30 00:04:00  we have a baseball
11  c  2018-09-30 00:06:00  they have no glove

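What I want to do is `groupby` `id`, sort each group by `time`, and then return every row up to and including the row that contains the word "ball". So far, I can group and sort: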

df.groupby('id').apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index(drop=True)


   id                 time           equipment
0   a  2018-09-30 00:01:00      she has a shoe
1   a  2018-09-30 00:02:00       this is a hat
2   a  2018-09-30 00:03:00     that is a glove
3   a  2018-09-30 00:04:00  we have a baseball
4   a  2018-09-30 00:05:00    they have a ball
5   a  2018-09-30 00:06:00    he has no helmet
6   b  2018-09-30 00:04:00     this is a glove
7   b  2018-09-30 00:09:00        she has ball
8   b  2018-09-30 00:11:00      he has no shoe
9   c  2018-09-30 00:02:00       we have a hat
10  c  2018-09-30 00:04:00   I have a baseball
11  c  2018-09-30 00:06:00  they have no glove
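
The same ordering can probably be obtained without `groupby(...).apply(...)` by sorting on both columns at once. A minimal sketch, assuming only the first six rows of the data above (the names `sample` and `ordered` are illustrative and do not come from the question):

```python
import pandas as pd

# First six rows of the question's data, used here as a short illustration
sample = pd.DataFrame(
    [['a', '2018-09-30 00:03:00', 'that is a glove'],
     ['b', '2018-09-30 00:04:00', 'this is a glove'],
     ['b', '2018-09-30 00:09:00', 'she has ball'],
     ['a', '2018-09-30 00:05:00', 'they have a ball'],
     ['a', '2018-09-30 00:01:00', 'she has a shoe'],
     ['c', '2018-09-30 00:04:00', 'I have a baseball']],
    columns=['id', 'time', 'equipment'])

# Parse the strings into real timestamps so the sort is chronological
sample['time'] = pd.to_datetime(sample['time'])

# One multi-key sort: first by id, then by time within each id
ordered = sample.sort_values(['id', 'time']).reset_index(drop=True)
print(ordered)
```

Converting `time` with `pd.to_datetime` is optional for these ISO-formatted strings, which already sort correctly as text, but it guards against formats that do not.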

However, I would like the output to look like this:

   id                 time           equipment
0   a  2018-09-30 00:01:00      she has a shoe
1   a  2018-09-30 00:02:00       this is a hat
2   a  2018-09-30 00:03:00     that is a glove
3   a  2018-09-30 00:04:00  we have a baseball
4   a  2018-09-30 00:05:00    they have a ball
6   b  2018-09-30 00:04:00     this is a glove
7   b  2018-09-30 00:09:00        she has ball

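Note that group `c` returns no rows, because none of its rows contain the word "ball". Group `c` does contain the word "baseball", but that is not the match we are looking for. Likewise, group `a` does not stop at the "baseball" row, because we only stop at the "ball" row. What is the most efficient way to achieve this, in terms of both speed and memory?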
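
The difference between "ball" and "baseball" comes down to whole-word matching. As a small illustration (the series `phrases` exists only for this demonstration), a regular expression with `\b` word boundaries matches "ball" only when it stands alone:

```python
import pandas as pd

phrases = pd.Series(['they have a ball', 'I have a baseball', 'she has ball'])

# Plain substring search also matches "baseball"
substring_hits = phrases.str.contains('ball', regex=False)

# \b marks a word boundary, so "baseball" is excluded
whole_word_hits = phrases.str.contains(r'\bball\b', regex=True)

print(substring_hits.tolist())   # [True, True, True]
print(whole_word_hits.tolist())  # [True, False, True]
```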

Tags: python, pandas, dataframe

Solution


Continuing from what you already have:

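# Sort the rows of each group by time (same as the snippet in the question)
new_df = df.groupby('id').apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index(drop=True)

# Flag rows whose text contains "ball" as a whole word ("baseball" does not match)
new_df["mask"] = new_df.groupby("id").apply(lambda x: x["equipment"].str.contains(r"\bball\b", regex=True)).reset_index(drop=True)

# Per group, keep rows up to and including the first flagged row;
# a group with no flagged row is sliced to zero rows
result = (new_df.groupby("id").apply(lambda x: x.iloc[:x.reset_index(drop=True)["mask"].
                                     idxmax() + 1 if x["equipment"].str.contains(r"\bball\b", regex=True).any() else 0])
          .reset_index(drop=True).drop("mask", axis=1))

print(result)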

Output:

  id                 time           equipment
0  a  2018-09-30 00:01:00      she has a shoe
1  a  2018-09-30 00:02:00       this is a hat
2  a  2018-09-30 00:03:00     that is a glove
3  a  2018-09-30 00:04:00  we have a baseball
4  a  2018-09-30 00:05:00    they have a ball
5  b  2018-09-30 00:04:00     this is a glove
6  b  2018-09-30 00:09:00        she has ball
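
Since the question asks about speed and memory, one alternative worth considering avoids `apply` altogether and relies on vectorized group operations. This is a sketch under assumptions, not a benchmarked answer: the names `ordered`, `is_ball`, `seen_before`, and `has_ball` are illustrative, and any performance gain would need to be measured on real data.

```python
import pandas as pd

df = pd.DataFrame([['a', '2018-09-30 00:03:00', 'that is a glove'],
                   ['b', '2018-09-30 00:04:00', 'this is a glove'],
                   ['b', '2018-09-30 00:09:00', 'she has ball'],
                   ['a', '2018-09-30 00:05:00', 'they have a ball'],
                   ['a', '2018-09-30 00:01:00', 'she has a shoe'],
                   ['c', '2018-09-30 00:04:00', 'I have a baseball'],
                   ['a', '2018-09-30 00:02:00', 'this is a hat'],
                   ['a', '2018-09-30 00:06:00', 'he has no helmet'],
                   ['b', '2018-09-30 00:11:00', 'he has no shoe'],
                   ['c', '2018-09-30 00:02:00', 'we have a hat'],
                   ['a', '2018-09-30 00:04:00', 'we have a baseball'],
                   ['c', '2018-09-30 00:06:00', 'they have no glove']],
                  columns=['id', 'time', 'equipment'])

# Sort once by id, then by time (these ISO-formatted strings sort chronologically)
ordered = df.sort_values(['id', 'time']).reset_index(drop=True)

# True where "ball" appears as a whole word
is_ball = ordered['equipment'].str.contains(r'\bball\b', regex=True)

# Number of matches seen strictly before each row within its group
hits = is_ball.astype(int)
seen_before = hits.groupby(ordered['id']).cumsum() - hits

# True for every row of a group that contains at least one match
has_ball = is_ball.groupby(ordered['id']).transform('any')

# Keep rows up to and including the first match, only in groups that have one
result = ordered[has_ball & (seen_before == 0)]
print(result)
```

With the sample data, this keeps the first five rows of group `a` and the first two rows of group `b`, and drops group `c` entirely. The original index labels of `ordered` are preserved, so call `reset_index(drop=True)` on `result` if consecutive numbering is preferred.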
