python - Pandas DataFrame: group by a column, sort by datetime, and truncate each group by a condition
Question
I have a Pandas DataFrame similar to the following:
import pandas as pd
df = pd.DataFrame([['a', '2018-09-30 00:03:00', 'that is a glove'],
['b', '2018-09-30 00:04:00', 'this is a glove'],
['b', '2018-09-30 00:09:00', 'she has ball'],
['a', '2018-09-30 00:05:00', 'they have a ball'],
['a', '2018-09-30 00:01:00', 'she has a shoe'],
['c', '2018-09-30 00:04:00', 'I have a baseball'],
['a', '2018-09-30 00:02:00', 'this is a hat'],
['a', '2018-09-30 00:06:00', 'he has no helmet'],
['b', '2018-09-30 00:11:00', 'he has no shoe'],
['c', '2018-09-30 00:02:00', 'we have a hat'],
['a', '2018-09-30 00:04:00', 'we have a baseball'],
['c', '2018-09-30 00:06:00', 'they have no glove'],
],
columns=['id', 'time', 'equipment'])
id time equipment
0 a 2018-09-30 00:03:00 that is a glove
1 b 2018-09-30 00:04:00 this is a glove
2 b 2018-09-30 00:09:00 she has ball
3 a 2018-09-30 00:05:00 they have a ball
4 a 2018-09-30 00:01:00 she has a shoe
5 c 2018-09-30 00:04:00 I have a baseball
6 a 2018-09-30 00:02:00 this is a hat
7 a 2018-09-30 00:06:00 he has no helmet
8 b 2018-09-30 00:11:00 he has no shoe
9 c 2018-09-30 00:02:00 we have a hat
10 a 2018-09-30 00:04:00 we have a baseball
11 c 2018-09-30 00:06:00 they have no glove
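What I want to do is groupby id, sort by time within each group, and then return every row up to and including the first row that contains the word "ball". So far, I can group and sort: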
df.groupby('id').apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index(drop=True)
id time equipment
0 a 2018-09-30 00:01:00 she has a shoe
1 a 2018-09-30 00:02:00 this is a hat
2 a 2018-09-30 00:03:00 that is a glove
3 a 2018-09-30 00:04:00 we have a baseball
4 a 2018-09-30 00:05:00 they have a ball
5 a 2018-09-30 00:06:00 he has no helmet
6 b 2018-09-30 00:04:00 this is a glove
7 b 2018-09-30 00:09:00 she has ball
8 b 2018-09-30 00:11:00 he has no shoe
9 c 2018-09-30 00:02:00 we have a hat
10 c 2018-09-30 00:04:00 I have a baseball
11 c 2018-09-30 00:06:00 they have no glove
However, I would like the output to look like this:
id time equipment
0 a 2018-09-30 00:01:00 she has a shoe
1 a 2018-09-30 00:02:00 this is a hat
2 a 2018-09-30 00:03:00 that is a glove
3 a 2018-09-30 00:04:00 we have a baseball
4 a 2018-09-30 00:05:00 they have a ball
6 b 2018-09-30 00:04:00 this is a glove
7 b 2018-09-30 00:09:00 she has ball
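Note that group c returns no rows, because it has no row containing the word "ball". Group c does contain the word "baseball", but that is not the match we are looking for. Likewise, group a does not stop at the "baseball" row, because we only stop at the "ball" row. What is the most efficient way to achieve this, in terms of both speed and memory?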
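The "ball" versus "baseball" distinction above comes down to whole-word matching. A minimal sketch of the difference, assuming a regex word boundary (`\b`) is an acceptable definition of "word" here; the sample phrases are copied from the question's data, and the variable names are only illustrative:

```python
import pandas as pd

# Sample phrases copied from the question's "equipment" column.
equipment = pd.Series([
    "we have a baseball",
    "they have a ball",
    "she has ball",
    "I have a baseball",
])

# Plain substring search: "baseball" also matches, which is not wanted here.
substring_hits = equipment.str.contains("ball", regex=False)

# Word-boundary regex: only the standalone word "ball" matches.
whole_word_hits = equipment.str.contains(r"\bball\b", regex=True)

print(substring_hits.tolist())   # [True, True, True, True]
print(whole_word_hits.tolist())  # [False, True, True, False]
```
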
Answer
Continuing from what you have:
# Sort each id group by time (the step from the question).
new_df = df.groupby('id').apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index(drop=True)
# Flag rows containing "ball" as a whole word; \b stops "baseball" from matching.
new_df["mask"] = new_df["equipment"].str.contains(r"\bball\b", regex=True)
# Per group: keep rows up to and including the first match, or no rows if the group has no match.
result = (new_df.groupby("id")
          .apply(lambda x: x.iloc[:x.reset_index(drop=True)["mask"].idxmax() + 1 if x["mask"].any() else 0])
          .reset_index(drop=True)
          .drop("mask", axis=1))
print(result)
id time equipment
0 a 2018-09-30 00:01:00 she has a shoe
1 a 2018-09-30 00:02:00 this is a hat
2 a 2018-09-30 00:03:00 that is a glove
3 a 2018-09-30 00:04:00 we have a baseball
4 a 2018-09-30 00:05:00 they have a ball
5 b 2018-09-30 00:04:00 this is a glove
6 b 2018-09-30 00:09:00 she has ball
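Since the question asks about speed and memory, here is a sketch of one alternative that avoids `groupby(...).apply` with a Python lambda and relies on column-wise operations instead. It is not benchmarked here, so treat any performance benefit as an assumption to verify on your own data. The helper name `truncate_at_first_match` and its `pattern` parameter are hypothetical, introduced only for illustration. The sketch also assumes the `time` strings are in a fixed `YYYY-MM-DD HH:MM:SS` format, so that sorting them as text gives chronological order.

```python
import pandas as pd

df = pd.DataFrame([['a', '2018-09-30 00:03:00', 'that is a glove'],
                   ['b', '2018-09-30 00:04:00', 'this is a glove'],
                   ['b', '2018-09-30 00:09:00', 'she has ball'],
                   ['a', '2018-09-30 00:05:00', 'they have a ball'],
                   ['a', '2018-09-30 00:01:00', 'she has a shoe'],
                   ['c', '2018-09-30 00:04:00', 'I have a baseball'],
                   ['a', '2018-09-30 00:02:00', 'this is a hat'],
                   ['a', '2018-09-30 00:06:00', 'he has no helmet'],
                   ['b', '2018-09-30 00:11:00', 'he has no shoe'],
                   ['c', '2018-09-30 00:02:00', 'we have a hat'],
                   ['a', '2018-09-30 00:04:00', 'we have a baseball'],
                   ['c', '2018-09-30 00:06:00', 'they have no glove']],
                  columns=['id', 'time', 'equipment'])


def truncate_at_first_match(frame, pattern=r"\bball\b"):
    """Hypothetical helper: per id, keep rows up to and including the first match."""
    # Sort by id, then by time within each id (text sort; assumes fixed-format timestamps).
    ordered = frame.sort_values(['id', 'time']).reset_index(drop=True)

    # True where the equipment text matches the pattern.
    hit = ordered['equipment'].str.contains(pattern, regex=True)
    hit_int = hit.astype(int)

    # Number of matches strictly before each row, counted within its id group.
    hits_before = hit_int.groupby(ordered['id']).cumsum() - hit_int

    # True for every row whose id group contains at least one match.
    group_has_hit = ordered['id'].isin(ordered.loc[hit, 'id'].unique())

    # Keep rows with no earlier match in their group, but only in groups that match at all.
    keep = (hits_before == 0) & group_has_hit
    return ordered[keep].reset_index(drop=True)


result = truncate_at_first_match(df)
print(result)
```

With the question's data this keeps five rows for group a (ending at "they have a ball"), two rows for group b (ending at "she has ball"), and none for group c. If `time` holds mixed or non-sortable formats, converting it with `pd.to_datetime` before sorting would be the safer choice.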