python - Pandas 中的矢量化操作
问题描述
我在大型 Pandas DataFrame 上进行了此操作,当然速度非常慢。
def get_last_status_in_range(df, created_dt, created_id, window_size=15, gap_size=5):
since = created_dt - timedelta(days=(window_size + gap_size))
until = created_dt - timedelta(days=gap_size)
try:
status = df[(df.created_dt >= since) & (df.created_dt < until) &
(df.number_id == created_id)]['status'].iloc[-1]
except IndexError:
# Not found
status = None
return status
idx = 0
last_status_in_range = np.array([None] * len(df), dtype=str)
for row in df.itertuples():
created_dt = row.created_dt
created_id = row.number_id
last_status_in_range[idx] = get_last_status_in_range(df, created_dt, created_id)
idx += 1
我的目标是给出一个包含“created_dt”、“number_id”和“status”列的 DF,为每一行获取相同“number_id”但在过去指定日期范围内的最后一个“状态”。
到目前为止,我发现的唯一方法是执行上述操作,但是对于大型 DataFrame,它非常慢,而且我找不到矢量化方法来做到这一点。
如何使用同一 DataFrame 中的某些值对操作进行矢量化?
编辑:
给定以下DF:
In [120]: df
Out[120]:
number_id created_dt status
20 BBB 2018-05-18 20:28:51.388001 u
12 BBB 2018-05-19 12:28:51.388001 u
2 CCC 2018-05-19 23:28:51.388001 u
27 CCC 2018-05-20 22:28:51.388001 a
1 CCC 2018-05-21 05:28:51.388001 u
14 BBB 2018-05-21 12:28:51.388001 r
17 AAA 2018-05-24 21:28:51.388001 a
28 CCC 2018-05-30 16:28:51.388001 a
0 AAA 2018-05-31 23:28:51.388001 r
24 CCC 2018-06-01 00:28:51.388001 r
4 BBB 2018-06-01 11:28:51.388001 r
23 BBB 2018-06-01 19:28:51.388001 r
6 AAA 2018-06-03 14:28:51.388001 a
3 CCC 2018-06-04 15:28:51.388001 u
19 AAA 2018-06-05 06:28:51.388001 u
5 AAA 2018-06-05 20:28:51.388001 r
21 AAA 2018-06-06 04:28:51.388001 a
9 BBB 2018-06-06 18:28:51.388001 r
25 AAA 2018-06-07 15:28:51.388001 r
11 BBB 2018-06-08 09:28:51.388001 r
10 BBB 2018-06-08 21:28:51.388001 u
13 BBB 2018-06-09 04:28:51.388001 a
7 AAA 2018-06-09 16:28:51.388001 r
22 AAA 2018-06-12 07:28:51.388001 r
26 BBB 2018-06-13 03:28:51.388001 u
15 AAA 2018-06-14 08:28:51.388001 a
8 CCC 2018-06-14 14:28:51.388001 r
18 CCC 2018-06-15 17:28:51.388001 u
16 BBB 2018-06-16 02:28:51.388001 a
29 AAA 2018-06-16 08:28:51.388001 r
30 AAA 2018-06-17 02:28:51.388001 a
我希望输出是:
In [124]: df
Out[124]:
number_id created_dt status prev_status
20 BBB 2018-05-18 20:28:51.388001 u None
12 BBB 2018-05-19 12:28:51.388001 u None
2 CCC 2018-05-19 23:28:51.388001 u None
27 CCC 2018-05-20 22:28:51.388001 a None
1 CCC 2018-05-21 05:28:51.388001 u None
14 BBB 2018-05-21 12:28:51.388001 r None
17 AAA 2018-05-24 21:28:51.388001 a None
28 CCC 2018-05-30 16:28:51.388001 a u
0 AAA 2018-05-31 23:28:51.388001 r a
24 CCC 2018-06-01 00:28:51.388001 r u
4 BBB 2018-06-01 11:28:51.388001 r r
23 BBB 2018-06-01 19:28:51.388001 r r
6 AAA 2018-06-03 14:28:51.388001 a a
3 CCC 2018-06-04 15:28:51.388001 u u
19 AAA 2018-06-05 06:28:51.388001 u a
5 AAA 2018-06-05 20:28:51.388001 r a
21 AAA 2018-06-06 04:28:51.388001 a r
9 BBB 2018-06-06 18:28:51.388001 r r
25 AAA 2018-06-07 15:28:51.388001 r r
11 BBB 2018-06-08 09:28:51.388001 r r
10 BBB 2018-06-08 21:28:51.388001 u r
13 BBB 2018-06-09 04:28:51.388001 a r
7 AAA 2018-06-09 16:28:51.388001 r a
22 AAA 2018-06-12 07:28:51.388001 r a
26 BBB 2018-06-13 03:28:51.388001 u r
15 AAA 2018-06-14 08:28:51.388001 a r
8 CCC 2018-06-14 14:28:51.388001 r u
18 CCC 2018-06-15 17:28:51.388001 u u
16 BBB 2018-06-16 02:28:51.388001 a a
29 AAA 2018-06-16 08:28:51.388001 r r
30 AAA 2018-06-17 02:28:51.388001 a r
如您所见,“prev_status”列中的值与匹配相同“number_id”的上一行的值相同(其中上一行是在将日期条件应用于“created_dt”列之后)
解决方案
该技术使用关系代数来加速运算,而不是矢量化
使用pandas.merge_asof
,我们可以合并两个 DataFrame,从第二帧中选择最后一行,其中比较字段低于第一帧的比较字段。
创建一个名为 的列until
。这是我们稍后将丢弃的临时列
df['until'] = df.created_dt - pd.Timedelta(days=5)
使用 until & created_dt 将 df 合并到自身上,即最后一行,这样右 dfcreated_dt
位于左 df 之前until
,并且number_id
两个 df 相同
merged = pd.merge_asof(df, df, left_on='until', right_on='created_dt', by='number_id', suffixes=('', '_y'), allow_exact_matches=False)
设置status_y
为之前的np.nan
位置-created_dt_y
created_dt
20 days
merged.loc[~(merged.created_dt_y >= merged.created_dt - pd.Timedelta(days=20)), 'status_y'] = np.nan
在这里,我们必须否定after条件,因为merged.created_dt_y
包含不匹配过滤器的空值。
最后,选择所需的列:
merged[['number_id', 'created_dt', 'status', 'status_y']]
# outputs:
number_id created_dt status status_y
0 BBB 2018-05-18 20:28:51.388001 u NaN
1 BBB 2018-05-19 12:28:51.388001 u NaN
2 CCC 2018-05-19 23:28:51.388001 u NaN
3 CCC 2018-05-20 22:28:51.388001 a NaN
4 CCC 2018-05-21 05:28:51.388001 u NaN
5 BBB 2018-05-21 12:28:51.388001 r NaN
6 AAA 2018-05-24 21:28:51.388001 a NaN
7 CCC 2018-05-30 16:28:51.388001 a u
8 AAA 2018-05-31 23:28:51.388001 r a
9 CCC 2018-06-01 00:28:51.388001 r u
10 BBB 2018-06-01 11:28:51.388001 r r
11 BBB 2018-06-01 19:28:51.388001 r r
12 AAA 2018-06-03 14:28:51.388001 a a
13 CCC 2018-06-04 15:28:51.388001 u u
14 AAA 2018-06-05 06:28:51.388001 u a
15 AAA 2018-06-05 20:28:51.388001 r a
16 AAA 2018-06-06 04:28:51.388001 a r
17 BBB 2018-06-06 18:28:51.388001 r r
18 AAA 2018-06-07 15:28:51.388001 r r
19 BBB 2018-06-08 09:28:51.388001 r r
20 BBB 2018-06-08 21:28:51.388001 u r
21 BBB 2018-06-09 04:28:51.388001 a r
22 AAA 2018-06-09 16:28:51.388001 r a
23 AAA 2018-06-12 07:28:51.388001 r a
24 BBB 2018-06-13 03:28:51.388001 u r
25 AAA 2018-06-14 08:28:51.388001 a r
26 CCC 2018-06-14 14:28:51.388001 r u
27 CCC 2018-06-15 17:28:51.388001 u u
28 BBB 2018-06-16 02:28:51.388001 a a
29 AAA 2018-06-16 08:28:51.388001 r r
30 AAA 2018-06-17 02:28:51.388001 a r
基准测试结果:
即使在 30 行的小型 DataFrame 上,我们也看到大约 7 倍的性能提升
%timeit slow(df)
# outputs:
41 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit fast(df)
# outputs:
5.69 ms ± 34 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
使用的代码:
def slow(df):
idx = 0
last_status_in_range = np.array([None] * len(df), dtype=str)
for row in df.itertuples():
created_dt = row.created_dt
created_id = row.number_id
last_status_in_range[idx] = get_last_status_in_range(df, created_dt, created_id)
idx += 1
return df.assign(prev_status=last_status_in_range)
def fast(df):
d = df.assign(until = df.created_dt - pd.Timedelta(days=5))
merged = pd.merge_asof(
d, d, left_on='until', right_on='created_dt',
by='number_id', suffixes=('', '_y'),
allow_exact_matches=False
)
merged.loc[
~(merged.created_dt_y >= merged.created_dt - pd.Timedelta(days=20)),
'status_y'
] = np.nan
return merged[['number_id', 'created_dt', 'status', 'status_y']]
推荐阅读
- c++ - 获取从其他模块访问变量的路径
- go - 根据服务请求发布部署 Pod 创建/删除
- java - 翻译转换在按键上无法正常工作
- nginx - 将 websocket 转发到 websocket
- c# - 可以覆盖sql字符串中的查询吗?C#
- python - 如何从python中的列表中删除相似的关键字?
- android - CompanionDeviceManager 关联方法不调用回调方法
- json - 在 Angular 项目的顶部和侧面标题表中显示 2 级 json 键值对
- python - 使用文件名中的日期对 CSV 文件进行排序
- sql - MSSQL如何处理空值以在数据透视中计算它们