Vectorized operations in Pandas

Problem description

I am running the following operation on a large Pandas DataFrame, and unsurprisingly it is very slow.

from datetime import timedelta

import numpy as np
import pandas as pd

def get_last_status_in_range(df, created_dt, created_id, window_size=15, gap_size=5):
    since = created_dt - timedelta(days=(window_size + gap_size))
    until = created_dt - timedelta(days=gap_size)
    try:
        status = df[(df.created_dt >= since) & (df.created_dt < until) &
                    (df.number_id == created_id)]['status'].iloc[-1]
    except IndexError:
        # Not found
        status = None
    return status

idx = 0
last_status_in_range = np.array([None] * len(df), dtype=str)
for row in df.itertuples():
    created_dt = row.created_dt
    created_id = row.number_id
    last_status_in_range[idx] = get_last_status_in_range(df, created_dt, created_id)
    idx += 1

My goal is: given a DF with the columns 'created_dt', 'number_id' and 'status', for each row fetch the last 'status' of the same 'number_id' within a specified past date range.

The only way I have found so far is the one above, but it is very slow on large DataFrames, and I cannot find a vectorized way to do this.

How can I vectorize an operation that uses values from the same DataFrame?

Edit:

Given the following DF:

In [120]: df
Out[120]: 
   number_id                 created_dt status
20     BBB 2018-05-18 20:28:51.388001      u
12     BBB 2018-05-19 12:28:51.388001      u
2      CCC 2018-05-19 23:28:51.388001      u
27     CCC 2018-05-20 22:28:51.388001      a
1      CCC 2018-05-21 05:28:51.388001      u
14     BBB 2018-05-21 12:28:51.388001      r
17     AAA 2018-05-24 21:28:51.388001      a
28     CCC 2018-05-30 16:28:51.388001      a
0      AAA 2018-05-31 23:28:51.388001      r
24     CCC 2018-06-01 00:28:51.388001      r
4      BBB 2018-06-01 11:28:51.388001      r
23     BBB 2018-06-01 19:28:51.388001      r
6      AAA 2018-06-03 14:28:51.388001      a
3      CCC 2018-06-04 15:28:51.388001      u
19     AAA 2018-06-05 06:28:51.388001      u
5      AAA 2018-06-05 20:28:51.388001      r
21     AAA 2018-06-06 04:28:51.388001      a
9      BBB 2018-06-06 18:28:51.388001      r
25     AAA 2018-06-07 15:28:51.388001      r
11     BBB 2018-06-08 09:28:51.388001      r
10     BBB 2018-06-08 21:28:51.388001      u
13     BBB 2018-06-09 04:28:51.388001      a
7      AAA 2018-06-09 16:28:51.388001      r
22     AAA 2018-06-12 07:28:51.388001      r
26     BBB 2018-06-13 03:28:51.388001      u
15     AAA 2018-06-14 08:28:51.388001      a
8      CCC 2018-06-14 14:28:51.388001      r
18     CCC 2018-06-15 17:28:51.388001      u
16     BBB 2018-06-16 02:28:51.388001      a
29     AAA 2018-06-16 08:28:51.388001      r
30     AAA 2018-06-17 02:28:51.388001      a

I would like the output to be:

In [124]: df
Out[124]: 
   number_id                 created_dt status prev_status
20     BBB 2018-05-18 20:28:51.388001      u        None
12     BBB 2018-05-19 12:28:51.388001      u        None
2      CCC 2018-05-19 23:28:51.388001      u        None
27     CCC 2018-05-20 22:28:51.388001      a        None
1      CCC 2018-05-21 05:28:51.388001      u        None
14     BBB 2018-05-21 12:28:51.388001      r        None
17     AAA 2018-05-24 21:28:51.388001      a        None
28     CCC 2018-05-30 16:28:51.388001      a           u
0      AAA 2018-05-31 23:28:51.388001      r           a
24     CCC 2018-06-01 00:28:51.388001      r           u
4      BBB 2018-06-01 11:28:51.388001      r           r
23     BBB 2018-06-01 19:28:51.388001      r           r
6      AAA 2018-06-03 14:28:51.388001      a           a
3      CCC 2018-06-04 15:28:51.388001      u           u
19     AAA 2018-06-05 06:28:51.388001      u           a
5      AAA 2018-06-05 20:28:51.388001      r           a
21     AAA 2018-06-06 04:28:51.388001      a           r
9      BBB 2018-06-06 18:28:51.388001      r           r
25     AAA 2018-06-07 15:28:51.388001      r           r
11     BBB 2018-06-08 09:28:51.388001      r           r
10     BBB 2018-06-08 21:28:51.388001      u           r
13     BBB 2018-06-09 04:28:51.388001      a           r
7      AAA 2018-06-09 16:28:51.388001      r           a
22     AAA 2018-06-12 07:28:51.388001      r           a
26     BBB 2018-06-13 03:28:51.388001      u           r
15     AAA 2018-06-14 08:28:51.388001      a           r
8      CCC 2018-06-14 14:28:51.388001      r           u
18     CCC 2018-06-15 17:28:51.388001      u           u
16     BBB 2018-06-16 02:28:51.388001      a           a
29     AAA 2018-06-16 08:28:51.388001      r           r
30     AAA 2018-06-17 02:28:51.388001      a           r

As you can see, the value in the 'prev_status' column is the status of the previous row with the same 'number_id', where "previous" is determined after applying the date-range condition to the 'created_dt' column.

Tags: python, pandas, numpy, vectorization

Solution


This technique uses relational algebra rather than vectorization to speed up the operation.

With pandas.merge_asof we can merge two DataFrames, selecting from the second frame the last row whose merge key falls below the merge key of the first frame.
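To illustrate the matching rule, here is a minimal sketch with made-up timestamps (the default `direction='backward'` picks, for each left row, the last right row whose key is at or before the left key):

```python
import pandas as pd

# Hypothetical data: each left timestamp is matched to the most recent
# right timestamp that is <= it (merge_asof's "backward" direction).
left = pd.DataFrame({'t': pd.to_datetime(['2018-01-03', '2018-01-07'])})
right = pd.DataFrame({'t': pd.to_datetime(['2018-01-01', '2018-01-05']),
                      'status': ['a', 'b']})

out = pd.merge_asof(left, right, on='t')
print(out['status'].tolist())  # ['a', 'b']
```

Both frames must be sorted by the merge key; in the solution below that ordering already holds because `until` is a constant offset from `created_dt`.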

Create a column named `until`. This is a temporary column that we will drop later:

df['until'] = df.created_dt - pd.Timedelta(days=5)

Merge df onto itself on `until` and `created_dt`: for each left row, take the last right row whose `created_dt` lies strictly before the left row's `until`, with `number_id` equal in both frames:

merged = pd.merge_asof(df, df, left_on='until', right_on='created_dt', by='number_id', suffixes=('', '_y'), allow_exact_matches=False)

Set `status_y` to `np.nan` wherever `created_dt_y` falls before `created_dt - 20 days` (i.e. window_size + gap_size):

merged.loc[~(merged.created_dt_y >= merged.created_dt - pd.Timedelta(days=20)), 'status_y'] = np.nan

Here we have to negate the condition rather than test "before" directly, because `merged.created_dt_y` contains NaT for the rows that matched nothing, and any comparison against NaT evaluates to False.
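The NaT behavior that forces the negation can be seen on a tiny example (hypothetical values):

```python
import pandas as pd

# NaT compares as False against everything, so the direct "is recent
# enough" test is False for unmatched rows; negating it selects exactly
# the rows whose status_y must be (or stay) NaN.
s = pd.Series(pd.to_datetime(['2018-06-01', None]))  # second value is NaT
cutoff = pd.Timestamp('2018-05-20')

print((s >= cutoff).tolist())     # [True, False]  -- NaT row is False
print((~(s >= cutoff)).tolist())  # [False, True]  -- NaT row is selected
```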

Finally, select the desired columns:

merged[['number_id', 'created_dt', 'status', 'status_y']]
# outputs:
   number_id                 created_dt status status_y
0        BBB 2018-05-18 20:28:51.388001      u      NaN
1        BBB 2018-05-19 12:28:51.388001      u      NaN
2        CCC 2018-05-19 23:28:51.388001      u      NaN
3        CCC 2018-05-20 22:28:51.388001      a      NaN
4        CCC 2018-05-21 05:28:51.388001      u      NaN
5        BBB 2018-05-21 12:28:51.388001      r      NaN
6        AAA 2018-05-24 21:28:51.388001      a      NaN
7        CCC 2018-05-30 16:28:51.388001      a        u
8        AAA 2018-05-31 23:28:51.388001      r        a
9        CCC 2018-06-01 00:28:51.388001      r        u
10       BBB 2018-06-01 11:28:51.388001      r        r
11       BBB 2018-06-01 19:28:51.388001      r        r
12       AAA 2018-06-03 14:28:51.388001      a        a
13       CCC 2018-06-04 15:28:51.388001      u        u
14       AAA 2018-06-05 06:28:51.388001      u        a
15       AAA 2018-06-05 20:28:51.388001      r        a
16       AAA 2018-06-06 04:28:51.388001      a        r
17       BBB 2018-06-06 18:28:51.388001      r        r
18       AAA 2018-06-07 15:28:51.388001      r        r
19       BBB 2018-06-08 09:28:51.388001      r        r
20       BBB 2018-06-08 21:28:51.388001      u        r
21       BBB 2018-06-09 04:28:51.388001      a        r
22       AAA 2018-06-09 16:28:51.388001      r        a
23       AAA 2018-06-12 07:28:51.388001      r        a
24       BBB 2018-06-13 03:28:51.388001      u        r
25       AAA 2018-06-14 08:28:51.388001      a        r
26       CCC 2018-06-14 14:28:51.388001      r        u
27       CCC 2018-06-15 17:28:51.388001      u        u
28       BBB 2018-06-16 02:28:51.388001      a        a
29       AAA 2018-06-16 08:28:51.388001      r        r
30       AAA 2018-06-17 02:28:51.388001      a        r

Benchmark results:

Even on this small 30-row DataFrame we see roughly a 7x speedup:

%timeit slow(df)
# outputs:
41 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit fast(df)
# outputs:
5.69 ms ± 34 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Code used:

def slow(df):
    idx = 0
    last_status_in_range = np.array([None] * len(df), dtype=str)
    for row in df.itertuples():
        created_dt = row.created_dt
        created_id = row.number_id
        last_status_in_range[idx] = get_last_status_in_range(df, created_dt, created_id)
        idx += 1
    return df.assign(prev_status=last_status_in_range)

def fast(df):
    d = df.assign(until=df.created_dt - pd.Timedelta(days=5))
    merged = pd.merge_asof(
        d, d, left_on='until', right_on='created_dt',
        by='number_id', suffixes=('', '_y'),
        allow_exact_matches=False
    )
    merged.loc[
        ~(merged.created_dt_y >= merged.created_dt - pd.Timedelta(days=20)),
        'status_y'
    ] = np.nan
    return merged[['number_id', 'created_dt', 'status', 'status_y']]
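The whole pipeline can be sanity-checked end to end on a tiny hypothetical frame: one id, three rows spaced so that only the last row finds a predecessor inside the `(created_dt - 20d, created_dt - 5d)` window.

```python
import pandas as pd
import numpy as np

# Hypothetical three-row frame: only the 05-10 row has an earlier row
# (05-03) falling strictly inside its lookback window.
df = pd.DataFrame({
    'number_id': ['AAA', 'AAA', 'AAA'],
    'created_dt': pd.to_datetime(['2018-05-01', '2018-05-03', '2018-05-10']),
    'status': ['u', 'a', 'r'],
})

d = df.assign(until=df.created_dt - pd.Timedelta(days=5))
merged = pd.merge_asof(d, d, left_on='until', right_on='created_dt',
                       by='number_id', suffixes=('', '_y'),
                       allow_exact_matches=False)
merged.loc[~(merged.created_dt_y >= merged.created_dt - pd.Timedelta(days=20)),
           'status_y'] = np.nan

print(merged['status_y'].tolist())  # [nan, nan, 'a']
```

The first two rows get `nan` because no earlier row exists for them, and the last row picks up the status `'a'` of the 05-03 row.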
