首页 > 解决方案 > Pandas 在大型数据帧上重新采样 n 个点

问题描述

假设我有以下数据框

    id                time        lat       long
0    1 2020-11-01 21:48:00  66.027694  12.627349
1    1 2020-11-01 21:49:00  66.027833  12.630198
2    1 2020-11-01 21:50:00  66.027900  12.635473
3    1 2020-11-01 21:51:00  66.027967  12.640748
4    1 2020-11-01 21:52:00  66.028350  12.643367
5    1 2020-11-01 21:53:00  66.028450  12.643948
6    1 2020-11-01 21:54:00  66.028183  12.643750
7    1 2020-11-01 21:55:00  66.027767  12.643016
8    2 2020-11-01 23:30:00  66.031667  12.639148
9    2 2020-11-01 23:31:00  66.034033  12.637517
10   2 2020-11-01 23:32:00  66.036950  12.636683
11   2 2020-11-01 23:33:00  66.039742  12.636417
12   2 2020-11-01 23:34:00  66.042533  12.636150
13   2 2020-11-01 23:35:00  66.044725  12.636541
14   2 2020-11-01 23:36:00  66.046867  12.637715
15   2 2020-11-01 23:37:00  66.050550  12.641467
16   2 2020-11-01 23:38:00  66.053014  12.644047
17   2 2020-11-01 23:39:00  66.055478  12.646627
18   2 2020-11-01 23:40:00  66.057942  12.649207
19   2 2020-11-01 23:41:00  66.060406  12.651788
20   2 2020-11-01 23:42:00  66.062869  12.654368
21   2 2020-11-01 23:43:00  66.065333  12.656948
22   2 2020-11-01 23:44:00  66.067255  12.658876
23   2 2020-11-01 23:45:00  66.069177  12.660804
24   2 2020-11-01 23:46:00  66.071098  12.662732

而且我想通过每个组的 id 号对每个组进行重新采样,所以我为每个组得到 5 个等距分布的点(在时间上)。

从上面的例子看,结果应该是这样的。

   id                time        lat       long
0   1 2020-11-01 21:47:15  66.027694  12.627349
1   1 2020-11-01 21:49:00  66.027867  12.632836
2   1 2020-11-01 21:50:45  66.028158  12.642057
3   1 2020-11-01 21:52:30  66.028317  12.643849
4   1 2020-11-01 21:54:15  66.027767  12.643016
5   2 2020-11-01 23:28:00  66.032850  12.638333
6   2 2020-11-01 23:32:00  66.040987  12.636448
7   2 2020-11-01 23:36:00  66.051477  12.642464
8   2 2020-11-01 23:40:00  66.061638  12.653078
9   2 2020-11-01 23:44:00  66.069177  12.660804

我已经解决了它并得到了想要的结果,但它的速度很慢,因为我没有 25 行,而是 +1000 万行。

有一个比我的更好的解决方案

我的代码是:

# Define amount of points
points = 5

# route is the input dataframe (see first table from above)
groups = route.groupby('id')

# 'times' is for getting the first and last time in each group
times = groups['time'].agg(['first','last']).reset_index()

# Calculation the time step for getting 5 datapoints
times['diff'] = (times['last'] - times['first'])/(points-1)

# For saving each series of points
waypoints = []
for (name, group), (time_name, time_group) in zip(groups, times_groups):
   # Time step to string in seconds (Not the best solution)
   str_time = "{0}s".format(int(time_group['diff'].iloc[0].total_seconds()))
   # Saving points
   waypoints.append(
      group.set_index('time').groupby(
         'id'
      ).resample(
         str_time
      ).mean().interpolate('linear').drop('id', axis = 1).reset_index()
   )
# Concatenate back to dataframe (see last table from above)
pd_waypoints = pd.concat(waypoints).reset_index()

标签: pythonpandas

解决方案


这是加快速度的一种方法。这个想法是复制什么resample,这本质上是一个截断时间的 groupby,但对不同的 id 使用不同的频率,而不是一个一个地遍历组(除了计算频率):

# make a copy of the route dataframe to work on
z = route.copy()

# calculate frequency f in seconds for each id
# and t0 as the midnight of the first day of the group
g = z.groupby('id')['time']
z['f'] = (g.transform('max') - g.transform('min')).astype(int) / (points - 1) // 10**9
z['t0'] = g.transform('min').dt.floor('d').astype(int) // 10**9

# calculate seconds since t0
# this is what .resample(...) operates on
z['s_since_t0'] = z['time'].astype(int) // 10**9 - z['t0']

# get grouped seconds since t0
# in the same way that .resample(...) does
z['s_group'] = z['t0'] + z['s_since_t0'] // z['f'] * z['f']

# convert grouped seconds to datetime
z['time_group'] = pd.to_datetime(z['s_group'], unit='s')

# calculate mean
z.groupby(['id', 'time_group'])[['lat', 'long']].mean().reset_index()

输出:

   id          time_group        lat       long
0   1 2020-11-01 21:47:15  66.027694  12.627349
1   1 2020-11-01 21:49:00  66.027867  12.632835
2   1 2020-11-01 21:50:45  66.028159  12.642057
3   1 2020-11-01 21:52:30  66.028317  12.643849
4   1 2020-11-01 21:54:15  66.027767  12.643016
5   2 2020-11-01 23:28:00  66.032850  12.638332
6   2 2020-11-01 23:32:00  66.040987  12.636448
7   2 2020-11-01 23:36:00  66.051477  12.642464
8   2 2020-11-01 23:40:00  66.061638  12.653078
9   2 2020-11-01 23:44:00  66.069177  12.660804

在 10k 数据集上,此版本比原始版本快约 400 倍:

%%timeit
original()

3.72 s ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
proposed()

8.83 ms ± 43.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

推荐阅读