python - Pandas: resampling n points per group on a large dataframe
Question
Suppose I have the following dataframe:
id time lat long
0 1 2020-11-01 21:48:00 66.027694 12.627349
1 1 2020-11-01 21:49:00 66.027833 12.630198
2 1 2020-11-01 21:50:00 66.027900 12.635473
3 1 2020-11-01 21:51:00 66.027967 12.640748
4 1 2020-11-01 21:52:00 66.028350 12.643367
5 1 2020-11-01 21:53:00 66.028450 12.643948
6 1 2020-11-01 21:54:00 66.028183 12.643750
7 1 2020-11-01 21:55:00 66.027767 12.643016
8 2 2020-11-01 23:30:00 66.031667 12.639148
9 2 2020-11-01 23:31:00 66.034033 12.637517
10 2 2020-11-01 23:32:00 66.036950 12.636683
11 2 2020-11-01 23:33:00 66.039742 12.636417
12 2 2020-11-01 23:34:00 66.042533 12.636150
13 2 2020-11-01 23:35:00 66.044725 12.636541
14 2 2020-11-01 23:36:00 66.046867 12.637715
15 2 2020-11-01 23:37:00 66.050550 12.641467
16 2 2020-11-01 23:38:00 66.053014 12.644047
17 2 2020-11-01 23:39:00 66.055478 12.646627
18 2 2020-11-01 23:40:00 66.057942 12.649207
19 2 2020-11-01 23:41:00 66.060406 12.651788
20 2 2020-11-01 23:42:00 66.062869 12.654368
21 2 2020-11-01 23:43:00 66.065333 12.656948
22 2 2020-11-01 23:44:00 66.067255 12.658876
23 2 2020-11-01 23:45:00 66.069177 12.660804
24 2 2020-11-01 23:46:00 66.071098 12.662732
I want to resample each group (grouped by its id) so that I get 5 equally spaced points (in time) per group.
Given the example above, the result should look like this:
id time lat long
0 1 2020-11-01 21:47:15 66.027694 12.627349
1 1 2020-11-01 21:49:00 66.027867 12.632836
2 1 2020-11-01 21:50:45 66.028158 12.642057
3 1 2020-11-01 21:52:30 66.028317 12.643849
4 1 2020-11-01 21:54:15 66.027767 12.643016
5 2 2020-11-01 23:28:00 66.032850 12.638333
6 2 2020-11-01 23:32:00 66.040987 12.636448
7 2 2020-11-01 23:36:00 66.051477 12.642464
8 2 2020-11-01 23:40:00 66.061638 12.653078
9 2 2020-11-01 23:44:00 66.069177 12.660804
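The timestamps in the expected output are not arbitrary: each group's step is (last - first) / (points - 1), and resample anchors its bins to the start of the day by default, which is why group 1 begins at 21:47:15 rather than 21:48:00. A quick check of the step for id 1 (a sketch; values taken from the input table above):

```python
import pandas as pd

# First and last timestamp of group id == 1, from the input table
first = pd.Timestamp('2020-11-01 21:48:00')
last = pd.Timestamp('2020-11-01 21:55:00')

# 5 points span 4 equal intervals
step = (last - first) / (5 - 1)
print(step)  # 0 days 00:01:45, i.e. 105 seconds
```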
I have already solved this and get the desired result, but it is slow, because instead of 25 rows I have more than 10 million.
Is there a better solution than mine?
My code is:
import pandas as pd

# Define the number of points per group
points = 5
# route is the input dataframe (see the first table above)
groups = route.groupby('id')
# 'times' holds the first and last timestamp of each group
times = groups['time'].agg(['first', 'last']).reset_index()
# Time step that yields 5 equally spaced points
times['diff'] = (times['last'] - times['first']) / (points - 1)
times_groups = times.groupby('id')
# Collect the resampled points of each group
waypoints = []
for (name, group), (time_name, time_group) in zip(groups, times_groups):
    # Time step as a frequency string in seconds (not the best solution)
    str_time = "{0}s".format(int(time_group['diff'].iloc[0].total_seconds()))
    # Resample, average and interpolate this group
    waypoints.append(
        group.set_index('time').groupby(
            'id'
        ).resample(
            str_time
        ).mean().interpolate('linear').drop('id', axis=1).reset_index()
    )
# Concatenate back into one dataframe (see the last table above)
pd_waypoints = pd.concat(waypoints).reset_index()
Solution
Here is one way to speed this up. The idea is to replicate what resample does (essentially a groupby on truncated timestamps), but with a different frequency per id, instead of iterating over the groups one by one (except for computing the frequency):
import pandas as pd

# make a copy of the route dataframe to work on
z = route.copy()
# calculate frequency f in seconds for each id
# and t0 as the midnight of the first day of the group
g = z.groupby('id')['time']
z['f'] = (g.transform('max') - g.transform('min')).astype(int) / (points - 1) // 10**9
z['t0'] = g.transform('min').dt.floor('d').astype(int) // 10**9
# calculate seconds since t0
# this is what .resample(...) operates on
z['s_since_t0'] = z['time'].astype(int) // 10**9 - z['t0']
# bucket the seconds the same way .resample(...) would:
# floor each offset to a multiple of f, then add t0 back
z['s_group'] = z['t0'] + z['s_since_t0'] // z['f'] * z['f']
# convert grouped seconds to datetime
z['time_group'] = pd.to_datetime(z['s_group'], unit='s')
# calculate mean
z.groupby(['id', 'time_group'])[['lat', 'long']].mean().reset_index()
Output:
id time_group lat long
0 1 2020-11-01 21:47:15 66.027694 12.627349
1 1 2020-11-01 21:49:00 66.027867 12.632835
2 1 2020-11-01 21:50:45 66.028159 12.642057
3 1 2020-11-01 21:52:30 66.028317 12.643849
4 1 2020-11-01 21:54:15 66.027767 12.643016
5 2 2020-11-01 23:28:00 66.032850 12.638332
6 2 2020-11-01 23:32:00 66.040987 12.636448
7 2 2020-11-01 23:36:00 66.051477 12.642464
8 2 2020-11-01 23:40:00 66.061638 12.653078
9 2 2020-11-01 23:44:00 66.069177 12.660804
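The same pipeline can be exercised end to end on the id == 1 rows alone. The sketch below reproduces the first half of the output table, using Timedelta arithmetic in place of the raw astype(int) casts (an equivalent formulation; newer pandas versions are stricter about integer casts on datetimes):

```python
import pandas as pd

points = 5

# id == 1 rows, copied from the question's input table
route = pd.DataFrame({
    'id': [1] * 8,
    'time': pd.to_datetime(['2020-11-01 21:%02d:00' % m for m in range(48, 56)]),
    'lat': [66.027694, 66.027833, 66.027900, 66.027967,
            66.028350, 66.028450, 66.028183, 66.027767],
    'long': [12.627349, 12.630198, 12.635473, 12.640748,
             12.643367, 12.643948, 12.643750, 12.643016],
})

z = route.copy()
g = z.groupby('id')['time']
# per-id bucket width in whole seconds (here: 420 s span / 4 = 105 s)
z['f'] = (g.transform('max') - g.transform('min')) // pd.Timedelta('1s') // (points - 1)
# midnight of each group's first day, the anchor that resample would use
z['t0'] = g.transform('min').dt.floor('d')
# seconds elapsed since the anchor
z['s_since_t0'] = (z['time'] - z['t0']) // pd.Timedelta('1s')
# floor each offset to a multiple of f to form the bucket start
z['s_group'] = z['s_since_t0'] // z['f'] * z['f']
z['time_group'] = z['t0'] + pd.to_timedelta(z['s_group'], unit='s')

out = z.groupby(['id', 'time_group'])[['lat', 'long']].mean().reset_index()
print(out)  # 5 rows, first bucket at 2020-11-01 21:47:15
```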
On a 10k-row dataset, this version is roughly 400x faster than the original:
%%timeit
original()
3.72 s ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
proposed()
8.83 ms ± 43.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
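One edge case worth noting (an observation about the approach, not part of the original answer): a group whose timestamps are all identical gets f == 0, and the floor division by z['f'] then silently corrupts that group's buckets. A minimal guard is to clamp the frequency to at least one second before bucketing:

```python
import pandas as pd

points = 5
# id 2 has a single row, so its time span (and thus its frequency) is zero
z = pd.DataFrame({
    'id': [1, 1, 2],
    'time': pd.to_datetime(['2020-11-01 21:48:00',
                            '2020-11-01 21:50:00',
                            '2020-11-01 23:30:00']),
})

g = z.groupby('id')['time']
z['f'] = (g.transform('max') - g.transform('min')) // pd.Timedelta('1s') // (points - 1)
# clamp zero-width groups to a 1-second bucket so the division stays valid
z['f'] = z['f'].clip(lower=1)
```

With millions of rows, such degenerate groups are likely to occur at least once, so the guard is cheap insurance.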