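python - Fast and efficient way to compute time differences between groups of rows in pandas?
Problem description
Suppose I have this table in a DataFrame, containing the dates on which several cars were refilled: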
+-------+-------------+
| carId | refill_date |
+-------+-------------+
| 1 | 2020-03-01 |
+-------+-------------+
| 1 | 2020-03-12 |
+-------+-------------+
| 1 | 2020-04-04 |
+-------+-------------+
| 2 | 2020-03-07 |
+-------+-------------+
| 2 | 2020-03-26 |
+-------+-------------+
| 2 | 2020-04-01 |
+-------+-------------+
I want to add a third column, time_elapsed, holding the duration (in days) between each refill and the previous one for the same car.
+-------+-------------+--------------+
| carId | refill_date | time_elapsed |
+-------+-------------+--------------+
| 1 | 2020-03-01 | |
+-------+-------------+--------------+
| 1 | 2020-03-12 | 11 |
+-------+-------------+--------------+
| 1 | 2020-04-04 | 23 |
+-------+-------------+--------------+
| 2 | 2020-03-07 | |
+-------+-------------+--------------+
| 2 | 2020-03-26 | 19 |
+-------+-------------+--------------+
| 2 | 2020-04-01 | 6 |
+-------+-------------+--------------+
So this is what I did:
import pandas as pd

data = [
    {
        "carId": 1,
        "refill_date": "2020-3-1"
    },
    {
        "carId": 1,
        "refill_date": "2020-3-12"
    },
    {
        "carId": 1,
        "refill_date": "2020-4-4"
    },
    {
        "carId": 2,
        "refill_date": "2020-3-7"
    },
    {
        "carId": 2,
        "refill_date": "2020-3-26"
    },
    {
        "carId": 2,
        "refill_date": "2020-4-1"
    }
]
df = pd.DataFrame(data)
# Parse the date strings into datetime64 values so they can be subtracted
df['refill_date'] = pd.to_datetime(df['refill_date'])
# For each car, select its rows and diff consecutive refill dates
for c in df['carId'].unique():
    df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c,
                                                      'refill_date'].diff()
This returns the expected result:
+---+-------+-------------+--------------+
| | carId | refill_date | time_elapsed |
+---+-------+-------------+--------------+
| 0 | 1 | 2020-03-01 | NaT |
+---+-------+-------------+--------------+
| 1 | 1 | 2020-03-12 | 11 days |
+---+-------+-------------+--------------+
| 2 | 1 | 2020-04-04 | 23 days |
+---+-------+-------------+--------------+
| 3 | 2 | 2020-03-07 | NaT |
+---+-------+-------------+--------------+
| 4 | 2 | 2020-03-26 | 19 days |
+---+-------+-------------+--------------+
| 5 | 2 | 2020-04-01 | 6 days |
+---+-------+-------------+--------------+
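So everything looks fine, but there is one catch: in my real case the DataFrame has 3.5 million rows, and processing takes a very long time, even though it is a purely numeric, in-memory computation that loops over "only" 1,711 groups.
Is there another, faster way?
Thanks!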
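Solution
Using the native pandas method df.groupby should give a significant performance improvement over a "native Python" loop. The loop builds a boolean mask over the entire DataFrame twice for every car, so its cost grows with the number of groups times the number of rows, whereas groupby splits the data into groups once: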
df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()
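Two details are worth spelling out. diff() subtracts the previous row within each group, so the rows need to be in date order per car, and the table requested in the question shows plain day counts rather than Timedelta values. A minimal sketch on the question's sample data, assuming the frame might not already be sorted; the days_elapsed column name is hypothetical and not part of the original post:

```python
import pandas as pd

df = pd.DataFrame({
    "carId": [1, 1, 1, 2, 2, 2],
    "refill_date": ["2020-3-1", "2020-3-12", "2020-4-4",
                    "2020-3-7", "2020-3-26", "2020-4-1"],
})
df["refill_date"] = pd.to_datetime(df["refill_date"])

# diff() works in row order, so sort by date within each car first
df = df.sort_values(["carId", "refill_date"]).reset_index(drop=True)

# Timedelta between each refill and the previous one for the same car
df["time_elapsed"] = df.groupby("carId")["refill_date"].diff()

# Hypothetical extra column: day counts as numbers (NaN for a car's first refill)
df["days_elapsed"] = df["time_elapsed"].dt.days

print(df)
```

The first refill of each car has no predecessor, so it stays NaT in time_elapsed and NaN in days_elapsed.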
Here is a small benchmark (on my laptop, YMMV...), using 99 cars (carId 1 through 99) with 31 days each, which shows a performance gain of nearly 10x:
import pandas as pd
import timeit

# 99 cars (range(1, 100)), one refill per day for the 31 days of March 2020
data = [{"carId": carId, "refill_date": "2020-3-" + str(day)}
        for carId in range(1, 100) for day in range(1, 32)]
df = pd.DataFrame(data)
df['refill_date'] = pd.to_datetime(df['refill_date'])

def original_method():
    # Loop from the question: one boolean mask per car
    for c in df['carId'].unique():
        df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c,
                                                          'refill_date'].diff()

def using_groupby():
    # Single grouped diff over the whole frame
    df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()

# Total time in seconds for 100 runs of each method
time1 = timeit.timeit('original_method()', globals=globals(), number=100)
time2 = timeit.timeit('using_groupby()', globals=globals(), number=100)
print(time1)
print(time2)
print(time1 / time2)  # speed-up factor
Output:
16.6183732
1.7910263000000022
9.278687420726307
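If the rows are already sorted by carId and then by refill_date, the same values can also be produced without any grouping: take a plain diff() over the whole column and blank out the rows where the car changes. This is a different technique from the groupby answer above, sketched here only as an option to time on your own data. It has not been benchmarked in this post, and the variable names are hypothetical:

```python
import pandas as pd

# Same synthetic data as the benchmark: 99 cars, 31 daily refills each
data = [{"carId": car_id, "refill_date": "2020-3-" + str(day)}
        for car_id in range(1, 100) for day in range(1, 32)]
df = pd.DataFrame(data)
df["refill_date"] = pd.to_datetime(df["refill_date"])

# Assumption: rows must be ordered by car, then by date
df = df.sort_values(["carId", "refill_date"]).reset_index(drop=True)

# True where the previous row belongs to the same car
same_car = df["carId"].eq(df["carId"].shift())

# Plain diff over the whole column; rows that start a new car become NaT
masked_diff = df["refill_date"].diff().where(same_car)

# Reference result from the groupby approach, for comparison
grouped_diff = df.groupby("carId")["refill_date"].diff()

pd.testing.assert_series_equal(masked_diff, grouped_diff, check_names=False)
```

The final assertion only confirms that both approaches agree on this data; whether the masked version is actually faster depends on the data and the pandas version.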