Fast and efficient way to compute time differences between groups of rows in pandas?

Problem description

Suppose I have this table in a DataFrame, holding the dates on which several cars were refilled:

+-------+-------------+
| carId | refill_date |
+-------+-------------+
|     1 |  2020-03-01 |
+-------+-------------+
|     1 |  2020-03-12 |
+-------+-------------+
|     1 |  2020-04-04 |
+-------+-------------+
|     2 |  2020-03-07 |
+-------+-------------+
|     2 |  2020-03-26 |
+-------+-------------+
|     2 |  2020-04-01 |
+-------+-------------+

I want to add a third column, time_elapsed, holding the time elapsed between consecutive refills of the same car.

+-------+-------------+--------------+
| carId | refill_date | time_elapsed |
+-------+-------------+--------------+
|     1 |  2020-03-01 |              |
+-------+-------------+--------------+
|     1 |  2020-03-12 |           11 |
+-------+-------------+--------------+
|     1 |  2020-04-04 |           23 |
+-------+-------------+--------------+
|     2 |  2020-03-07 |              |
+-------+-------------+--------------+
|     2 |  2020-03-26 |           19 |
+-------+-------------+--------------+
|     2 |  2020-04-01 |            6 |
+-------+-------------+--------------+

So this is what I did:

import pandas as pd

data = [
    {
        "carId": 1,
        "refill_date": "2020-3-1"
    },
    {
        "carId": 1,
        "refill_date": "2020-3-12"
    },
    {
        "carId": 1,
        "refill_date": "2020-4-4"
    },
    {
        "carId": 2,
        "refill_date": "2020-3-7"
    },
    {
        "carId": 2,
        "refill_date": "2020-3-26"
    },
    {
        "carId": 2,
        "refill_date": "2020-4-1"
    }
]

df = pd.DataFrame(data)

df['refill_date'] = pd.to_datetime(df['refill_date'])

for c in df['carId'].unique():
    df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c,
                                                      'refill_date'].diff()

This returns the expected result:

+---+-------+-------------+--------------+
|   | carId | refill_date | time_elapsed |
+---+-------+-------------+--------------+
| 0 |     1 |  2020-03-01 |          NaT |
+---+-------+-------------+--------------+
| 1 |     1 |  2020-03-12 |      11 days |
+---+-------+-------------+--------------+
| 2 |     1 |  2020-04-04 |      23 days |
+---+-------+-------------+--------------+
| 3 |     2 |  2020-03-07 |          NaT |
+---+-------+-------------+--------------+
| 4 |     2 |  2020-03-26 |      19 days |
+---+-------+-------------+--------------+
| 5 |     2 |  2020-04-01 |       6 days |
+---+-------+-------------+--------------+

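So everything looks fine, but there is one catch: in my real case the DataFrame has 3.5 million rows, and processing takes a very long time, even though it is a purely numeric in-memory computation with "only" 1711 groups to loop over.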

Is there a faster way to do this?

Thanks!

Tags: python, pandas, dataframe

Solution


Using native pandas methods on a df.groupby should give a significant performance improvement over the "native Python" loop:

df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()
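The one-liner above can be tried on the sample data from the question. The sketch below is an illustration rather than part of the original answer, and it makes two assumptions: that the rows may arrive unsorted (so they are sorted by date within each car before diffing), and that plain day counts are wanted, as in the question's target table. The `days_elapsed` column name is made up for this example.

```python
import pandas as pd

# Sample data from the question, deliberately shuffled.
df = pd.DataFrame({
    "carId": [2, 1, 1, 2, 1, 2],
    "refill_date": ["2020-03-26", "2020-03-01", "2020-04-04",
                    "2020-03-07", "2020-03-12", "2020-04-01"],
})
df["refill_date"] = pd.to_datetime(df["refill_date"])

# diff() compares each row with the previous row of the same group,
# so order the rows by date within each car first.
df = df.sort_values(["carId", "refill_date"]).reset_index(drop=True)

# Timedelta between consecutive refills of the same car
# (NaT for the first refill of each car).
df["time_elapsed"] = df.groupby("carId")["refill_date"].diff()

# Whole days as numbers (hypothetical column name);
# the first refill of each car becomes NaN.
df["days_elapsed"] = df["time_elapsed"].dt.days

print(df)
```

If the data is already sorted by date within each car, as in the question, the sort step can be dropped.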

Here is a small benchmark (on my laptop, YMMV...) using 99 cars (carId 1 to 99) with 31 days each, which shows a nearly 10x speedup:

import pandas as pd
import timeit

data = [{"carId": carId, "refill_date": "2020-3-"+str(day)} for carId in range(1,100) for day in range(1,32)]
df = pd.DataFrame(data)
df['refill_date'] = pd.to_datetime(df['refill_date'])

def original_method():
    for c in df['carId'].unique():
        df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c,
                                                          'refill_date'].diff()

def using_groupby():
    df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()

time1 = timeit.timeit('original_method()', globals=globals(), number=100)
time2 = timeit.timeit('using_groupby()', globals=globals(), number=100)

print(time1)
print(time2)
print(time1/time2)

Output (loop time in seconds, groupby time in seconds, and their ratio):

16.6183732
1.7910263000000022
9.278687420726307
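The benchmark measures speed only. Before replacing the loop it is worth confirming that both approaches produce the same values. The following is a minimal sketch of such a check on the question's sample data; `as_days` is a hypothetical helper introduced here, and it uses -1 as a stand-in for the missing first value of each car so that the results can be compared as plain lists.

```python
import pandas as pd

df = pd.DataFrame({
    "carId": [1, 1, 1, 2, 2, 2],
    "refill_date": pd.to_datetime([
        "2020-03-01", "2020-03-12", "2020-04-04",
        "2020-03-07", "2020-03-26", "2020-04-01",
    ]),
})

# Loop from the question: one boolean mask over all rows per car.
loop_df = df.copy()
for c in loop_df["carId"].unique():
    mask = loop_df["carId"] == c
    loop_df.loc[mask, "time_elapsed"] = loop_df.loc[mask, "refill_date"].diff()

# Single grouped diff from the answer.
groupby_df = df.copy()
groupby_df["time_elapsed"] = groupby_df.groupby("carId")["refill_date"].diff()


def as_days(elapsed):
    """Hypothetical helper: whole days as ints, -1 where the value is missing."""
    return pd.to_timedelta(elapsed).dt.days.fillna(-1).astype(int).tolist()


loop_days = as_days(loop_df["time_elapsed"])
groupby_days = as_days(groupby_df["time_elapsed"])

print(loop_days == groupby_days)
```

The loop builds a boolean mask over every row for each car, so its cost grows with the number of groups times the number of rows, while the grouped diff is a single call over the whole column.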
