Cleaning time-series data with Python

Problem description

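I have a raw dataset with a ship code (MMSI) and a timestamp. Because there is so much raw data, I now want to thin it out per ship code (MMSI) so that the time step between the remaining rows is about 10 minutes. For example, from this:

[image]

to this:

[image]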

I tried computing the time gap between rows and then using "for" and "if", but that seems complicated. I am new to coding. This is what I have so far:

df['diff'] = df.sort_values(['MMSI','TIME']).groupby('MMSI')['TIME'].diff()
df = df.dropna(subset=['diff'])
for i in ship_list:
    df2 = df.loc[df['MMSI'] == i]
    total = 0
    if total < 10:
        total = df['diff'].iloc()  # stuck here

Tags: python, for-loop, diff, data-cleaning, remove-if

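Solution

A possible approach: floor each timestamp to a 10-minute interval and keep one row per ship code and interval. Here freq='10T' means 10 minutes; newer pandas versions spell the same frequency '10min'.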

import numpy as np
import pandas as pd

# sample data: 20 timestamps, 83 seconds apart, given as integer nanoseconds since the epoch
df = pd.DataFrame({
    'shipcode': [1] * 5 + [2] * 5 + [1] * 5 + [2] * 5,
    'time': pd.to_datetime(1_010_000_000_000_000_000 + 83_000_000_000 * np.arange(20, dtype='int64'))})

# print(df)

# floor 'time' values with a step of 10 minutes ('10min' is the current spelling of the older '10T' alias)
df['floored_time'] = df['time'].dt.floor(freq='10min')

# group by ('shipcode', 'floored_time') and keep the earliest 'time' in each group
dfg = df.groupby(['shipcode', 'floored_time']).agg({'time': 'min'})

# construct resulting table
dfg['shipcode'] = dfg.index.get_level_values(0)
ans = dfg[['shipcode', 'time']].sort_values(by='time').reset_index(drop=True)

# print(ans)
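
The same idea can probably be written a little more compactly by sorting and then dropping repeated (ship code, 10-minute bin) pairs instead of grouping. The sketch below is only one way to do it; the function name thin_by_floor and its parameters are made up for illustration, and it assumes the time column already has a datetime dtype.

```python
import pandas as pd

def thin_by_floor(df, key="shipcode", time_col="time", freq="10min"):
    """Keep the earliest row in each (key, floored time) bin.

    Hypothetical helper for illustration; not part of pandas.
    """
    out = df.sort_values([key, time_col])
    bins = out[time_col].dt.floor(freq)
    # duplicated() marks every row after the first within each (key, bin) pair
    is_repeat = out.assign(_bin=bins).duplicated(subset=[key, "_bin"])
    return out[~is_repeat].sort_values(time_col).reset_index(drop=True)

# same sample data as above: 20 timestamps, 83 seconds apart
times = pd.to_datetime(
    [1_010_000_000_000_000_000 + 83_000_000_000 * i for i in range(20)]
)
df = pd.DataFrame({
    "shipcode": [1] * 5 + [2] * 5 + [1] * 5 + [2] * 5,
    "time": times,
})

ans = thin_by_floor(df)
print(ans)
```

Because whole rows are kept rather than aggregated, any additional columns in the real dataset (position, speed, and so on) would stay attached to the selected timestamps.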

Sample data:

    shipcode                time
0          1 2002-01-02 19:33:20
1          1 2002-01-02 19:34:43
2          1 2002-01-02 19:36:06
3          1 2002-01-02 19:37:29
4          1 2002-01-02 19:38:52
5          2 2002-01-02 19:40:15
6          2 2002-01-02 19:41:38
7          2 2002-01-02 19:43:01
8          2 2002-01-02 19:44:24
9          2 2002-01-02 19:45:47
10         1 2002-01-02 19:47:10
11         1 2002-01-02 19:48:33
12         1 2002-01-02 19:49:56
13         1 2002-01-02 19:51:19
14         1 2002-01-02 19:52:42
15         2 2002-01-02 19:54:05
16         2 2002-01-02 19:55:28
17         2 2002-01-02 19:56:51
18         2 2002-01-02 19:58:14
19         2 2002-01-02 19:59:37

Result:

   shipcode                time
0         1 2002-01-02 19:33:20
1         2 2002-01-02 19:40:15
2         1 2002-01-02 19:47:10
3         1 2002-01-02 19:51:19
4         2 2002-01-02 19:54:05
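
Note that binning by fixed 10-minute intervals does not guarantee a 10-minute gap between the rows that are kept: in the result above, ship 1 has rows at 19:47:10 and 19:51:19, which fall into different bins but are only about four minutes apart. If a minimum spacing is what is actually wanted, a greedy pass per ship is one option, and it is close to the "for" and "if" idea from the question. This is a sketch under assumptions: thin_min_gap is a hypothetical helper, the DataFrame index is assumed to be unique, and the plain Python loop may be slow on very large datasets.

```python
import pandas as pd

def thin_min_gap(df, key="shipcode", time_col="time", gap="10min"):
    """Keep a row only if it is at least `gap` after the last kept row
    of the same ship. Hypothetical helper; assumes a unique index."""
    gap = pd.Timedelta(gap)
    keep = []
    for _, grp in df.sort_values([key, time_col]).groupby(key):
        last = None  # timestamp of the last kept row for this ship
        for idx, t in grp[time_col].items():
            if last is None or t - last >= gap:
                keep.append(idx)
                last = t
    return df.loc[keep].sort_values(time_col).reset_index(drop=True)

# same sample data as above: 20 timestamps, 83 seconds apart
times = pd.to_datetime(
    [1_010_000_000_000_000_000 + 83_000_000_000 * i for i in range(20)]
)
df = pd.DataFrame({
    "shipcode": [1] * 5 + [2] * 5 + [1] * 5 + [2] * 5,
    "time": times,
})

thinned = thin_min_gap(df)
print(thinned)
```

On the sample data this keeps four rows instead of five, because the 19:51:19 row for ship 1 is less than 10 minutes after the 19:47:10 row that was already kept.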
