How to filter a dataframe's timestamps in pandas when there are duplicate timestamps

Problem description

I have a df that looks like this, where each row is an observation at 10-second intervals.

timestamp, record, precip
5/14/2020 12:01,1,0.002
5/14/2020 12:01,2,0.002
5/14/2020 12:01,3,0.002
5/14/2020 12:01,4,0.002
5/14/2020 12:01,5,0.002
5/14/2020 12:01,6,0.002
5/14/2020 12:02,7,0.002
5/14/2020 12:02,8,0.002
5/14/2020 12:02,9,0.002
5/14/2020 12:02,10,0.002
5/14/2020 12:02,11,0.002
5/14/2020 12:02,12,0.002

I had expected the file to also stamp the seconds, but it seems that's something I'll have to sort out next time I'm in the field. In the meantime, what's the best way to filter the observations down to the quarter-hour marks (12:00, 12:15, 12:30, 12:45)? My first thought is to use the "record" column and take the lowest record value at each quarter-hour mark.

Using the df above as an example, if I wanted to filter down to 12:02, I would use the row

5/14/2020 12:02,7,0.002

because "7" is the lowest record value of all the rows stamped 12:02.
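That idea can be sketched by flooring each timestamp to its interval and keeping the row with the lowest record value in each bin (a sketch against the sample data above; the helper column name "bin" is an illustration, not from the question). Flooring to "min" reproduces the per-minute example; swap in "15min" for the quarter-hour marks:

```python
import pandas as pd
from io import StringIO

data = """timestamp,record,precip
5/14/2020 12:01,1,0.002
5/14/2020 12:01,2,0.002
5/14/2020 12:01,3,0.002
5/14/2020 12:01,4,0.002
5/14/2020 12:01,5,0.002
5/14/2020 12:01,6,0.002
5/14/2020 12:02,7,0.002
5/14/2020 12:02,8,0.002
5/14/2020 12:02,9,0.002
5/14/2020 12:02,10,0.002
5/14/2020 12:02,11,0.002
5/14/2020 12:02,12,0.002
"""

df = pd.read_csv(StringIO(data), parse_dates=["timestamp"])

# Floor each timestamp to its bin ("min" here; use "15min" for
# quarter-hour marks), then keep the row with the lowest record
# value in each bin.
df["bin"] = df["timestamp"].dt.floor("min")
result = df.loc[df.groupby("bin")["record"].idxmin()].drop(columns="bin")
print(result)
```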

The desired df would look like this, but at 15-minute intervals instead of 1-minute intervals.

5/14/2020 12:01,1,0.002
5/14/2020 12:02,7,0.002
5/14/2020 12:03,13,0.002
5/14/2020 12:04,19,0.002
5/14/2020 12:05,25,0.002
5/14/2020 12:06,31,0.002
5/14/2020 12:07,37,0.002

EDIT: It turns out the way I was reading in the file was my problem. I didn't realize I could read a .dat file the same way as a .csv file. If I read the file in as a .dat file, it keeps the HH:MM:SS format for the timestamps. Since that's the case, I used

df.resample('900s', on='Date Time').first()

and it gave me the df perfectly:

                              Date Time  record  precip
2020-05-14 11:15:00 2020-05-14 11:20:10       1   0.002
2020-05-14 11:30:00 2020-05-14 11:30:00      60             0.002
2020-05-14 11:45:00 2020-05-14 11:45:00     150             0.002
2020-05-14 12:00:00 2020-05-14 12:00:00     240             0.002
2020-05-14 12:15:00 2020-05-14 12:15:00     330             0.002
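The resample step from the edit can be sketched end to end (a sketch with inlined sample data; the 'Date Time' column name follows the edit above, and a .dat file with this layout could be read the same way with read_csv):

```python
import pandas as pd
from io import StringIO

raw = """Date Time,record,precip
2020-05-14 11:20:10,1,0.002
2020-05-14 11:20:20,2,0.002
2020-05-14 11:30:00,60,0.002
2020-05-14 11:45:00,150,0.002
"""

df = pd.read_csv(StringIO(raw), parse_dates=["Date Time"])

# Group into 15-minute (900-second) bins keyed on the 'Date Time'
# column and keep the first row that falls in each bin.
out = df.resample("900s", on="Date Time").first()
print(out)
```

'900s' and '15min' are equivalent frequency strings here; by default the bins are anchored at midnight, which is what produces the clean :00/:15/:30/:45 edges shown above.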

Tags: python, pandas, dataframe

Solution


IIUC, use drop_duplicates:

df.sort_values('record').drop_duplicates(subset='timestamp')

Output:

         timestamp  record  precip
0  5/14/2020 12:01       1   0.002
6  5/14/2020 12:02       7   0.002
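The same sort-then-deduplicate idea extends from exact timestamps to 15-minute bins by deduplicating on a floored copy of the timestamp (a sketch extending the answer above; the floor step and the helper column name "bin" are additions, not part of the original answer):

```python
import pandas as pd
from io import StringIO

data = """timestamp,record,precip
5/14/2020 12:01,1,0.002
5/14/2020 12:01,2,0.002
5/14/2020 12:02,7,0.002
5/14/2020 12:02,8,0.002
"""
df = pd.read_csv(StringIO(data), parse_dates=["timestamp"])

# Sorting by record first guarantees drop_duplicates keeps the
# lowest record value within each 15-minute bin.
dedup = (
    df.assign(bin=df["timestamp"].dt.floor("15min"))
      .sort_values("record")
      .drop_duplicates(subset="bin")
      .drop(columns="bin")
)
print(dedup)
```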
