How to extract data from df1 based on datetime data in df2 in Python

Problem description

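I am working with 2 datasets: df1 has all the time log data for several years; df2 has data on incidents that occurred at specific points in time. For each incident, I need to extract from df1 the one-week period of data that precedes the IncidentTime in df2 (so IncidentTime is the end of a specific 7-day period in df1), matched on Group_Id.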

Note: A Group_Id may have more than 1 incident (i.e. df2 may contain multiple rows with the same Group_Id).

Here are the datasets:

df1

      Timestamp           Group_Id      Data
2013-10-20 00:00:05.143    11           14
2013-10-21 00:05:10.377    11           15
2013-10-22 14:22:15.501    11           19
                   ...
2016-03-05 00:00:05.743    101          21
2017-12-24 00:00:10.407    101          33
                   ...

df2

IncidentTime       Group_Id
27/10/13 16:08      11
03/12/16 16:11      2
24/10/14 12:08      11
04/07/17 08:00      100
03/04/13 14:10      26
15/11/18 17:00      46
11/02/19 00:20      101

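Then create a new column IncidentTime in the final data (its value is the same for every row extracted for a given incident of a Group_Id). For example, Group_Id = 11 has an IncidentTime of 27/10/13 16:08, so we take all rows from df1 where Group_Id = 11 and Timestamp is between 20/10/13 16:08 and 27/10/13 16:08. This way the final data, df3, holds all the 7-day data corresponding to each IncidentTime and Group_Id: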

df3

IncidentTime       Group_Id       Timestamp             Data
27/10/13 16:08      11        2013-10-20 19:10:05.143    14
27/10/13 16:08      11        2013-10-21 00:05:10.377    15
27/10/13 16:08      11        2013-10-22 14:22:15.501    19
                                  ...
27/10/13 16:08      11        2013-10-27 05:22:15.501    20
                                  ...

Tags: python, pandas

Solution


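You can first convert the columns to datetimes, then create an IncidentTime_start column by subtracting 7 days, and combine the two frames with DataFrame.merge using an outer join: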

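import pandas as pd

# Parse both time columns; IncidentTime is day-first (dd/mm/yy).
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df2['IncidentTime'] = pd.to_datetime(df2['IncidentTime'], dayfirst=True)
# Start of the 7-day window that ends at each incident.
df2['IncidentTime_start'] = df2['IncidentTime'] - pd.offsets.DateOffset(days=7)

# Pair every incident with all log rows of the same Group_Id.
df = df2.merge(df1, on='Group_Id', how='outer')

print(df)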
         IncidentTime  Group_Id  IncidentTime_start               Timestamp  \
0 2013-10-27 16:08:00        11 2013-10-20 16:08:00 2013-10-20 00:00:05.143   
1 2013-10-27 16:08:00        11 2013-10-20 16:08:00 2013-10-21 00:05:10.377   
2 2013-10-27 16:08:00        11 2013-10-20 16:08:00 2013-10-22 14:22:15.501   
3 2016-12-03 16:11:00         2 2016-11-26 16:11:00                     NaT   
4 2014-10-24 12:08:00        40 2014-10-17 12:08:00                     NaT   
5 2017-07-04 08:00:00       100 2017-06-27 08:00:00                     NaT   
6 2013-04-03 14:10:00        26 2013-03-27 14:10:00                     NaT   
7 2018-11-15 17:00:00        46 2018-11-08 17:00:00                     NaT   
8 2019-02-11 00:20:00       101 2019-02-04 00:20:00 2016-03-05 00:00:05.743   
9 2019-02-11 00:20:00       101 2019-02-04 00:20:00 2017-12-24 00:00:10.407   

    Data  
0   14.0  
1   15.0  
2   19.0  
3    NaN  
4    NaN  
5    NaN  
6    NaN  
7    NaN  
8   21.0  
9  331.0  
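
As a side note on the parsing step above: the IncidentTime strings in df2 are day-first with two-digit years, which is why dayfirst=True is passed. A minimal sketch of an alternative, assuming every value follows the dd/mm/yy HH:MM pattern shown in the question, is to spell out the format so that non-matching values raise an error instead of being guessed. The names `incidents`, `parsed` and `window_start` are only illustrative.

```python
import pandas as pd

# Illustrative sample: two IncidentTime strings copied from df2 in the question.
incidents = pd.Series(['27/10/13 16:08', '03/12/16 16:11'])

# Explicit format: day/month/two-digit year, then hour:minute.
parsed = pd.to_datetime(incidents, format='%d/%m/%y %H:%M')

# Start of the 7-day window that ends at each incident.
window_start = parsed - pd.Timedelta(days=7)

print(parsed.tolist())
print(window_start.tolist())
```

With this format, '03/12/16 16:11' is read as 3 December 2016, matching the merged output above.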

Then filter with Series.between and boolean indexing:

# Keep only rows whose Timestamp lies inside [IncidentTime_start, IncidentTime].
mask = df['Timestamp'].between(df['IncidentTime_start'], df['IncidentTime'])
df = df[mask].drop('IncidentTime_start', axis=1)
print(df)
         IncidentTime  Group_Id               Timestamp  Data
1 2013-10-27 16:08:00        11 2013-10-21 00:05:10.377  15.0
2 2013-10-27 16:08:00        11 2013-10-22 14:22:15.501  19.0
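
The two steps above can be put together as one self-contained sketch. It rebuilds a small part of the question's sample data and wraps the merge and filter in a helper; the function name `extract_week_before` and its `days` parameter are hypothetical, not part of pandas. It uses an inner join rather than the outer join shown above, on the assumption that only matched rows are wanted: rows without a matching Timestamp would be removed by the between filter in any case, because comparisons against NaT are False.

```python
import pandas as pd

def extract_week_before(df1, df2, days=7):
    """Return the df1 rows that fall in the `days`-day window ending at each incident."""
    # Pair every incident with all log rows of the same Group_Id.
    merged = df2.merge(df1, on='Group_Id', how='inner')
    # Start of the window for each paired row.
    start = merged['IncidentTime'] - pd.Timedelta(days=days)
    # between() is inclusive on both ends by default.
    mask = merged['Timestamp'].between(start, merged['IncidentTime'])
    return merged.loc[mask].reset_index(drop=True)

# Sample rows taken from the question.
df1 = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2013-10-20 00:00:05.143',
                                 '2013-10-21 00:05:10.377',
                                 '2013-10-22 14:22:15.501',
                                 '2016-03-05 00:00:05.743',
                                 '2017-12-24 00:00:10.407']),
    'Group_Id': [11, 11, 11, 101, 101],
    'Data': [14, 15, 19, 21, 33],
})
df2 = pd.DataFrame({
    'IncidentTime': pd.to_datetime(['27/10/13 16:08', '24/10/14 12:08', '11/02/19 00:20'],
                                   format='%d/%m/%y %H:%M'),
    'Group_Id': [11, 11, 101],
})

df3 = extract_week_before(df1, df2)
print(df3)
```

Because df2 is merged on Group_Id alone, a group with several incidents is handled naturally: each incident gets its own copy of the group's log rows before filtering. The trade-off is that the intermediate frame holds every incident paired with every log row of its group, which may become large with several years of logs.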
