Find the last occurrence in consecutive (time) rows of a DataFrame that match a query

Question

Given the following DataFrame:

              Date_time Probe_loc  Temp  Humidity
0   2021-04-08 12:30:00  KY_North    65        40
1   2021-04-08 12:31:00  KY_North    65        40
2   2021-04-08 12:32:00  KY_North    65        40
3   2021-04-08 12:33:00  KY_North    65        40
4   2021-04-08 12:30:00  KY_South    65        41
5   2021-04-08 12:31:00  KY_South    66        39
6   2021-04-08 12:32:00  KY_South    66        39
7   2021-04-08 12:33:00  KY_South    66        39
8   2021-04-08 12:34:00  KY_South    68        40
9   2021-04-08 12:40:00  KY_South    65        41
10  2021-04-08 12:41:00  KY_South    66        39
11  2021-04-08 12:42:00  KY_South    66        39
12  2021-04-08 12:43:00  KY_South    66        39
13  2021-04-08 12:44:00  KY_South    68        40
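To reproduce the examples, the table above can be rebuilt as follows. This is only a sketch: it assumes `Date_time` should be a real datetime column (needed for any time-difference logic), and the helper lists `north` and `south` are names made up here for brevity.

```python
import pandas as pd

# Minutes past 12:00 for each probe's readings (hypothetical helper lists)
north = [30, 31, 32, 33]
south = [30, 31, 32, 33, 34, 40, 41, 42, 43, 44]

probe_df = pd.DataFrame({
    # Parse the timestamps so the column has a datetime64 dtype
    "Date_time": pd.to_datetime([f"2021-04-08 12:{m}:00" for m in north + south]),
    "Probe_loc": ["KY_North"] * 4 + ["KY_South"] * 10,
    "Temp": [65, 65, 65, 65, 65, 66, 66, 66, 68, 65, 66, 66, 66, 68],
    "Humidity": [40, 40, 40, 40, 41, 39, 39, 39, 40, 41, 39, 39, 39, 40],
})
print(probe_df)
```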

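I can easily query the DataFrame to return the rows that meet certain conditions.

For example, to see all entries where the temperature is below 67 and the humidity is <= 40, I can do this: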

qr = probe_df[(probe_df['Temp'] <67) & (probe_df['Humidity'] <=40)][['Date_time','Probe_loc', 'Temp', 'Humidity']]
print(qr)

I get the following:

Date_time           Probe_loc Temp Humidity
2021-04-08 12:30:00  KY_North  65 40  
2021-04-08 12:31:00  KY_North  65 40  
2021-04-08 12:32:00  KY_North  65 40
2021-04-08 12:33:00  KY_North  65 40
2021-04-08 12:31:00  KY_South  66 39
2021-04-08 12:32:00  KY_South  66 39
2021-04-08 12:33:00  KY_South  66 39
2021-04-08 12:41:00  KY_South  66 39
2021-04-08 12:42:00  KY_South  66 39
2021-04-08 12:43:00  KY_South  66 39

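What I would like to know is how to select, for each probe, only the last matching row from each run of consecutive time entries. Here that means the entries at 12:33:00 and 12:43:00.

The expected output is: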

Date_time           Probe_loc Temp Humidity
2021-04-08 12:33:00  KY_North  65 40
2021-04-08 12:33:00  KY_South  66 39
2021-04-08 12:43:00  KY_South  66 39

Tags: python, pandas

Solution

Input data:

>>> probe_df
             Date_time Probe_loc  Temp  Humidity
0  2021-04-08 12:30:00  KY_North    65        40  # group 1
1  2021-04-08 12:31:00  KY_North    65        40  # group 1
2  2021-04-08 12:32:00  KY_North    65        40  # group 1
3  2021-04-08 12:33:00  KY_North    65        40  # group 1 <- match
4  2021-04-08 12:33:00  KY_North    67        39  # group 1

5  2021-04-08 12:30:00  KY_South    65        41  # group 2
6  2021-04-08 12:31:00  KY_South    66        39  # group 2
7  2021-04-08 12:32:00  KY_South    66        39  # group 2
8  2021-04-08 12:33:00  KY_South    66        39  # group 2 <- match
9  2021-04-08 12:33:00  KY_South    68        40  # group 2

10 2021-04-08 12:40:00  KY_South    65        41  # group 3
11 2021-04-08 12:41:00  KY_South    66        39  # group 3
12 2021-04-08 12:42:00  KY_South    66        39  # group 3
13 2021-04-08 12:43:00  KY_South    66        39  # group 3 <- match
14 2021-04-08 12:44:00  KY_South    68        40  # group 3
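
import pandas as pd

# boolean mask of the rows that satisfy the filter
qr = (probe_df["Temp"] < 67) & (probe_df["Humidity"] <= 40)

data = []
# for each location
for _, df in probe_df.groupby("Probe_loc"):
    # start a new group whenever the gap to the previous row exceeds one minute
    # (Date_time must be a datetime column; "1min" replaces the deprecated "1T")
    groups = (df["Date_time"].diff(1) > pd.Timedelta("1min")).cumsum()
    # keep the most recent row in each group that matches the filter
    df = df.groupby(groups, as_index=False).apply(lambda x: x[qr[x.index]].iloc[-1])
    data.append(df)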

Output:

>>> pd.concat(data).set_index(["Probe_loc", "Date_time"]).rename_axis(columns="Measurement")
Measurement                    Temp  Humidity
Probe_loc Date_time
KY_North  2021-04-08 12:33:00    65        40
KY_South  2021-04-08 12:33:00    66        39
          2021-04-08 12:43:00    66        39
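The loop above raises an `IndexError` if some run of consecutive timestamps contains no matching row, because `.iloc[-1]` is then applied to an empty frame. One possible loop-free variant is sketched below. It is not the answerer's method. It assumes the question's sample data, a parsed `Date_time` column, and a one-minute sampling interval; `run_id` is a helper column name introduced here for illustration.

```python
import pandas as pd

# Sample data from the question, with Date_time parsed to datetime64
north = [30, 31, 32, 33]
south = [30, 31, 32, 33, 34, 40, 41, 42, 43, 44]
probe_df = pd.DataFrame({
    "Date_time": pd.to_datetime([f"2021-04-08 12:{m}:00" for m in north + south]),
    "Probe_loc": ["KY_North"] * 4 + ["KY_South"] * 10,
    "Temp": [65, 65, 65, 65, 65, 66, 66, 66, 68, 65, 66, 66, 66, 68],
    "Humidity": [40, 40, 40, 40, 41, 39, 39, 39, 40, 41, 39, 39, 39, 40],
})

df = probe_df.sort_values(["Probe_loc", "Date_time"])
# True where the gap to the previous reading of the same probe exceeds one minute
new_run = df.groupby("Probe_loc")["Date_time"].diff() > pd.Timedelta("1min")
# Running count of breaks: rows in the same run share one label
df["run_id"] = new_run.cumsum()

mask = (df["Temp"] < 67) & (df["Humidity"] <= 40)
# Last matching row of each (probe, run); runs with no match simply drop out
result = df[mask].groupby(["Probe_loc", "run_id"]).tail(1).drop(columns="run_id")
print(result)
```

Filtering before grouping means empty runs never reach the aggregation step, and `tail(1)` keeps the original rows and index rather than building new ones.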
