Detecting and counting when a time-indexed float column of a Pandas DataFrame exceeds a threshold for 5 minutes or more

Problem description

I have a simple DataFrame containing heart rate (and related columns), indexed by unique, monotonically increasing datetime values, sampled at one-minute intervals (with some gaps where the sensor was removed). Here is a sample:

print(data)
                               End Time    HR  Min HR  Max HR
Start Time                                                   
2020-10-14 18:27:00 2020-10-14 18:27:59  87.0    84.0    89.0
2020-10-14 18:28:00 2020-10-14 18:28:59  89.0    85.0    94.0
2020-10-14 18:29:00 2020-10-14 18:29:59  87.0    84.0    90.0
2020-10-14 18:30:00 2020-10-14 18:30:59  91.0    87.0    96.0
2020-10-14 18:31:00 2020-10-14 18:31:59  95.0    91.0   100.0
...                                 ...   ...     ...     ...
2021-07-08 22:26:00 2021-07-08 22:26:59  73.0    70.0    76.0
2021-07-08 22:27:00 2021-07-08 22:27:59  76.0    74.0    79.0
2021-07-08 22:28:00 2021-07-08 22:28:59  71.0    70.0    74.0
2021-07-08 22:29:00 2021-07-08 22:29:59  71.0    69.0    74.0
2021-07-08 22:30:00 2021-07-08 22:30:59  74.0    72.0    78.0

[373234 rows x 4 columns]

I want to detect "peaks", defined as the heart rate exceeding a threshold for 5 minutes or longer. So, to be clear: when one peak lasts 10 minutes, it is still one peak, not two.

Of course, I can filter by the threshold:

maybe_peaks = data[data['HR'] >= threshold]
print(maybe_peaks)

                               End Time     HR  Min HR  Max HR
Start Time                                                    
2020-10-16 12:14:00 2020-10-16 12:14:59  104.0    95.0   108.0
2020-10-16 12:15:00 2020-10-16 12:15:59  111.0   106.0   115.0
2020-10-16 12:16:00 2020-10-16 12:16:59  132.0   105.0   157.0
2020-10-16 12:17:00 2020-10-16 12:17:59  126.0   106.0   159.0
2020-10-16 12:18:00 2020-10-16 12:18:59  109.0   108.0   111.0
...                                 ...    ...     ...     ...
2021-07-04 12:58:00 2021-07-04 12:58:59  103.0    97.0   116.0
2021-07-06 13:38:00 2021-07-06 13:38:59  106.0   103.0   108.0
2021-07-06 13:39:00 2021-07-06 13:39:59  104.0   102.0   109.0
2021-07-06 17:02:00 2021-07-06 17:02:59  121.0    98.0   135.0
2021-07-07 19:58:00 2021-07-07 19:58:59  110.0   105.0   116.0

[12940 rows x 4 columns]

But then the question becomes: how do I further filter out the rows whose time above the threshold is shorter than 5 minutes?

Of course, I could write a C-style row-by-row loop, but I feel that would be neither the most efficient nor the most elegant way.
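For reference, the row-by-row version I have in mind looks roughly like this (a sketch only; `count_peaks_loop` and the hard-coded series are illustrative, not real code):

```python
import pandas as pd

def count_peaks_loop(hr, threshold, min_len=5):
    """Count runs of consecutive samples with hr >= threshold
    lasting at least `min_len` minutes (one count per run)."""
    peaks = 0
    run = 0
    for value in hr:
        if value >= threshold:
            run += 1
            if run == min_len:  # count each run exactly once, when it reaches min_len
                peaks += 1
        else:
            run = 0
    return peaks

# Toy series: one 5-minute run above 103, then a 2-minute run -> 1 peak
hr = pd.Series([99, 104, 105, 106, 104, 103, 90, 104, 105])
print(count_peaks_loop(hr, threshold=103))  # 1
```

This works, but it iterates in Python and ignores gaps in the index, which is exactly what I would like to avoid.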

Ultimately, I want to build a plot with calendar days on the x-axis and the frequency/count of these peak events on the y-axis.

Any hints/pointers would be greatly appreciated.

Here is some sample data to help:

import pandas as pd
from pandas import Timestamp
test_dat = [[Timestamp('2021-06-25 12:00:00'), Timestamp('2021-06-25 12:00:59'), 99.0, 95.0, 105.0], [Timestamp('2021-06-25 12:01:00'), Timestamp('2021-06-25 12:01:59'), 96.0, 91.0, 102.0], [Timestamp('2021-06-25 12:02:00'), Timestamp('2021-06-25 12:02:59'), 100.0, 96.0, 105.0], [Timestamp('2021-06-25 12:03:00'), Timestamp('2021-06-25 12:03:59'), 96.0, 91.0, 100.0], [Timestamp('2021-06-25 12:04:00'), Timestamp('2021-06-25 12:04:59'), 93.0, 88.0, 102.0], [Timestamp('2021-06-25 12:05:00'), Timestamp('2021-06-25 12:05:59'), 105.0, 99.0, 110.0], [Timestamp('2021-06-25 12:06:00'), Timestamp('2021-06-25 12:06:59'), 102.0, 97.0, 109.0], [Timestamp('2021-06-25 12:07:00'), Timestamp('2021-06-25 12:07:59'), 96.0, 87.0, 102.0], [Timestamp('2021-06-25 12:08:00'), Timestamp('2021-06-25 12:08:59'), 96.0, 93.0, 101.0], [Timestamp('2021-06-25 12:09:00'), Timestamp('2021-06-25 12:09:59'), 96.0, 90.0, 106.0], [Timestamp('2021-06-25 12:10:00'), Timestamp('2021-06-25 12:10:59'), 100.0, 95.0, 110.0], [Timestamp('2021-06-25 12:11:00'), Timestamp('2021-06-25 12:11:59'), 100.0, 95.0, 113.0], [Timestamp('2021-06-25 12:12:00'), Timestamp('2021-06-25 12:12:59'), 98.0, 91.0, 103.0], [Timestamp('2021-06-25 12:13:00'), Timestamp('2021-06-25 12:13:59'), 101.0, 97.0, 108.0], [Timestamp('2021-06-25 12:14:00'), Timestamp('2021-06-25 12:14:59'), 98.0, 91.0, 102.0], [Timestamp('2021-06-25 12:15:00'), Timestamp('2021-06-25 12:15:59'), 100.0, 93.0, 110.0], [Timestamp('2021-06-25 12:16:00'), Timestamp('2021-06-25 12:16:59'), 96.0, 89.0, 104.0], [Timestamp('2021-06-25 12:17:00'), Timestamp('2021-06-25 12:17:59'), 98.0, 95.0, 104.0], [Timestamp('2021-06-25 12:18:00'), Timestamp('2021-06-25 12:18:59'), 95.0, 93.0, 99.0], [Timestamp('2021-06-25 12:19:00'), Timestamp('2021-06-25 12:19:59'), 94.0, 84.0, 104.0], [Timestamp('2021-06-25 12:20:00'), Timestamp('2021-06-25 12:20:59'), 94.0, 90.0, 99.0], [Timestamp('2021-06-25 12:21:00'), Timestamp('2021-06-25 12:21:59'), 98.0, 95.0, 100.0], [Timestamp('2021-06-25 
12:22:00'), Timestamp('2021-06-25 12:22:59'), 98.0, 97.0, 102.0], [Timestamp('2021-06-25 12:23:00'), Timestamp('2021-06-25 12:23:59'), 98.0, 96.0, 102.0], [Timestamp('2021-06-25 12:24:00'), Timestamp('2021-06-25 12:24:59'), 98.0, 96.0, 100.0], [Timestamp('2021-06-25 12:25:00'), Timestamp('2021-06-25 12:25:59'), 96.0, 95.0, 100.0], [Timestamp('2021-06-25 12:26:00'), Timestamp('2021-06-25 12:26:59'), 102.0, 98.0, 105.0], [Timestamp('2021-06-25 12:27:00'), Timestamp('2021-06-25 12:27:59'), 97.0, 92.0, 103.0], [Timestamp('2021-06-25 12:28:00'), Timestamp('2021-06-25 12:28:59'), 92.0, 87.0, 99.0], [Timestamp('2021-06-25 12:29:00'), Timestamp('2021-06-25 12:29:59'), 96.0, 94.0, 99.0], [Timestamp('2021-06-25 12:30:00'), Timestamp('2021-06-25 12:30:59'), 97.0, 93.0, 100.0], [Timestamp('2021-06-25 12:31:00'), Timestamp('2021-06-25 12:31:59'), 101.0, 97.0, 103.0], [Timestamp('2021-06-25 12:32:00'), Timestamp('2021-06-25 12:32:59'), 99.0, 95.0, 103.0], [Timestamp('2021-06-25 12:33:00'), Timestamp('2021-06-25 12:33:59'), 101.0, 93.0, 105.0], [Timestamp('2021-06-25 12:34:00'), Timestamp('2021-06-25 12:34:59'), 98.0, 96.0, 101.0], [Timestamp('2021-06-25 12:35:00'), Timestamp('2021-06-25 12:35:59'), 100.0, 93.0, 105.0], [Timestamp('2021-06-25 12:36:00'), Timestamp('2021-06-25 12:36:59'), 103.0, 101.0, 108.0], [Timestamp('2021-06-25 12:37:00'), Timestamp('2021-06-25 12:37:59'), 105.0, 101.0, 111.0], [Timestamp('2021-06-25 12:38:00'), Timestamp('2021-06-25 12:38:59'), 106.0, 103.0, 114.0], [Timestamp('2021-06-25 12:39:00'), Timestamp('2021-06-25 12:39:59'), 107.0, 104.0, 109.0], [Timestamp('2021-06-25 12:40:00'), Timestamp('2021-06-25 12:40:59'), 101.0, 95.0, 109.0], [Timestamp('2021-06-25 12:41:00'), Timestamp('2021-06-25 12:41:59'), 99.0, 96.0, 103.0], [Timestamp('2021-06-25 12:42:00'), Timestamp('2021-06-25 12:42:59'), 99.0, 96.0, 105.0], [Timestamp('2021-06-25 12:43:00'), Timestamp('2021-06-25 12:43:59'), 96.0, 95.0, 98.0], [Timestamp('2021-06-25 12:44:00'), 
Timestamp('2021-06-25 12:44:59'), 96.0, 94.0, 99.0], [Timestamp('2021-06-25 12:45:00'), Timestamp('2021-06-25 12:45:59'), 102.0, 96.0, 110.0], [Timestamp('2021-06-25 12:46:00'), Timestamp('2021-06-25 12:46:59'), 105.0, 102.0, 109.0], [Timestamp('2021-06-25 12:47:00'), Timestamp('2021-06-25 12:47:59'), 104.0, 100.0, 108.0], [Timestamp('2021-06-25 12:48:00'), Timestamp('2021-06-25 12:48:59'), 100.0, 98.0, 103.0], [Timestamp('2021-06-25 12:49:00'), Timestamp('2021-06-25 12:49:59'), 103.0, 99.0, 110.0], [Timestamp('2021-06-25 12:50:00'), Timestamp('2021-06-25 12:50:59'), 106.0, 99.0, 111.0], [Timestamp('2021-06-25 12:51:00'), Timestamp('2021-06-25 12:51:59'), 100.0, 95.0, 104.0], [Timestamp('2021-06-25 12:52:00'), Timestamp('2021-06-25 12:52:59'), 108.0, 102.0, 113.0], [Timestamp('2021-06-25 12:53:00'), Timestamp('2021-06-25 12:53:59'), 113.0, 106.0, 116.0], [Timestamp('2021-06-25 12:54:00'), Timestamp('2021-06-25 12:54:59'), 109.0, 105.0, 113.0], [Timestamp('2021-06-25 12:55:00'), Timestamp('2021-06-25 12:55:59'), 103.0, 101.0, 110.0], [Timestamp('2021-06-25 12:56:00'), Timestamp('2021-06-25 12:56:59'), 104.0, 94.0, 109.0], [Timestamp('2021-06-25 12:57:00'), Timestamp('2021-06-25 12:57:59'), 93.0, 82.0, 107.0], [Timestamp('2021-06-25 12:58:00'), Timestamp('2021-06-25 12:58:59'), 99.0, 94.0, 104.0], [Timestamp('2021-06-25 12:59:00'), Timestamp('2021-06-25 12:59:59'), 98.0, 92.0, 103.0], [Timestamp('2021-06-25 13:00:00'), Timestamp('2021-06-25 13:00:59'), 98.0, 95.0, 102.0]]
df = pd.DataFrame(test_dat, columns=['Start Time', 'End Time', 'HR', 'Min HR', 'Max HR']).set_index('Start Time')

Edit: it doesn't really matter, but the threshold I use is 103 BPM. (It is computed from statistics, not a magic number.)

Tags: python-3.x, pandas, dataframe, datetime

Solution


Let's do this step by step

df = df.asfreq('1T').reset_index()

th = 96  # threshold chosen for the sample data; the question mentions 103 BPM
th_mask = df['HR'].gt(th)
streaks = (~th_mask).cumsum() # continuous streaks

peaks = df[th_mask].groupby(streaks)['Start Time']\
                   .agg(start_time='first', streak_count='count')

peaks = peaks.query('streak_count >= 5')  # hr > th for at least 5 min
peaks_freq_per_day = peaks.groupby(peaks['start_time'].dt.date)['streak_count'].count()
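The key trick above is labelling consecutive runs with a cumulative sum over the inverted mask; each False increments the counter, so every True-run shares one label. On a toy series, independent of the data above:

```python
import pandas as pd

mask = pd.Series([True, True, False, True, True, True, False])
labels = (~mask).cumsum()            # run labels for the whole series
print(labels.tolist())               # [0, 0, 1, 1, 1, 1, 2]
print(labels[mask].nunique())        # 2 -> two distinct True-runs
```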

Explanation

  • Reindex the dataframe to a 1 min frequency, so that rows are accounted for where no sensor data is available.
  • Create a boolean mask th_mask to identify the rows where the heart rate is greater than the given threshold th
>>> th_mask

0      True
1     False
2      True
3     False
4     False
      ...  
56     True
57    False
58     True
59     True
60     True
Name: HR, Length: 61, dtype: bool
  • Use cumsum on the inverted th_mask to identify the consecutive streaks where the heart rate stays above the threshold th
>>> streaks

0      0
1      1
2      1
3      2
4      3
      ..
56    15
57    16
58    16
59    16
60    16
Name: HR, Length: 61, dtype: int64
  • Group the dataframe on the consecutive streaks and aggregate Start Time using first and count
>>> peaks
            start_time  streak_count
HR                                
0  2021-06-25 12:00:00           1
1  2021-06-25 12:02:00           1
3  2021-06-25 12:05:00           2
6  2021-06-25 12:10:00           6
7  2021-06-25 12:17:00           1
10 2021-06-25 12:21:00           4
11 2021-06-25 12:26:00           2
13 2021-06-25 12:30:00          13
15 2021-06-25 12:45:00          12
16 2021-06-25 12:58:00           3
  • Filter peaks to select only the rows where the heart rate stays above the threshold for at least 5 min
>>> peaks
            start_time  streak_count
HR                                
6  2021-06-25 12:10:00           6
13 2021-06-25 12:30:00          13
15 2021-06-25 12:45:00          12
  • Now we can determine the frequency of peaks for each calendar day by grouping peaks on the date and aggregating with count
>>> peaks_freq_per_day

start_time
2021-06-25    3
Name: streak_count, dtype: int64
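Finally, the per-day counts can be turned into the plot the question asks for. A minimal sketch (the stand-in Series and the second day in it are illustrative, not from the real data):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, for scripted use
import matplotlib.pyplot as plt

# Stand-in for the peaks_freq_per_day Series computed above
# (the real day with 3 peaks, plus a hypothetical second day).
peaks_freq_per_day = pd.Series(
    {pd.Timestamp('2021-06-25').date(): 3,
     pd.Timestamp('2021-06-26').date(): 1},
    name='streak_count')

ax = peaks_freq_per_day.plot(kind='bar')  # calendar days on x, counts on y
ax.set_xlabel('Date')
ax.set_ylabel('Peaks per day')
plt.tight_layout()
plt.savefig('peaks_per_day.png')
```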
