How could I get the top X records per distinct column within each range of Y timestamp, for some other column Z?

Problem description

So far, I'm using a lot of small queries in a loop to achieve this. I'm hoping there is a way to consolidate them into a single query, since these queries are starting to take hours to finish (millions of rows), and some planned tests will require twenty times that amount of data. Is there a way to write a query that gets X records per distinct device_id (a column) for each time step Y for test_id Z?

With some example data:

| ts                  | test_id | device_id | data     |
| 2018-06-25 06:00:00 | 0       | 1         | "blah00" |
| 2018-06-25 08:00:00 | 1       | 1         | "blah01" |
| 2018-06-25 08:00:00 | 1       | 2         | "blah02" |
| 2018-06-25 08:00:02 | 1       | 1         | "blah03" |
| 2018-06-25 08:00:02 | 1       | 2         | "blah04" |
| 2018-06-25 08:00:05 | 1       | 1         | "blah05" |
| 2018-06-25 08:00:05 | 1       | 2         | "blah06" |
| 2018-06-25 08:00:08 | 1       | 1         | "blah07" |
| 2018-06-25 08:00:08 | 1       | 2         | "blah08" |
| 2018-06-25 08:00:10 | 1       | 1         | "blah09" |
| 2018-06-25 08:00:10 | 1       | 2         | "blah10" |
| 2018-06-25 08:00:12 | 1       | 1         | "blah11" |
| 2018-06-25 08:00:12 | 1       | 2         | "blah12" |
| 2018-06-25 08:00:15 | 1       | 1         | "blah13" |
| 2018-06-25 08:00:18 | 1       | 1         | "blah14" |
| 2018-06-25 08:00:20 | 1       | 1         | "blah15" |
| 2018-06-25 08:00:20 | 1       | 2         | "blah16" |

If I wanted the top 3 records per device for every 10 seconds for test_id 1, I'd like to get the result:

| ts                  | test_id | device_id | data     |
| 2018-06-25 08:00:00 | 1       | 1         | "blah01" |
| 2018-06-25 08:00:00 | 1       | 2         | "blah02" |
| 2018-06-25 08:00:02 | 1       | 1         | "blah03" |
| 2018-06-25 08:00:02 | 1       | 2         | "blah04" |
| 2018-06-25 08:00:05 | 1       | 1         | "blah05" |
| 2018-06-25 08:00:05 | 1       | 2         | "blah06" |
| 2018-06-25 08:00:10 | 1       | 1         | "blah09" |
| 2018-06-25 08:00:10 | 1       | 2         | "blah10" |
| 2018-06-25 08:00:12 | 1       | 1         | "blah11" |
| 2018-06-25 08:00:12 | 1       | 2         | "blah12" |
| 2018-06-25 08:00:15 | 1       | 1         | "blah13" |
| 2018-06-25 08:00:20 | 1       | 1         | "blah15" |
| 2018-06-25 08:00:20 | 1       | 2         | "blah16" |

A few things can't be taken for granted. A device might fail to record for some time, so I can't guarantee that each device has the same number of rows per time frame (as I attempted to replicate in the sample data). All of the devices might also be paused for some time, so it is possible to have no data at all for one or more consecutive time frames.

My current queries (and surrounding pseudo-code) are below. Angle brackets mark placeholders that would be set to the applicable value:

For each distinct device_id in test_id Z


SELECT ts FROM data_log
WHERE (test_id=<Z> AND device_id=<device_id>)
ORDER BY ts 
LIMIT 1

Store as oldest

SELECT ts FROM data_log
WHERE (test_id=<Z> AND device_id=<device_id>)
ORDER BY ts DESC
LIMIT 1

Store as newest

For currentTime = oldest; currentTime <= newest; currentTime += timestep Y

SELECT * FROM data_log
WHERE (test_id=<Z> AND device_id=<device_id> AND ts >= <currentTime> AND ts < <currentTime + Y>)
ORDER BY ts
LIMIT <X>

Tags: mysql

Solution


First, make sure you have a key on (test_id, ts), with the columns in that order; the order is important. Then you can do

select * from t where test_id = 1 order by ts

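and process the output on the client, filtering out the records you don't want. If you have fewer than 10 M records per test ID, this should give you decent, though not stellar, performance.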
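The client-side filtering could look roughly like the sketch below. It is only an illustration under some assumptions: rows arrive already sorted by ts as (ts, test_id, device_id, data) tuples, ts is a `datetime`, and the time windows are anchored at the first row's timestamp. The function name `top_x_per_window` and its parameters are made up for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def top_x_per_window(rows, x, step):
    """Yield the first `x` rows per device_id inside each `step`-long window.

    Assumptions (not from the original answer):
      - `rows` is sorted by ts, e.g. the output of the ORDER BY ts query.
      - each row is a (ts, test_id, device_id, data) tuple, ts a datetime.
      - windows are anchored at the ts of the first row.
    """
    origin = None
    current_window = None
    kept = defaultdict(int)  # device_id -> rows kept in the current window

    for row in rows:
        ts, _test_id, device_id, _data = row
        if origin is None:
            origin = ts
        # timedelta // timedelta gives the integer window index.
        window = (ts - origin) // step
        if window != current_window:
            # New window (empty windows are simply skipped): reset counters.
            current_window = window
            kept.clear()
        if kept[device_id] < x:
            kept[device_id] += 1
            yield row
```

Because it is a generator that only keeps per-device counters for the current window, it can be fed rows one at a time as they are fetched, for example `top_x_per_window(fetched_rows, 3, timedelta(seconds=10))` for the top 3 records per device every 10 seconds.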

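If you are good at C++ and want better performance, you can write a UDF. You will need a few tricks for this. Allocate memory and store it in initid->ptr, using it as a buffer to keep state. In that memory region you can remember information about previously seen rows, which lets you decide whether the current row should be included in the result. Your query would look something like this: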

select * from t where test_id = 1 and should_include_test(ts,device_id) order by ts

