sql - In SQL how do I find the first record per-user if it's within a time slice, without scanning the entire DB
Problem description
I've got a table, user_requests, that basically looks like this:
user_id | request_timestamp   | request_type | other_metadata
--------|---------------------|--------------|---------------
user1   | 2018-11-01:04:04:41 | type1        | opaquedata_A
user2   | 2018-11-01:04:03:41 | type2        | opaquedata_B
user1   | 2018-11-01:04:01:41 | type1        | opaquedata_C
user3   | 2018-11-01:04:05:41 | type3        | opaquedata_D
user4   | 2018-11-01:04:01:41 | type4        | opaquedata_E
And it is huge. Doing any operation over the entire thing is absolutely untenable; everything needs to be scoped, like "which queries were most common this month". No one ever checks it overall.
What I'm trying to do is some analysis on the first requests for several users. I absolutely do not need the first requests of every user or over all time, as long as it's a representative sample.
However, I'm running into a problem where all my usual attempts to restrict this are finding "the first request within bounds", not "the first request if it's within bounds":
SELECT DISTINCT user_id,
    first_value(request_type) OVER (PARTITION BY user_id ORDER BY request_timestamp
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) requestType,
    first_value(other_metadata) OVER (PARTITION BY user_id ORDER BY request_timestamp
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) otherMetadata,
    first_value(request_timestamp) OVER (PARTITION BY user_id ORDER BY request_timestamp
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) utteranceTimestamp
FROM user_requests
WHERE request_timestamp >= '2018-11-01' AND request_timestamp < '2018-12-01'
This finds the earliest request from a user in November, when what I want is the earliest request from a user overall, if that request is in November.
Any idea how I can get what I want while still writing queries that don't take hours to complete?
Solution
You want an adjusted form of another kind of greatest-n-per-group query:
SELECT Curr.user_id, Curr.request_type, Curr.other_metadata, Curr.request_timestamp
FROM User_Requests Curr
WHERE Curr.request_timestamp >= '2018-11-01'
  AND Curr.request_timestamp < '2018-12-01'
  AND NOT EXISTS (SELECT 1
                  FROM User_Requests Prev
                  WHERE Prev.user_id = Curr.user_id
                    AND Prev.request_timestamp < Curr.request_timestamp)
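...this finds all requests in the given time range, then throws out any row for which an earlier request exists (in the current month or otherwise). The effect is not only to get the earliest request in the month, but also to ignore requests in the desired time range whenever the user has any prior request.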
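To make the filtering concrete, here is a minimal sketch that runs the NOT EXISTS query against the question's sample rows. It assumes SQLite through Python's sqlite3 module purely so that it is self-contained; the timestamps are rewritten as ISO-style strings so that text comparison sorts chronologically, and the October row for user2 is invented for illustration. The trailing ORDER BY is added only to make the output deterministic.

```python
import sqlite3

# Sample rows from the question (timestamps rewritten as ISO-style text),
# plus one invented October row so that user2's first-ever request falls
# outside November.
ROWS = [
    ("user1", "2018-11-01 04:04:41", "type1", "opaquedata_A"),
    ("user2", "2018-11-01 04:03:41", "type2", "opaquedata_B"),
    ("user1", "2018-11-01 04:01:41", "type1", "opaquedata_C"),
    ("user3", "2018-11-01 04:05:41", "type3", "opaquedata_D"),
    ("user4", "2018-11-01 04:01:41", "type4", "opaquedata_E"),
    ("user2", "2018-10-15 09:00:00", "type2", "opaquedata_X"),  # invented row
]

QUERY = """
SELECT Curr.user_id, Curr.request_type, Curr.other_metadata, Curr.request_timestamp
FROM User_Requests Curr
WHERE Curr.request_timestamp >= '2018-11-01'
  AND Curr.request_timestamp <  '2018-12-01'
  AND NOT EXISTS (SELECT 1
                  FROM User_Requests Prev
                  WHERE Prev.user_id = Curr.user_id
                    AND Prev.request_timestamp < Curr.request_timestamp)
ORDER BY Curr.user_id
"""


def first_requests_in_november():
    """Load the sample rows into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE User_Requests ("
        "user_id TEXT, request_timestamp TEXT, "
        "request_type TEXT, other_metadata TEXT)"
    )
    conn.executemany("INSERT INTO User_Requests VALUES (?, ?, ?, ?)", ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in first_requests_in_november():
        print(row)
```

With these rows, user1 comes back with the 04:01:41 request rather than the later one, and user2 is dropped entirely because of the earlier October request.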
For best results, you'll want an index on (user_id, request_timestamp).
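A sketch of what that index could look like, again assuming SQLite via Python's sqlite3 so it can be run as-is. The index name idx_user_requests_user_ts is made up, and EXPLAIN QUERY PLAN is SQLite-specific; other engines have their own EXPLAIN variants with different output, so treat this as an illustration of the check rather than a recipe for your database.

```python
import sqlite3

# Hypothetical index name; the column order follows the recommendation above.
CREATE_INDEX = """
CREATE INDEX idx_user_requests_user_ts
    ON User_Requests (user_id, request_timestamp)
"""

ANTI_JOIN = """
SELECT Curr.user_id, Curr.request_type, Curr.other_metadata, Curr.request_timestamp
FROM User_Requests Curr
WHERE Curr.request_timestamp >= '2018-11-01'
  AND Curr.request_timestamp <  '2018-12-01'
  AND NOT EXISTS (SELECT 1
                  FROM User_Requests Prev
                  WHERE Prev.user_id = Curr.user_id
                    AND Prev.request_timestamp < Curr.request_timestamp)
"""


def plan_with_index():
    """Create an empty table plus the index and return the query plan steps."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE User_Requests ("
        "user_id TEXT, request_timestamp TEXT, "
        "request_type TEXT, other_metadata TEXT)"
    )
    conn.execute(CREATE_INDEX)
    # The last column of each plan row is the human-readable description.
    plan = conn.execute("EXPLAIN QUERY PLAN " + ANTI_JOIN).fetchall()
    conn.close()
    return [row[-1] for row in plan]


if __name__ == "__main__":
    for step in plan_with_index():
        print(step)
```

The thing to look for is that the Prev lookup is served by the index rather than by a scan of the table. Note that this index alone does not narrow the outer date-range filter on Curr.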
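(Note that I'm assuming the optimizer is good and converts your date literals to the appropriate type for a range search. You may want to verify that request_timestamp itself isn't being cast.)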
Bonus: the LEFT JOIN exclusion form, in case it performs better:
SELECT Curr.user_id, Curr.request_type, Curr.other_metadata, Curr.request_timestamp
FROM User_Requests Curr
LEFT JOIN User_Requests Prev
       ON Prev.user_id = Curr.user_id
      AND Prev.request_timestamp < Curr.request_timestamp
WHERE Curr.request_timestamp >= '2018-11-01'
  AND Curr.request_timestamp < '2018-12-01'
  AND Prev.user_id IS NULL