In SQL, how do I find the first record per user if it's within a time slice, without scanning the entire DB?

Question

I've got a table, user_requests, that basically looks like this:

  user_id  |    request_timestamp    | request_type | other_metadata
-----------|-------------------------|--------------|----------------
  user1    |    2018-11-01:04:04:41  |    type1     | opaquedata_A
  user2    |    2018-11-01:04:03:41  |    type2     | opaquedata_B
  user1    |    2018-11-01:04:01:41  |    type1     | opaquedata_C
  user3    |    2018-11-01:04:05:41  |    type3     | opaquedata_D
  user4    |    2018-11-01:04:01:41  |    type4     | opaquedata_E

And it is huge. Doing any operation over the entire thing is absolutely untenable; every query has to be scoped, like "which queries were most common this month". No one ever checks it overall.

What I'm trying to do is some analysis on the first requests for several users. I absolutely do not need the first requests of every user or over all time, as long as it's a representative sample.

However, I'm running into a problem where all my usual attempts to restrict this find "the first request within bounds", not "the first request, if it's within bounds":

SELECT DISTINCT user_id,
              first_value(request_type) over (PARTITION BY user_id ORDER BY request_timestamp
                rows BETWEEN unbounded preceding and unbounded following) requestType,
              first_value(other_metadata) over (PARTITION BY user_id ORDER BY request_timestamp
                rows BETWEEN unbounded preceding and unbounded following) otherMetadata,
              first_value(request_timestamp) over (PARTITION BY user_id ORDER BY request_timestamp
                rows BETWEEN unbounded preceding and unbounded following) utteranceTimestamp
FROM user_requests
WHERE request_timestamp >= '2018-11-01' AND request_timestamp < '2018-12-01'

This finds the earliest request from each user in November, when what I want is the earliest request from a user overall, if that request is in November.

Any idea how I can get what I want while still writing queries that don't take hours to complete?

Tags: sql, amazon-redshift

Solution


You want an adjusted form of a different kind of query:

SELECT Curr.user_id, Curr.request_type, Curr.other_metadata, Curr.request_timestamp
FROM User_Requests Curr
WHERE  Curr.request_timestamp >='2018-11-01' 
       AND Curr.request_timestamp < '2018-12-01'
       AND NOT EXISTS (SELECT 1
                       FROM User_Requests Prev
                       WHERE Prev.user_id = Curr.user_id
                             AND Prev.request_timestamp < Curr.request_timestamp)

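...this finds all requests in the given time range, then throws out any request for which an earlier request exists (in the same month or at any other time). This both gets the earliest request in the month and has the effect of ignoring requests in the desired time range if the user had any prior requests.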
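To see the filtering behavior on a small scale, here is a minimal sketch that runs the same NOT EXISTS query against a throwaway table. The table name user_requests_demo, the column types, the October row for user1, and the second row for user2 are all made up for illustration; they are not from the original data.

```sql
-- Hypothetical demo table; the column types are an assumption.
CREATE TEMP TABLE user_requests_demo (
    user_id           VARCHAR(16),
    request_timestamp TIMESTAMP,
    request_type      VARCHAR(16),
    other_metadata    VARCHAR(32)
);

INSERT INTO user_requests_demo VALUES
    -- user1's true first request falls BEFORE the window (made-up row)
    ('user1', '2018-10-15 09:00:00', 'type1', 'opaquedata_X'),
    ('user1', '2018-11-01 04:01:41', 'type1', 'opaquedata_C'),
    ('user1', '2018-11-01 04:04:41', 'type1', 'opaquedata_A'),
    -- user2's true first request falls INSIDE the window
    ('user2', '2018-11-01 04:03:41', 'type2', 'opaquedata_B'),
    ('user2', '2018-11-20 12:00:00', 'type2', 'opaquedata_Y');

-- Same shape as the query above, pointed at the demo table.
SELECT Curr.user_id, Curr.request_type, Curr.other_metadata, Curr.request_timestamp
FROM user_requests_demo Curr
WHERE Curr.request_timestamp >= '2018-11-01'
  AND Curr.request_timestamp <  '2018-12-01'
  AND NOT EXISTS (SELECT 1
                  FROM user_requests_demo Prev
                  WHERE Prev.user_id = Curr.user_id
                    AND Prev.request_timestamp < Curr.request_timestamp);

-- Returns a single row: user2, type2, opaquedata_B, 2018-11-01 04:03:41
```

Both of user1's November rows are discarded because the October row precedes them, and user2's later November row is discarded because of the earlier one. The windowed first_value query from the question would instead have reported user1's 04:01:41 row as a "first" request.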

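For best results, you'll want an index on (user_id, request_timestamp).
(Note that I'm assuming the optimizer is good and converts your date literals to the appropriate type for the range search. You may want to verify that request_timestamp itself isn't being cast.)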
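As a rough sketch of both suggestions, assuming an engine that supports secondary indexes (PostgreSQL syntax is shown) and assuming request_timestamp is a TIMESTAMP column: the index name below is hypothetical. Amazon Redshift, which the question is tagged with, does not support CREATE INDEX; sort keys and distribution keys play the comparable role there, and which columns suit them depends on the workload.

```sql
-- Hypothetical index name; PostgreSQL-style syntax, not valid on Redshift.
CREATE INDEX user_requests_user_ts_idx
    ON user_requests (user_id, request_timestamp);

-- Typed literals compare against the column in its own type, so the engine
-- has no reason to cast request_timestamp. EXPLAIN shows the plan without
-- running the query; look for a cast or function wrapped around the column.
EXPLAIN
SELECT Curr.user_id, Curr.request_type, Curr.other_metadata, Curr.request_timestamp
FROM user_requests Curr
WHERE Curr.request_timestamp >= TIMESTAMP '2018-11-01 00:00:00'
  AND Curr.request_timestamp <  TIMESTAMP '2018-12-01 00:00:00'
  AND NOT EXISTS (SELECT 1
                  FROM user_requests Prev
                  WHERE Prev.user_id = Curr.user_id
                    AND Prev.request_timestamp < Curr.request_timestamp);
```

If the plan shows the column being converted on every row, the range predicate is unlikely to be used to narrow the scan, and the literals or the column type are worth revisiting.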


Bonus: the LEFT JOIN exclusion form, in case it performs better.

SELECT Curr.user_id, Curr.request_type, Curr.other_metadata, Curr.request_timestamp
FROM User_Requests Curr
LEFT JOIN User_Requests Prev
       ON Prev.user_id = Curr.user_id
          AND Prev.request_timestamp < Curr.request_timestamp
WHERE  Curr.request_timestamp >='2018-11-01' 
       AND Curr.request_timestamp < '2018-12-01'
       AND Prev.user_id IS NULL
