How do I calculate retention rate in SQL?

Problem description

(The dialect can be Vertica, Impala, or Databricks.)

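I'm trying to calculate user retention for day 0, day 1, and so on up to day 7. I treat all users who logged in on a given date as d0 (whether they are new or returning users) and look at how many of them come back on d1, d2, etc. Suppose we have the following data: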

user | login_date
-----------------
001  | 2019-11-01
002  | 2019-11-01
003  | 2019-11-01
004  | 2019-11-01
005  | 2019-11-01
001  | 2019-11-02
003  | 2019-11-02
004  | 2019-11-02
006  | 2019-11-02
007  | 2019-11-02
002  | 2019-11-03
003  | 2019-11-03
004  | 2019-11-03
005  | 2019-11-03
008  | 2019-11-03
001  | 2019-11-04
002  | 2019-11-04
006  | 2019-11-04
007  | 2019-11-04
009  | 2019-11-04

I would like to see something like this:

date       | d0 | d1 | d2 | d3
------------------------------
2019-11-01 |  5 |  3 |  4 |  2
2019-11-02 |  5 |  2 |  3 |
2019-11-03 |  5 |  1 |    |
2019-11-04 |  5 |    |    |

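As you can see, d0 is 5 even though some of those users had logged in before. For example, on 2019-11-02 we have 001, 003, 004, 006, and 007, and 2 of them came back the next day.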

So far I have written a query that comes close to what I want, but is not quite the same.

WITH cte1 AS (
    SELECT
        user, 
        login_date,
        FIRST_VALUE(login_date) OVER (PARTITION BY user ORDER BY login_date) AS first_login_day,
        DATEDIFF(login_date, first_login_day) AS days_since_first_play
    FROM
        table
)
SELECT
    first_login_day,
    SUM(CASE WHEN days_since_first_play = 0 THEN 1 ELSE 0 END) AS d0,
    SUM(CASE WHEN days_since_first_play = 1 THEN 1 ELSE 0 END) AS d1,
    SUM(CASE WHEN days_since_first_play = 2 THEN 1 ELSE 0 END) AS d2,
    SUM(CASE WHEN days_since_first_play = 3 THEN 1 ELSE 0 END) AS d3,
    SUM(CASE WHEN days_since_first_play = 4 THEN 1 ELSE 0 END) AS d4,
    SUM(CASE WHEN days_since_first_play = 5 THEN 1 ELSE 0 END) AS d5,
    SUM(CASE WHEN days_since_first_play = 6 THEN 1 ELSE 0 END) AS d6,
    SUM(CASE WHEN days_since_first_play = 7 THEN 1 ELSE 0 END) AS d7
FROM
    cte1
GROUP BY
    first_login_day
ORDER BY
    first_login_day

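The problem with this query is that it drops returning users from the date I am looking at. For example, with the same data, because 001, 003, and 004 already logged in on 2019-11-01, the d0 value for 2019-11-02 will be 2 instead of 5. So this query only works when I am looking at new users.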
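To make the difference concrete, here is a minimal sketch that loads the sample rows into an in-memory SQLite database through Python's sqlite3 module and groups users by their first login date, as the query above does. The table name logins, the column user_id, and the julianday-based day difference are assumptions made for SQLite; they stand in for table, user, and DATEDIFF in the original query.

```python
import sqlite3

# Sample logins from the question, keyed by date.
LOGINS = {
    "2019-11-01": ["001", "002", "003", "004", "005"],
    "2019-11-02": ["001", "003", "004", "006", "007"],
    "2019-11-03": ["002", "003", "004", "005", "008"],
    "2019-11-04": ["001", "002", "006", "007", "009"],
}

# Cohort by first login date (SQLite syntax; names are hypothetical).
COHORT_QUERY = """
WITH first_login AS (
    SELECT user_id, MIN(login_date) AS first_login_day
    FROM logins
    GROUP BY user_id
),
cte1 AS (
    SELECT l.user_id,
           f.first_login_day,
           CAST(julianday(l.login_date) - julianday(f.first_login_day)
                AS INTEGER) AS days_since_first_play
    FROM logins l
    JOIN first_login f ON f.user_id = l.user_id
)
SELECT first_login_day,
       SUM(CASE WHEN days_since_first_play = 0 THEN 1 ELSE 0 END) AS d0,
       SUM(CASE WHEN days_since_first_play = 1 THEN 1 ELSE 0 END) AS d1
FROM cte1
GROUP BY first_login_day
ORDER BY first_login_day
"""


def first_login_cohorts(logins):
    """Run the first-login cohort query on the given {date: [users]} data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logins (user_id TEXT, login_date TEXT)")
    rows = [(user, day) for day, users in logins.items() for user in users]
    conn.executemany("INSERT INTO logins VALUES (?, ?)", rows)
    result = conn.execute(COHORT_QUERY).fetchall()
    conn.close()
    return result


COHORTS = first_login_cohorts(LOGINS)
for row in COHORTS:
    print(row)
```

On the sample data the row for 2019-11-02 comes out with d0 = 2, because only 006 and 007 logged in for the first time on that date.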

Can the query be changed to get what I want? Thanks in advance!

Tags: sql, vertica

Solution


A few left self-joins combined with distinct user counts will give that kind of result.

SELECT t0.login_date,
COUNT(distinct t0.user) as d0,
COUNT(distinct t1.user) as d1,
COUNT(distinct t2.user) as d2,
COUNT(distinct t3.user) as d3
FROM table t0
LEFT JOIN table t1 
  ON t1.user = t0.user
 AND t1.login_date = t0.login_date + 1
LEFT JOIN table t2 
  ON t2.user = t0.user
 AND t2.login_date = t0.login_date + 2
LEFT JOIN table t3 
  ON t3.user = t0.user
 AND t3.login_date = t0.login_date + 3
GROUP BY t0.login_date
ORDER BY t0.login_date
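
As a quick check, the self-join approach can be run against the sample data from the question. The sketch below uses Python's sqlite3 module with an in-memory database. The table name logins and the column user_id are hypothetical replacements for table and user, and date(x, '+N day') is SQLite's way of writing the login_date + N arithmetic used above; other dialects will differ.

```python
import sqlite3

# Sample logins from the question, keyed by date.
LOGINS = {
    "2019-11-01": ["001", "002", "003", "004", "005"],
    "2019-11-02": ["001", "003", "004", "006", "007"],
    "2019-11-03": ["002", "003", "004", "005", "008"],
    "2019-11-04": ["001", "002", "006", "007", "009"],
}

# Every dN join is anchored on t0, so dN counts users seen on day 0 AND day N.
RETENTION_QUERY = """
SELECT t0.login_date,
       COUNT(DISTINCT t0.user_id) AS d0,
       COUNT(DISTINCT t1.user_id) AS d1,
       COUNT(DISTINCT t2.user_id) AS d2,
       COUNT(DISTINCT t3.user_id) AS d3
FROM logins t0
LEFT JOIN logins t1
  ON t1.user_id = t0.user_id
 AND t1.login_date = date(t0.login_date, '+1 day')
LEFT JOIN logins t2
  ON t2.user_id = t0.user_id
 AND t2.login_date = date(t0.login_date, '+2 day')
LEFT JOIN logins t3
  ON t3.user_id = t0.user_id
 AND t3.login_date = date(t0.login_date, '+3 day')
GROUP BY t0.login_date
ORDER BY t0.login_date
"""


def daily_retention(logins):
    """Return (login_date, d0, d1, d2, d3) rows for {date: [users]} data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logins (user_id TEXT, login_date TEXT)")
    rows = [(user, day) for day, users in logins.items() for user in users]
    conn.executemany("INSERT INTO logins VALUES (?, ?)", rows)
    result = conn.execute(RETENTION_QUERY).fetchall()
    conn.close()
    return result


RETENTION = daily_retention(LOGINS)
for row in RETENTION:
    print(row)
```

The counts match the table requested in the question, except that days with no data yet show 0 rather than a blank, since COUNT over unmatched rows yields 0.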

But what if the login dates need to be consecutive?
Then just change the JOIN criteria to:

FROM table t0
LEFT JOIN table t1 
  ON t1.user = t0.user
 AND t1.login_date = t0.login_date + 1
LEFT JOIN table t2 
  ON t2.user = t1.user
 AND t2.login_date = t1.login_date + 1
LEFT JOIN table t3 
  ON t3.user = t2.user
 AND t3.login_date = t2.login_date + 1
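
The chained variant can be tried the same way. This is a sketch under the same assumptions (SQLite via Python's sqlite3, a hypothetical logins table with a user_id column, and date(x, '+1 day') for the date arithmetic). Because each join hangs off the previous one, a user only counts toward dN if they also logged in on every day before it.

```python
import sqlite3

# Sample logins from the question, keyed by date.
LOGINS = {
    "2019-11-01": ["001", "002", "003", "004", "005"],
    "2019-11-02": ["001", "003", "004", "006", "007"],
    "2019-11-03": ["002", "003", "004", "005", "008"],
    "2019-11-04": ["001", "002", "006", "007", "009"],
}

# Each join is anchored on the previous day's row, not on t0.
CONSECUTIVE_QUERY = """
SELECT t0.login_date,
       COUNT(DISTINCT t0.user_id) AS d0,
       COUNT(DISTINCT t1.user_id) AS d1,
       COUNT(DISTINCT t2.user_id) AS d2,
       COUNT(DISTINCT t3.user_id) AS d3
FROM logins t0
LEFT JOIN logins t1
  ON t1.user_id = t0.user_id
 AND t1.login_date = date(t0.login_date, '+1 day')
LEFT JOIN logins t2
  ON t2.user_id = t1.user_id
 AND t2.login_date = date(t1.login_date, '+1 day')
LEFT JOIN logins t3
  ON t3.user_id = t2.user_id
 AND t3.login_date = date(t2.login_date, '+1 day')
GROUP BY t0.login_date
ORDER BY t0.login_date
"""


def consecutive_retention(logins):
    """Return (login_date, d0, d1, d2, d3) rows counting unbroken streaks."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logins (user_id TEXT, login_date TEXT)")
    rows = [(user, day) for day, users in logins.items() for user in users]
    conn.executemany("INSERT INTO logins VALUES (?, ?)", rows)
    result = conn.execute(CONSECUTIVE_QUERY).fetchall()
    conn.close()
    return result


STREAKS = consecutive_retention(LOGINS)
for row in STREAKS:
    print(row)
```

On the sample data, d2 for 2019-11-01 drops from 4 to 2: users 002 and 005 returned on 2019-11-03 but skipped 2019-11-02, so only 003 and 004 have an unbroken streak.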
