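python - How to classify returning users into time buckets based on visit frequency
Question
I have a visit_log table that tracks multiple user_visit events for an application. It has user_id and timestamp as columns. I have been able to aggregate the user_visit data, so I now know in how many distinct weeks, biweeks, months, quarters, and years each user visited the app, using the following logic: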
SELECT user_id
,count( distinct date_part('year', timestamp) * 100 + date_part('week', timestamp)) as week
,count( distinct date_part('year', timestamp) * 100 + ceiling(date_part('week', timestamp)::decimal/2)) as biweek
,count( distinct date_part('year', timestamp) * 100 + date_part('month', timestamp)) as month
,count( distinct date_part('year', timestamp) * 100 + ceiling(date_part('month', timestamp)::decimal/3)) as quarter
,count( distinct date_part('year', timestamp)) as year
FROM visit_logs
GROUP BY user_id
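For anyone who wants the same counts outside the database, here is a minimal Python sketch of the counting logic above. The function name count_periods and the in-memory list of (user_id, datetime) pairs are assumptions made for illustration, and isocalendar() stands in for date_part('week', ...), which also returns the ISO week number.

```python
from collections import defaultdict
from datetime import datetime
from math import ceil


def count_periods(visits):
    """Count distinct weeks, biweeks, months, quarters and years per user.

    `visits` is an iterable of (user_id, datetime) pairs (hypothetical input).
    """
    keys = defaultdict(lambda: defaultdict(set))
    for user_id, ts in visits:
        week = ts.isocalendar()[1]  # ISO week number
        k = keys[user_id]
        # Same composite keys as the SQL: year * 100 + period number.
        k["week"].add(ts.year * 100 + week)
        k["biweek"].add(ts.year * 100 + ceil(week / 2))
        k["month"].add(ts.year * 100 + ts.month)
        k["quarter"].add(ts.year * 100 + ceil(ts.month / 3))
        k["year"].add(ts.year)
    return {u: {name: len(s) for name, s in k.items()} for u, k in keys.items()}


visits = [
    (1, datetime(2023, 1, 2)),
    (1, datetime(2023, 1, 3)),
    (1, datetime(2023, 1, 10)),
    (1, datetime(2023, 4, 5)),
    (2, datetime(2023, 6, 15)),
]
counts = count_periods(visits)
print(counts[1])
```

Like the SQL, this combines the calendar year with the ISO week, so visits in the first or last days of a year can land under a slightly unexpected key.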
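Now I want to go a step further and classify users as weekly, biweekly, monthly, quarterly, yearly, or rarely returning, applying an 80% threshold to every bucket. To clarify: if a user returns to the app in more than 80% of the weeks in the selected time range, they are a weekly user, and so on.
Below is the SQL code I wrote while trying to take this a step further and replicate the logic above in Python: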
, temp AS
(
SELECT
user_id
,SUM(week) * 0.8 as total_weeks
,SUM(biweek) * 0.8 as total_biweek
,SUM(month) * 0.8 as total_months
,SUM(quarter) * 0.8 as total_quarters
,SUM(year) * 0.8 as total_year
FROM time_count
GROUP BY 1
)
,week_count as
(
SELECT CASE WHEN week > total_weeks THEN 'Weekly'
WHEN biweek > total_biweek THEN 'Biweekly'
WHEN month > total_months THEN 'Monthly'
WHEN quarter > total_quarters THEN 'quarterly'
WHEN year > total_year THEN 'yearly'
ELSE 'rarely'
END as time_bucket
FROM time_count
LEFT JOIN temp ON temp.user_id = time_count.user_id
)
SELECT * FROM week_count
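I'm not sure where I went wrong, but it doesn't seem to work as expected. Any suggestions or help would be appreciated.
Answer
I find the question a little confusing. I would approach this by aggregating per user and comparing each count with the number of periods elapsed since the user's first visit. Here is an example with the time frames built in: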
select user_id,
       (case when num_weeks > 0.8 * total_weeks then 'weekly'
             when num_months > 0.8 * total_months then 'monthly'
             when num_years > 0.8 * total_years then 'yearly'
             else 'rarely'
        end) as frequency
from (select user_id,
             -- distinct periods in which the user visited
             count(distinct date_trunc('week', timestamp)) as num_weeks,
             count(distinct date_trunc('month', timestamp)) as num_months,
             count(distinct date_trunc('year', timestamp)) as num_years,
             -- periods elapsed between the user's first visit and today
             extract(day from current_date - min(timestamp)) / 7 as total_weeks,
             extract(year from age(current_date, min(timestamp))) * 12 + extract(month from age(current_date, min(timestamp))) as total_months,
             extract(year from age(current_date, min(timestamp))) as total_years
      from visit_logs vl
      group by user_id
     ) vl;
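Since the question also mentions Python, the same classification can be sketched there. This is only an approximation of the query above, not a drop-in replacement: the function name classify, the today argument, and the sample visit lists are assumptions for illustration, and the month arithmetic mimics age() only roughly.

```python
from datetime import datetime, timedelta


def classify(visits, today, threshold=0.8):
    """Classify one user from a list of visit datetimes (hypothetical helper)."""
    first = min(visits)
    # Distinct periods in which the user was active.
    num_weeks = len({ts.isocalendar()[:2] for ts in visits})
    num_months = len({(ts.year, ts.month) for ts in visits})
    num_years = len({ts.year for ts in visits})
    # Periods elapsed between the first visit and today.
    total_weeks = (today - first).days / 7
    total_months = (today.year - first.year) * 12 + (today.month - first.month)
    if (today.day, today.time()) < (first.day, first.time()):
        total_months -= 1  # the current month is not complete yet
    total_years = total_months // 12
    if num_weeks > threshold * total_weeks:
        return "weekly"
    if num_months > threshold * total_months:
        return "monthly"
    if num_years > threshold * total_years:
        return "yearly"
    return "rarely"


# Nine consecutive Mondays starting 2023-01-02.
weekly_visits = [datetime(2023, 1, 2) + timedelta(weeks=i) for i in range(9)]
# The 15th of every month in 2022.
monthly_visits = [datetime(2022, m, 15) for m in range(1, 13)]
rare_visits = [datetime(2020, 5, 1)]

print(classify(weekly_visits, datetime(2023, 3, 1)))
print(classify(monthly_visits, datetime(2023, 1, 1)))
print(classify(rare_visits, datetime(2023, 3, 1)))
```

The denominators here are periods elapsed since the first visit, not the user's own totals. If time_count in the question holds one row per user, SUM(week) * 0.8 is just 80% of that user's own week count, so week > total_weeks would hold for every active user, which may be why that query does not behave as expected.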