Determine the duplication rate per group

Problem description

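Edit: the earlier sample data included a duplicate-visits column, which is something I need the solution to calculate.

I am trying to determine the following: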

total_visits = total visits per website per sub_group

duplicate_visits = visits - 1 for each user, summed per website per sub_group

duplication_rate = duplicate_visits/ total_visits

distinct_users_subgroup = distinct users per website per sub_group

distinct_users_total = distinct users per website

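The sample data below is, I hope, representative. The table is grouped at the website and sub_group level and gives the number of visits for each user. How can I write a query that produces the result below?

Data: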

+---------+---------+-----------+--------+
| website | user_id | sub_group | visits |
+---------+---------+-----------+--------+
| web1    |       1 | A         |      1 |
| web1    |       1 | B         |      3 |
| web1    |       1 | C         |      4 |
| web1    |       1 | D         |      1 |
| web1    |       2 | A         |      3 |
| web1    |       2 | B         |      4 |
| web1    |       3 | B         |      2 |
| web1    |       3 | C         |      1 |
| web1    |       3 | D         |      1 |
| web2    |       1 | A         |      3 |
| web2    |       1 | B         |      1 |
| web2    |       1 | C         |      2 |
| web2    |       4 | B         |      1 |
| web2    |       4 | D         |      2 |
| web2    |       5 | C         |      3 |
| web2    |       5 | D         |      4 |
+---------+---------+-----------+--------+

Result:

+---------+-----------+--------------+------------------+----------------+-------------------------+----------------------+
| website | sub_group | total_visits | duplicate_visits | duplicate_rate | distinct_users_subgroup | distinct_users_total |
+---------+-----------+--------------+------------------+----------------+-------------------------+----------------------+
| web1    | A         |            4 |                2 |           0.50 |                       2 |                    3 |
| web1    | B         |            9 |                6 |           0.67 |                       3 |                    3 |
| web1    | C         |            5 |                3 |           0.60 |                       2 |                    3 |
| web1    | D         |            2 |                0 |           0.00 |                       2 |                    3 |
| web2    | A         |            3 |                2 |           0.67 |                       1 |                    3 |
| web2    | B         |            2 |                0 |           0.00 |                       2 |                    3 |
| web2    | C         |            5 |                3 |           0.60 |                       2 |                    3 |
| web2    | D         |            6 |                4 |           0.67 |                       2 |                    3 |
+---------+-----------+--------------+------------------+----------------+-------------------------+----------------------+

Tags: sql, hive

Solution


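Hi statq, getting everything back in a single table can be a bit messy, so I will do it in parts. At the moment I only have an Access database to run queries on, so I will keep the SQL very simple (depending on the system you use, there may be other SQL constructs that make the code more compact):

=> The table is called "teste"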

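The first query gets every column except the last two. Each row holds one user's visits within a sub_group, so Count(*) is the number of users and Sum(visits) - Count(*) is the number of duplicate visits. If your engine truncates integer division, multiply by 1.0 before dividing:

SELECT website,
       sub_group,
       Sum(visits) AS total_visits,
       Sum(visits) - Count(*) AS duplicate_visits,
       1 - Count(*)/Sum(visits) AS duplicate_rate
FROM teste
GROUP BY website, sub_group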

The second query gets the data for the second-to-last column:

SELECT website, 
       sub_group, 
       count(user_id) AS distinct_users_subgroup
FROM (SELECT website, 
             sub_group, 
             user_id 
      FROM teste
      GROUP BY website, sub_group, user_id) AS t
GROUP BY website, sub_group

The third query gets the last column:

SELECT website, 
       Count(user_id) AS distinct_users_total 
FROM (SELECT website, 
             user_id 
      FROM teste
      GROUP BY website, user_id) AS t
GROUP BY website

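These three results can be combined into a single result set, although it looks very messy in the simplified SQL database I have at hand (I could not test this part, so there may be typos):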

SELECT a.*,
       b.distinct_users_subgroup,
       c.distinct_users_total
FROM (SELECT website,
             sub_group,
             Sum(visits) AS total_visits,
             Sum(visits) - Count(*) AS duplicate_visits,
             1 - Count(*)/Sum(visits) AS duplicate_rate
      FROM teste
      GROUP BY website, sub_group) AS a
INNER JOIN (SELECT website,
                   sub_group,
                   Count(user_id) AS distinct_users_subgroup
            FROM (SELECT website,
                         sub_group,
                         user_id
                  FROM teste
                  GROUP BY website, sub_group, user_id) AS t
            GROUP BY website, sub_group) AS b ON a.website = b.website AND a.sub_group = b.sub_group
INNER JOIN (SELECT website,
                   Count(user_id) AS distinct_users_total
            FROM (SELECT website,
                         user_id
                  FROM teste
                  GROUP BY website, user_id) AS t
            GROUP BY website) AS c ON a.website = c.website
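
As a sanity check, the sketch below loads the sample data from the question into an in-memory SQLite database through Python's sqlite3 module and runs a more compact variant of the combined query. It assumes an engine that supports COUNT(DISTINCT ...), which Hive does and Access does not. The names DATA and QUERY, the ROUND call, and the use of SQLite as a stand-in engine are assumptions made for illustration and are not part of the original answer.

```python
import sqlite3

# Sample rows from the question: (website, user_id, sub_group, visits).
DATA = [
    ("web1", 1, "A", 1), ("web1", 1, "B", 3), ("web1", 1, "C", 4),
    ("web1", 1, "D", 1), ("web1", 2, "A", 3), ("web1", 2, "B", 4),
    ("web1", 3, "B", 2), ("web1", 3, "C", 1), ("web1", 3, "D", 1),
    ("web2", 1, "A", 3), ("web2", 1, "B", 1), ("web2", 1, "C", 2),
    ("web2", 4, "B", 1), ("web2", 4, "D", 2), ("web2", 5, "C", 3),
    ("web2", 5, "D", 4),
]

# Assumes one row per (website, user_id, sub_group), as in the sample data,
# so COUNT(*) within a group equals the number of users in that group.
QUERY = """
SELECT t.website,
       t.sub_group,
       SUM(t.visits) AS total_visits,
       SUM(t.visits) - COUNT(*) AS duplicate_visits,
       ROUND(1.0 * (SUM(t.visits) - COUNT(*)) / SUM(t.visits), 2) AS duplicate_rate,
       COUNT(DISTINCT t.user_id) AS distinct_users_subgroup,
       w.distinct_users_total
FROM teste AS t
JOIN (SELECT website,
             COUNT(DISTINCT user_id) AS distinct_users_total
      FROM teste
      GROUP BY website) AS w
  ON t.website = w.website
GROUP BY t.website, t.sub_group, w.distinct_users_total
ORDER BY t.website, t.sub_group
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE teste ("
    "website TEXT, user_id INTEGER, sub_group TEXT, visits INTEGER)"
)
conn.executemany("INSERT INTO teste VALUES (?, ?, ?, ?)", DATA)

# One tuple per (website, sub_group), in the column order of the result table.
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The per-website count is joined back in because the last column is computed at a coarser grain than the rest. The multiplication by 1.0 guards against integer division, which truncates in SQLite.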
