首页 > 解决方案 > 分层随机样本以匹配 BigQuery 中的不同表

问题描述

这应该是这个问题的简单扩展,但我的结果不正确,我想不通。我希望我从中绘制的表格中的比例与另一张表格的比例相匹配。我也想把它分为两类。我认为它应该是这样的:

WITH table AS (
  SELECT *
  FROM `another_table` a
), table_stats AS (
  SELECT *, SUM(c) OVER() total 
  FROM (
    SELECT cat1, cat2, COUNT(*) c 
    FROM table
    GROUP BY cat1, cat2
    HAVING c>1000000)
)

SELECT COUNT(*) samples, cat1, cat2, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
  SELECT id, cat1, cat2, c  
  FROM table `fh-bigquery.reddit_comments.2018_09`
  JOIN table_stats b
  USING(cat1, cat2)
  WHERE RAND()< 1000/total
)
GROUP BY 2, 3

这应该提供大约 1000 行,但结果要高得多,并且百分比计算已关闭。

标签: sqlgoogle-bigquery

解决方案


我认为你的rand()比较是关闭的:

WITH table AS (
      SELECT a.*
      FROM `another_table` a
     ),
    table_stats AS (
       SELECT cc.*, SUM(c) OVER () as total 
       FROM (SELECT cat1, cat2, COUNT(*) as c 
             FROM table
             GROUP BY cat1, cat2
             HAVING c > 1000000
            ) cc
      )
SELECT COUNT(*) as num_samples, cat1, cat2, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (SELECT id, cat1, cat2, c  
      FROM (select t.*, COUNT(*) OVER () as t_total,
                   COUNT(*) OVER (PARTITION BY cat1, cat2) as tcc_total
            from table `fh-bigquery.reddit_comments.2018_09` t
           ) t JOIN
           table_stats b
           USING (cat1, cat2)
      WHERE RAND() < (1000.0 / t.t_total) * (c / total) / (tcc_total / t_total)
     ) t
GROUP BY 2, 3;

请注意,您需要第二个表的总大小才能使样本大小(大约)正确。

这也是随机的。如果你真的想要一个分层样本,那么你应该在一个订单集上做第 n 个样本。如果您对此感兴趣,请提出一个问题,并附上适当的样本数据、期望的结果和解释。


推荐阅读