首页 > 解决方案 > 如何使用种子生成没有交叉的样本?

问题描述

我正在尝试使用 SAMPLE() 和 SEED() 生成唯一的数据样本。这意味着样本不应相互交叉。每个样本都应该是唯一的。但我没有达到我的目标,因为当我使用这种方法时,Snowflake 生成的样本与前一个样本有 10% 的交集:

CREATE TABLE employee_sample_10_1 as 
SELECT employee_id FROM employee SAMPLE(10) SEED(1);

CREATE TABLE employee_sample_10_2 as 
SELECT employee_id FROM employee SAMPLE(10) SEED(2);

CREATE TABLE employee_sample_10_3 as 
SELECT employee_id FROM employee SAMPLE(10) SEED(3);

CREATE TABLE employee_sample_10_4 as 
SELECT employee_id FROM employee SAMPLE(10) SEED(4);

CREATE TABLE employee_sample_10_5 as 
SELECT employee_id FROM employee SAMPLE(10) SEED(5);

CREATE TABLE employee_sample_10_6 as 
SELECT employee_id FROM employee SAMPLE(10) SEED(6);

CREATE TABLE employee_sample_10_7 as 
SELECT employee_id FROM employee SAMPLE(10) SEED(7);

CREATE TABLE employee_sample_10_8 as 
SELECT employee_id FROM employee SAMPLE(10) SEED(8);

CREATE TABLE employee_sample_10_9 as 
SELECT employee_id FROM employee SAMPLE(10) SEED(9);

CREATE TABLE employee_sample_10_0 as 
SELECT employee_id FROM employee SAMPLE(10) SEED(0);

理想情况下,作为样本创建的所有这些表不应相交,并且它们的总行数应等于原始表 EMPLOYEE 中的行数。事实上,这些条件都不成立。行数不匹配,此外,每个单独的样本表将与其邻居大致相交约 10%。

SELECT employee_id FROM employee_sample_10_1 INTERSECT SELECT employee_id FROM employee_sample_10_2; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_2 INTERSECT SELECT employee_id FROM employee_sample_10_3; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_3 INTERSECT SELECT employee_id FROM employee_sample_10_4; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_4 INTERSECT SELECT employee_id FROM employee_sample_10_5; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_5 INTERSECT SELECT employee_id FROM employee_sample_10_6; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_6 INTERSECT SELECT employee_id FROM employee_sample_10_7; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_7 INTERSECT SELECT employee_id FROM employee_sample_10_8; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_8 INTERSECT SELECT employee_id FROM employee_sample_10_9; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_9 INTERSECT SELECT employee_id FROM employee_sample_10_0; --getting ~10% of data intersect, need 0% 

问题:如何使 SAMPLE() 和 SEED() 只产生唯一的值集?

标签: snowflake-cloud-data-platform

解决方案


您可以为每一行分配一个随机员工组;从该结果集创建一个临时表(或者如果您愿意,可以使用 result_scan 方法),然后相应地插入到 10 个表中,如有必要,删除用于分配组的列。

下面是使用 Snowflake 示例数据中的 CUSTOMER 表的示例:

create temp table RANDOM_EMPLOYEE_GROUPS as
select uniform(0, 9, random()) as RANDOM_GROUP, * 
from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER";

select * from RANDOM_EMPLOYEE_GROUPS limit 100;

-- Insert into the 10 tables from there.

推荐阅读