snowflake-cloud-data-platform - 如何使用种子生成没有交叉的样本?
问题描述
我正在尝试使用 SAMPLE() 和 SEED() 生成唯一的数据样本。这意味着样本不应相互交叉。每个样本都应该是唯一的。但我没有达到我的目标,因为当我使用这种方法时,Snowflake 生成的样本与前一个样本有 10% 的交集:
CREATE TABLE employee_sample_10_1 as
SELECT employee_id FROM employee SAMPLE(10) SEED(1);
CREATE TABLE employee_sample_10_2 as
SELECT employee_id FROM employee SAMPLE(10) SEED(2);
CREATE TABLE employee_sample_10_3 as
SELECT employee_id FROM employee SAMPLE(10) SEED(3);
CREATE TABLE employee_sample_10_4 as
SELECT employee_id FROM employee SAMPLE(10) SEED(4);
CREATE TABLE employee_sample_10_5 as
SELECT employee_id FROM employee SAMPLE(10) SEED(5);
CREATE TABLE employee_sample_10_6 as
SELECT employee_id FROM employee SAMPLE(10) SEED(6);
CREATE TABLE employee_sample_10_7 as
SELECT employee_id FROM employee SAMPLE(10) SEED(7);
CREATE TABLE employee_sample_10_8 as
SELECT employee_id FROM employee SAMPLE(10) SEED(8);
CREATE TABLE employee_sample_10_9 as
SELECT employee_id FROM employee SAMPLE(10) SEED(9);
CREATE TABLE employee_sample_10_0 as
SELECT employee_id FROM employee SAMPLE(10) SEED(0);
理想情况下,作为样本创建的所有这些表不应相交,并且它们的总行数应等于原始表 EMPLOYEE 中的行数。事实上,这些条件都不成立。行数不匹配,此外,每个单独的样本表将与其邻居大致相交约 10%。
SELECT employee_id FROM employee_sample_10_1 INTERSECT SELECT employee_id FROM employee_sample_10_2; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_2 INTERSECT SELECT employee_id FROM employee_sample_10_3; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_3 INTERSECT SELECT employee_id FROM employee_sample_10_4; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_4 INTERSECT SELECT employee_id FROM employee_sample_10_5; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_5 INTERSECT SELECT employee_id FROM employee_sample_10_6; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_6 INTERSECT SELECT employee_id FROM employee_sample_10_7; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_7 INTERSECT SELECT employee_id FROM employee_sample_10_8; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_8 INTERSECT SELECT employee_id FROM employee_sample_10_9; --getting ~10% of data intersect, need 0%
SELECT employee_id FROM employee_sample_10_9 INTERSECT SELECT employee_id FROM employee_sample_10_0; --getting ~10% of data intersect, need 0%
问题:如何使 SAMPLE() 和 SEED() 只产生唯一的值集?
解决方案
您可以为每一行分配一个随机员工组;从该结果集创建一个临时表(或者如果您愿意,可以使用 result_scan 方法),然后相应地插入到 10 个表中,如有必要,删除用于分配组的列。
下面是使用 Snowflake 示例数据中的 CUSTOMER 表的示例:
create temp table RANDOM_EMPLOYEE_GROUPS as
select uniform(0, 9, random()) as RANDOM_GROUP, *
from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER";
select * from RANDOM_EMPLOYEE_GROUPS limit 100;
-- Insert into the 10 tables from there.
推荐阅读
- linux - 记录成功的命令
- c - Concatenating strings in C: string buffer is not getting cleared after function call
- reactjs - 从 Next js API 上传文件到服务器
- reactjs - React useEffect 依赖项正在触发更新
- php - 基本 url 捕获未定义的 PHP codeigniter
- .net - Azure Pipeline 在 `dotnet build` 上失败,并出现错误“找不到命令或文件”
- javascript - 如何检查用户输入是否正确(用于测验)
- c# - .Net 中 MongoDB 的异步排序
- apache-kafka - 如果偏移量在 Kafka 中损坏会发生什么?
- html - CSS响应式设计:为什么显示分辨率和媒体查询值不一样?