首页 > 解决方案 > 范围内的分层样本

问题描述

我有 table_1,其中包含以下数据:

Range Start   Range End   Frequency   
10                   20          90   
20                   30          68   
30                   40         314   
40                   40         191 (here, it means we have just 40 as data point repeating 191 times) 

表_2:

group     value   
10        56.1   
10        88.3   
20        53   
20        20   
30        55   

我需要根据 table_1 的范围获取分层样本,table_2 可以有数百万行,但结果应该限制在 10k 点。

尝试以下查询:

SELECT   
    d.*   
FROM   
    (   
        SELECT   
            ROW_NUMBER() OVER(   
                                PARTITION BY group   
                                ORDER BY group   
                            ) AS seqnum,   
            COUNT(*) OVER() AS ct,   
            COUNT(*) OVER(PARTITION BY group) AS cpt,   
            group, value   
        FROM   
            table_2 d   
    ) d   
WHERE   
    seqnum < 10000 * ( cpt * 1.0 / ct )   

但对这里的分析功能使用有点困惑。

期望 10k 记录作为来自 table_2 的分层样本:

结果表:

group     value   
10       56.1   
20       53   
20        20   
30       55

标签: sqloracle

解决方案


这意味着您至少需要每组的一条记录和随机的更多记录,然后尝试以下操作:

SELECT GROUP, VALUE FROM
(SELECT T2.GROUP, T2.VALUE, 
ROW_NUMBER() 
OVER (PARTITION BY T2.GROUP ORDER BY NULL) AS RN
FROM TABLE_1 T1
JOIN TABLE_2 T2
ON(T1.RANGE = T2.GROUP))
WHERE RN = 1 OR
CASE WHEN RN > 1 
AND RN = CEIL(DBMS_RANDOM.VALUE(1,RN))
THEN 1 END = 1
FETCH FIRST 10000 ROWS ONLY;

在这里,Rownum对每个组随机取,然后如果它们满足随机条件,则结果取rownum1 和其他。rownum

干杯!!


推荐阅读