首页 > 解决方案 > 测试 (90%) 和控制 (10%) 在零售商级别和 SQL 中的细分级别划分。怎么做?

问题描述

业务背景: 简单地说,我想将我的数据集拆分为测试和控制,并在未来运行一些活动。但是当我尝试拆分它时似乎涉及很多复杂性。

数据:

RetailerCode    CID        Segment
A6005         13SVC15       High
A6005         19VDE1F       Low
A6005         1B3BD1F       Medium
A6005         1B3HB48       Medium
A6005         1B3HB49       Low
A9006         1B3HB40       High
A9006         1B3HB41       High
A9006         1B3HB43       Low
A9006         1B3HB46       Medium

我有一个像这样的主表,其中包含客户及其零售商、细分市场等的列表(它有 30 多个列,但我不想在这里显示所有内容)。我想通过添加一列来拆分这个数据集,比如说“test_control”,它将告诉我们特定行是测试还是控制类别。我可以随机拆分,但我需要遵循以下规则,

-   If a customer is tied to two or more retailers, then he should be in control group
-   Each retailer will be provided with list of customers to target for the campaigns and the retailer will run the campaign. Here, 
o   Test-Control split should be done at Retailer level and then at segment level. For example,  For each retailer
   10% of their High customers to control and remaining 90% of their high customers to test.
   10% of their Medium customers to control and remaining 90% of their Medium customers to test
   10% of their Low customers to control and remaining 90% of their Low customers to test.

我可以分享我写的查询,但这会让你们完全困惑

但这是我遵循的高级方法

-   Took out the records from the main table where customer occurred more than once and added a “test_control” column with “control” in it and saved it in a separate temp table 1
-   Took out the records from the main table where customer occurred only once and saved it in a temp table 2
-   On table 2, I tried to split to 10% (control) and 90% (test)
-   Used union all to merge table 1 and table 2. But the ouput is totally wrong   

预期产出

在此处输入图像描述

如您所见,它在细分级别也分为 10% 和 90%

请有人帮我解决这个问题。过去几天一直在努力,但没有运气

提前致谢

我写的查询

--Tagging each row whether it is repated more than twice or not
select * into #Repeat from (
SELECT CID, Count(*) as number,
case when Count(*)>1 then '1'
 else '0'
end as repeat
FROM #Target GROUP BY CID
) temp


--Joining the above table to the master table and creating a new table
SELECT * into #T from(
select a.*,b.repeat from #Target a
left join #Repeat b ON a.CID = b.CID
) temp
alter table #T
add t_c Varchar(400)
UPDATE #T 
SET     t_c =  CASE  
                        WHEN repeat = '1' THEN 'control' 
                        ELSE repeat
                    END
--creating a sub table which has no repetitive customer
SELECT * into #T1
from #T where t_c <> 'control'

--creating a sub table which has repetitive customer
select * from #T
SELECT * into #T2
from #T where t_c = 'control'



--splitting the table(#T1) into test and control 
select * into #T3 from
(select *, 
       (case 
             when row_number() over (partition by RetailerCode,SEGMENT order by newid()) <= ((1-0.9-((select count(*) from #Target)/(select count(*) from #T2))) * count(*) over (partition by RetailerCode,SEGMENT)) then 'control'
             else 'test'
        end) as t_c_new
from #T1
where RETAILERCODE IN (SELECT DISTINCT RETAILERCODE FROM #Target
WHERE CID IN(
SELECT CID FROM
(SELECT  CID, COUNT(*) AS NoOfOccurrences
FROM #Target GROUP BY CID
HAVING COUNT(*) = 1 ) temp
))
)temp1


--renaming a column 
EXEC tempdb.sys.sp_rename N'#T2.t_c', N't_c_new', N'COLUMN';

ALTER TABLE #T3 DROP COLUMN t_c

--Merging the output 
select * into #T4 from
(SELECT * from #T3 --1085612
union all
select * from #T2 -- 89622
) temp

--QA check…this is where I found out my logic is wrong

select RetailerCode, t_c_new, Segment
 from
#T4 group by RetailerCode, t_c_new, Segment
order by RetailerCode, t_c_new, Segment

标签: sql-servertsql

解决方案


如果您想将客户分配到特定组,您应该首先订购他们:

SELECT RetailerCode, CID, Segment,
   CASE WHEN Percent_Rank()
              Over (PARTITION BY retailercode, segment -- for each retailer/segment
                    ORDER BY ControlGroup, newid()     -- all customers with multiple retailers are sorted low, i.e. will be in control group (if it's less than 10%)
                   ) <= 0.1 
             THEN 'control'
             ELSE 'test'
   END AS GROUP
FROM
 (
   SELECT t.*,
      -- flag customers to be put in control group
      CASE WHEN Min(RetailerCode) Over (PARTITION BY CID)
              = Max(RetailerCode) Over (PARTITION BY CID)
           THEN 1 -- only a single retailer
           ELSE 0 -- multiple retailers 
      END AS ControlGroup
-- if the RetailerCode/CID combination is unique:
--      CASE WHEN Count(*) Over (PARTITION BY CID) = 1
--           THEN 1 -- only a single retailer
--           ELSE 0 -- multiple retailers 
--      END AS ControlGroup
   FROM tab t
 ) AS dt;

推荐阅读