sql - Iterated sub sampling against distinct values, union results
Problem description
I made a SQL fiddle here
I have a table that has, for each row, a category, a document id and a ranking.
The categories are ranked within themselves. For each category, I would like to select a sub sample. All the sub samples should be stacked together in a table.
The catch is that I would like to sub sample by repeatedly halving the row index within each category, e.g. if a given category has 32 items, then I would like to fetch rows 32, 16, 8, 4, 2 and 1.
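The halving described above can be sketched on its own as a recursive CTE. This is only an illustration: the variable @n and the names Halving and RowIdx are hypothetical and do not come from the fiddle.

```sql
-- Sketch: list the row indexes visited by repeated integer halving.
-- @n is a hypothetical starting count chosen for illustration.
DECLARE @n int = 32;

WITH Halving (RowIdx) AS
(
    SELECT @n                 -- anchor: start at the last row
    UNION ALL
    SELECT RowIdx / 2         -- int / int truncates in T-SQL
    FROM Halving
    WHERE RowIdx > 1          -- stop once index 1 has been emitted
)
SELECT RowIdx
FROM Halving
ORDER BY RowIdx DESC;
-- returns 32, 16, 8, 4, 2, 1
```

For a count that is not a power of two, integer division rounds down, so 19 would give 19, 9, 4, 2, 1.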
In my SQL fiddle I was able to do this for one particular category but I can't figure out how to:
a) do it for all categories in [Major Focus Area]
b) union the resulting subsamples into one table
Any hints or help are much appreciated! I am working in T-SQL (MS SQL Server).
Sample data (MS SQL Server):
CREATE TABLE Rank_MajorAreas
([Rank] int, [Major Focus Area] varchar(17), [ID] int)
;
INSERT INTO Rank_MajorAreas
([Rank], [Major Focus Area], [ID])
VALUES
(1, 'Welfare', 71366),
(2, 'Welfare', 70415),
(3, 'Truck Driving', 70423),
(4, 'Peasant''s Office', 74566),
(5, 'Peasant''s Office', 71560),
(6, 'Nail Therapy', 77497),
(7, 'Truck Driving', 76193),
(8, 'Truck Driving', 79226),
(9, 'Truck Driving', 70222),
(10, 'Welfare', 77336),
(11, 'Truck Driving', 70823),
(12, 'Welfare', 77096),
(13, 'Welfare', 71335),
(14, 'Nail Therapy', 73551),
(15, 'Welfare', 72146),
(16, 'Truck Driving', 74023),
(17, 'Welfare', 71546),
(18, 'Nail Therapy', 74755),
(19, 'Peasant''s Office', 77834),
(20, 'Welfare', 75667),
(21, 'Peasant''s Office', 71342),
(22, 'Peasant''s Office', 77457),
(23, 'Peasant''s Office', 77923),
(24, 'Welfare', 76508),
(25, 'Welfare', 75714),
(26, 'Welfare', 73654),
(27, 'Welfare', 75753),
(28, 'Truck Driving', 71481),
(29, 'Truck Driving', 79424),
(30, 'Peasant''s Office', 76143),
(31, 'Truck Driving', 74076),
(32, 'Nail Therapy', 78714),
(33, 'Nail Therapy', 79924),
(34, 'Welfare', 71482),
(35, 'Welfare', 70050),
(36, 'Welfare', 76053),
(37, 'Nail Therapy', 79591),
(38, 'Peasant''s Office', 75197),
(39, 'Nail Therapy', 74104),
(40, 'Welfare', 72891),
(41, 'Truck Driving', 73621),
(42, 'Peasant''s Office', 71713),
(43, 'Welfare', 71979),
(44, 'Peasant''s Office', 71601),
(45, 'Peasant''s Office', 73928),
(46, 'Nail Therapy', 71759),
(47, 'Nail Therapy', 70379),
(48, 'Welfare', 71215),
(49, 'Truck Driving', 70908),
(50, 'Welfare', 71989)
;
Code thus far:
CREATE VIEW MFA AS
SELECT ROW_NUMBER() OVER(ORDER BY fa.[Rank] ASC) AS Row
,*
FROM Rank_MajorAreas AS fa
-- ideally we could make a view per Focus Area
WHERE fa.[Major Focus Area] = 'Welfare'
ORDER BY Row ASC
OFFSET 0 ROWS;
DECLARE @start int
SELECT @start = (SELECT COUNT(*) FROM MFA)
;WITH Sample( Row ) AS
(
Select @start as Row
UNION ALL
SELECT ROUND(Row/2, 0)
FROM Sample
WHERE Row > 0
)
SELECT * FROM MFA AS mfa
INNER JOIN Sample AS s on s.Row = mfa.Row
ORDER BY mfa.Row ASC
Desired results, where each focus area is subsampled and the subsamples are returned together as a single result:
Row  Rank  Major Focus Area  ID
1    1     Welfare           71366
2    2     Welfare           70415
4    12    Welfare           77096
9    24    Welfare           76508
19   50    Welfare           71989
...
1    6     Nail Therapy      77497
2    14    Nail Therapy      73551
4    32    Nail Therapy      78714
9    47    Nail Therapy      70379
Solution
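You need to use PARTITION BY on the [Major Focus Area] column in the OVER clause, so that the row numbers restart for each focus area.
Here is the modified T-SQL: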
CREATE VIEW MFA AS
-- Number the rows within each focus area, ordered by rank
SELECT ROW_NUMBER() OVER(PARTITION BY fa.[Major Focus Area] ORDER BY fa.[Rank] ASC) AS Row
,*
FROM Rank_MajorAreas AS fa
ORDER BY [Major Focus Area], Row ASC
OFFSET 0 ROWS;
-- CREATE VIEW must be the only statement in its batch; run the query below as a separate batch
;WITH Sample( Row, fa ) AS
(
-- Anchor: start each focus area at its last row (its row count)
Select COUNT(*) as Row, [Major Focus Area] as fa FROM MFA GROUP BY [Major Focus Area]
UNION ALL
-- Recursive step: halve the row index (integer division, so ROUND leaves it unchanged)
SELECT ROUND(Row/2, 0), fa
FROM Sample
-- The final Row = 0 produced here matches nothing in the join below
WHERE Row > 0
)
SELECT mfa.Row, mfa.[Rank], mfa.[Major Focus Area], mfa.[ID] FROM MFA AS mfa
INNER JOIN Sample AS s on s.Row = mfa.Row and s.fa=mfa.[Major Focus Area]
ORDER BY mfa.[Major Focus Area], mfa.Row ASC
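If creating a view is not desirable, the same idea can probably be written as a single statement with two CTEs. The following is a minimal sketch under that assumption, not part of the original answer; the names Numbered, Picks, RowIdx and Area are hypothetical, and only the table and columns from the sample data are used.

```sql
-- Sketch: per-category halving sub-sample without a view.
WITH Numbered AS
(
    -- Number the rows within each focus area, ordered by rank
    SELECT ROW_NUMBER() OVER (PARTITION BY [Major Focus Area]
                              ORDER BY [Rank]) AS RowIdx,
           [Rank], [Major Focus Area], [ID]
    FROM Rank_MajorAreas
),
Picks (RowIdx, Area) AS
(
    -- Anchor: one starting index per focus area (its row count)
    SELECT COUNT(*), [Major Focus Area]
    FROM Rank_MajorAreas
    GROUP BY [Major Focus Area]
    UNION ALL
    -- Recursive step: halve the index until it reaches 1
    SELECT RowIdx / 2, Area
    FROM Picks
    WHERE RowIdx > 1
)
SELECT n.RowIdx, n.[Rank], n.[Major Focus Area], n.[ID]
FROM Numbered AS n
INNER JOIN Picks AS p
    ON p.RowIdx = n.RowIdx
   AND p.Area = n.[Major Focus Area]
ORDER BY n.[Major Focus Area], n.RowIdx;
```

Because every focus area gets its own anchor row, the join already returns all the subsamples stacked in one result set, so no explicit UNION across categories is needed.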