Understanding window functions to remove duplicate records while preserving true changes

Problem description

I asked this question on DBA Stack Exchange but didn't have any luck, so I'm cross-posting it here.

SQLFIDDLE

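I'm close to solving this, but I've hit a wall. I'm trying to understand an article by Aaron Bertrand and apply it to a situation of my own: because of a design mistake I inherited, I have a change table with a large number of duplicates. The sample dataset is conceptually identical to my real dataset, except that SortOrder is normally a datetime value rather than an integer. The code I tried is here: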

; with main as (
   select *, ROW_NUMBER() over (partition by ID, Val, sortorder order by ID,
      SortOrder) as "Rank",
      row_number() over (partition by ID, val order by ID, sortorder) as "s_rank" 
   from 
      (values (1, 'A', 1), (1, 'A', 1), (1, 'B', 2), (1, 'C', 3), (1, 'B', 4),
              (1, 'A', 5), (1, 'A', 5), (2, 'A', 1), (2, 'B', 2), (2, 'A', 3), 
              (3, 'A', 1), (3, 'A', 1), (3, 'A', 2)
      ) as x("ID", "VAL", "SortOrder")
   group by id, val, SortOrder
   --order by ID, "SortOrder"
),
cte_rest as (
   select *
   from main
   where "s_rank" > 1
)

select *
from main
left join cte_rest rest
   on main.id = rest.id
   and main.s_rank > 1
   and main.SortOrder = rest.SortOrder
--where not exists (select 1 from cte_rest r where r.id = main.id and r.val <> main.VAL and main.s_rank < s_rank)
order by main.ID, main.SortOrder

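The result is almost right; however, the last row highlights a case I failed to account for: the date changes but the value does not. I want that last record excluded, because it is not a true value change.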

╔════╦═════╦═══════════╦══════╦════════╦══════╦══════╦═══════════╦══════╦════════╗
║ ID ║ VAL ║ SortOrder ║ Rank ║ s_rank ║  ID  ║ VAL  ║ SortOrder ║ Rank ║ s_rank ║
╠════╬═════╬═══════════╬══════╬════════╬══════╬══════╬═══════════╬══════╬════════╣
║  1 ║ A   ║         1 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  1 ║ B   ║         2 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  1 ║ C   ║         3 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  1 ║ B   ║         4 ║    1 ║      2 ║ 1    ║ B    ║ 4         ║ 1    ║ 2      ║
║  1 ║ A   ║         5 ║    1 ║      2 ║ 1    ║ A    ║ 5         ║ 1    ║ 2      ║
║  2 ║ A   ║         1 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  2 ║ B   ║         2 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  2 ║ A   ║         3 ║    1 ║      2 ║ 2    ║ A    ║ 3         ║ 1    ║ 2      ║
║  3 ║ A   ║         1 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  3 ║ A   ║         2 ║    1 ║      2 ║ 3    ║ A    ║ 2         ║ 1    ║ 2      ║
╚════╩═════╩═══════════╩══════╩════════╩══════╩══════╩═══════════╩══════╩════════╝

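A colleague of mine suggested the code below. While I can follow how it arrives at the result, I don't understand why the first code sample doesn't work. It also seems to me that this approach requires a lot of extra parsing, and I'd be concerned about the performance impact on large datasets.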


WITH cte1
     AS (SELECT [id]
              , [val]
              , [sortorder]
              , ROW_NUMBER() OVER(PARTITION BY [id]
                                             , [val]
                                             , [sortorder]
                ORDER BY [id]
                       , [sortorder]) AS "rankall"
         FROM   (VALUES
                        ( 1, 'A', 1 ),
                        ( 1, 'A', 1 ),
                        ( 1, 'B', 2 ),
                        ( 1, 'C', 3 ),
                        ( 1, 'B', 4 ),
                        ( 1, 'A', 5 ),
                        ( 1, 'A', 5 ),
                        ( 2, 'A', 1 ),
                        ( 2, 'B', 2 ),
                        ( 2, 'A', 3 ),
                        ( 3, 'A', 1 ),
                        ( 3, 'A', 1 ),
                        ( 3, 'A', 2 )) AS x("id", "val", "sortorder")),
     ctedropped
     AS (SELECT [id]
              , [val]
              , [sortorder]
              , ROW_NUMBER() OVER(PARTITION BY [id]
                                             , [val]
                                             , [sortorder]
                ORDER BY [id]
                       , [sortorder]) AS "rankall"
         FROM   cte1
         WHERE  [cte1].[rankall] > 1)
     SELECT [cte1].[id]
          , [cte1].[val]
          , [cte1].[sortorder]
     FROM   cte1
     WHERE  NOT EXISTS
     (
         SELECT *
         FROM   [ctedropped]
         WHERE  [cte1].[id] = [ctedropped].[id] AND 
                [cte1].[val] = [ctedropped].[val] AND 
                [cte1].[rankall] = [ctedropped].[rankall]
     )
     ORDER BY [cte1].[id]
            , [cte1].[sortorder];

Tags: sql, sql-server

Solution


If you only want to remove rows where the value did not change, you can apply this logic:

WITH cte1 AS
 (
   SELECT [id]
        , [val]
        , [sortorder]
        , Lag(val) Over(PARTITION BY [id]
                        ORDER BY [sortorder]) AS prevval
   FROM    demo
 )
SELECT * 
FROM cte1
WHERE prevval IS NULL  -- first row
   OR prevval <> val   -- value changed
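
To try this without the fiddle, here is a minimal self-contained sketch. It assumes SQL Server 2012 or later (where LAG is available) and replaces the demo table with a CTE of the same name built from the sample rows in the question; that stand-in CTE is an assumption for illustration, not part of the original answer.

```sql
-- Stand-in for the fiddle's demo table: the sample rows from the question.
WITH demo AS (
    SELECT id, val, sortorder
    FROM (VALUES (1, 'A', 1), (1, 'A', 1), (1, 'B', 2), (1, 'C', 3), (1, 'B', 4),
                 (1, 'A', 5), (1, 'A', 5), (2, 'A', 1), (2, 'B', 2), (2, 'A', 3),
                 (3, 'A', 1), (3, 'A', 1), (3, 'A', 2)
         ) AS x(id, val, sortorder)
),
cte1 AS (
    SELECT id, val, sortorder,
           -- value of the immediately preceding row within the same id
           LAG(val) OVER (PARTITION BY id ORDER BY sortorder) AS prevval
    FROM demo
)
SELECT id, val, sortorder
FROM cte1
WHERE prevval IS NULL   -- first row for this id
   OR prevval <> val    -- value differs from the previous row
ORDER BY id, sortorder;
```

With the sample data this returns nine rows: for id 1, (A, 1), (B, 2), (C, 3), (B, 4), (A, 5); for id 2, (A, 1), (B, 2), (A, 3); and for id 3 only (A, 1). The row (3, A, 2) from the question is dropped because LAG compares each row only with its immediate predecessor, whereas ROW_NUMBER() partitioned by id and val numbers every occurrence of a value no matter what came between them. Two caveats: if val can be NULL, the prevval IS NULL test no longer identifies only the first row, and if two rows for the same id share a sortorder but carry different values, their relative order (and so the result) is not deterministic.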

Fiddle
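
If the goal is to physically remove the redundant rows rather than filter them in a query, the same comparison can be applied through a CTE, since SQL Server allows a DELETE to target a CTE that reads from a single table. The following is a sketch only, assuming a table demo(id, val, sortorder) as in the fiddle; it is wrapped in a transaction that rolls back so it can be tried safely.

```sql
-- Assumes a table demo(id, val, sortorder) like the one in the fiddle.
BEGIN TRANSACTION;

WITH cte1 AS (
    SELECT val,
           LAG(val) OVER (PARTITION BY id ORDER BY sortorder) AS prevval
    FROM demo
)
DELETE FROM cte1
WHERE prevval = val;    -- same value as the previous row: not a real change

-- Inspect what remains before deciding whether to keep the change.
SELECT id, val, sortorder
FROM demo
ORDER BY id, sortorder;

ROLLBACK TRANSACTION;   -- replace with COMMIT once the result looks right
```

Because prevval = val is never true when either side is NULL, the first row of each id and any rows with a NULL value are left in place.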

