首页 > 解决方案 > 比较多个日期以识别数据库中的重复记录

问题描述

我检查并发现了大量查找重复记录的线程,但没有任何专门解决日期比较的线程。

我正在尝试使用 JOIN 来识别数据库中哪里有类似的记录。其中一些记录在某些字段中具有相同的值,但不是全部。因此,我还想考虑与其他记录相同或差异为 +1 或 +2 的日期。

所以我对“重复”的定义是Color_Type、ID_Type和Category_Cd都是一样的,Activity_Date必须与相似记录相差0-2天。

例子:

**Record 1** has a 
Activity_Date of '10/25/2021' 
Color_Type of 'Blue' 
ID_Type as '1234' Category_Cd of 'Rainbow' 
Txn_ID of '1A2B'

**Record 2** has a 
Activity_Date of '10/24/2021' 
Color_Type of 'Blue' 
ID_Type as '1234' 
Category_Cd of 'Rainbow' 
Txn_ID of '3C4D'

**Record 3** has a 
Activity_Date of '10/26/2021' 
Color_Type of 'Blue' 
ID_Type as '1234' 
Category_Cd of 'Rainbow' 
Txn_ID of '5E6F'

**Record 4** has a 
Activity_Date of '09/01/2021' 
Color_Type of 'Blue' 
ID_Type as '1234' 
Category_Cd of 'Rainbow' 
Txn_ID of '7G8H'

**Record 5** has a 
Activity_Date of '10/25/2021' 
Color_Type of 'Blue' 
ID_Type as '1234' 
Category_Cd of 'Rainbow' 
Txn_ID of '9I0J'

我希望我的查询返回记录 1、2、3 和 5,但不返回记录 4

这是我的查询,可用于查找重复项,但返回相似但不考虑我想要的日期差异的记录。

SELECT *
FROM 
    (SELECT * 
    FROM DATABASETABLE.DATABASEVIEW
    WHERE ACTIVTY_DT GE '2021-10-01'
    ) dupa  
JOIN 
    (SELECT ID_Type, Color_Type, Category_Cd, 
            COUNT (*) AS postings
    FROM DATABASETABLE.DATABASEVIEW
    WHERE ACTIVTY_DT GE '2021-10-01'
    GROUP BY 1, 2, 3
    HAVING postings = 2) dupb

    ON  dupa.ID_Type = dupb.ID_Type AND
        dupa.Color_Type = dupb.Color_Type AND
        dupa.Category_Cd = dupb.Category_Cd

ORDER BY dupa.ID_Type

对 SQL 来说真的很新(和自学),所以任何帮助都将不胜感激!!!

标签: teradata

解决方案


这似乎是您基于Windowed Aggregates的逻辑:

SELECT * 
FROM DATABASETABLE.DATABASEVIEW
WHERE ACTIVTY_DT >= DATE '2021-10-01'
QUALIFY -- current date within two days of previous date
   ACTIVTY_DT - 2
   <= 
   lag(ACTIVTY_DT)  -- prev date
   over (partition by ID_Type, Color_Type, Category_Cd -- same values in all three columns
         order by ACTIVTY_DT) 
or   -- similar condition against next row to get both rows retunrd
   ACTIVTY_DT + 2
   >= 
   lead(ACTIVTY_DT) -- next date
   over (partition by ID_Type, Color_Type, Category_Cd -- same values in all three columns
         order by ACTIVTY_DT) 

推荐阅读