首页 > 解决方案 > 希望在 2+ 列中使用 DIFFERENCE() 查找重复项

问题描述

我正在尝试编写一个 SQL Select 查询,该查询使用 DIFFERENCE() 函数在数据库中查找相似名称以识别重复项。

我正在使用的代码的简短版本是:

SELECT *, DIFFERENCE(FirstName, LEAD(FirstName)  OVER (ORDER BY SOUNDEX(FirstName))) d
WHERE d >= 3

问题是我的数据库有额外的列,其中包括中间名和昵称。因此,如果我的客户有多个名字,他们可能会多次出现在数据库中,我需要将各种列相互比较。

样本数据:

+----+--------+--------+--------+--------+
|ID  |First   |Middle  |AKA1    |AKA2    |
+----+--------+--------+--------+--------+
|1   |Sally   |Ann     |NULL    |NULL    |
|2   |Ann     |NULL    |NULL    |NULL    |
|3   |Sue     |NULL    |NULL    |NULL    |
|4   |Suzy    |NULL    |NULL    |NULL    |
|5   |Patricia|NULL    |Trish   |Patty   |
|6   |Patty   |NULL    |Patricia|Trish   |
|7   |Trish   |NULL    |Patty   |Patricia|
+----+--------+--------+--------+--------+

在上面,第 1+2 行是彼此重复的,3+4 和 5+6+7 也是如此。

所以我不确定获得我想要的东西的最佳方式。这是我实际使用的代码的较长版本:

WITH A AS (SELECT *,
    SOUNDEX(FirstName) AS "FirstSoundex",
    SOUNDEX(LastName) AS "LastSoundex",
    LAG (SOUNDEX(FirstName)) OVER (ORDER BY SOUNDEX(FirstName)) AS "PreviousFirstSoundex",
    LAG (SOUNDEX(LastName)) OVER (ORDER BY SOUNDEX(LastName)) AS "PreviousLastSoundex"
    FROM Clients),

B AS (
    SELECT *,
    ISNULL(DIFFERENCE(FirstName, LEAD(FirstName)  OVER (ORDER BY FirstSoundex)),0) AS "FirstScore",
    ISNULL(DIFFERENCE(LastName, LEAD(LastName)  OVER (ORDER BY LastSoundex)),0) AS "LastScore"
    FROM A),

C AS (
    SELECT *,
        ISNULL(LAG (FirstScore) OVER (ORDER BY FirstSoundex),0) AS "PreviousFirstScore",
        ISNULL(LAG (LastScore) OVER (ORDER BY LastSoundex),0) AS "PreviousLastScore"
    FROM B
    ),

D AS (
    SELECT *,
        (CASE WHEN (PreviousFirstScore >=3 AND PreviousLastScore >=3) THEN (PreviousFirstSoundex + PreviousLastSoundex)
            WHEN (FirstScore >= 3 AND LastScore >=3) THEN (FirstSoundex + LastSoundex)
            END) AS "GroupName"
    FROM C
    WHERE ((PreviousFirstScore >=3 AND PreviousLastScore >=3) OR (FirstScore >= 3 AND LastScore >=3))
    )

SELECT *,
        LAG(GroupName) OVER (ORDER BY GroupName) AS "PreviousGroup",
        LEAD(GroupName) OVER (ORDER BY GroupName) AS "NextGroup"
    FROM D

WHERE (D.GroupName = D.PreviousGroup OR D.GroupName = D.NextGroup)

这让我可以将潜在重复的捆绑在一起,这对我来说效果很好。但是,我现在想添加一种检查多列的方法,但我不知道该怎么做。

我正在考虑创建一个工会,例如:

    SELECT ClientID,
        LastName,
        FirstName AS "TempName"
        FROM Clients
    UNION
    SELECT ClientID,
        LastName,
        MiddleName AS "TempName"
        FROM Clients
        WHERE MiddleName IS NOT NULL
...etc

但是我的 LAG() 和 LEAD() 将无法工作,因为我会有多行具有相同的 ClientID。我不想将单个客户端标识为自身的副本。

无论如何,有什么建议吗?提前致谢。

标签: sqlsql-server

解决方案


推荐阅读