sql - 希望在 2+ 列中使用 DIFFERENCE() 查找重复项
问题描述
我正在尝试编写一个 SQL Select 查询,该查询使用 DIFFERENCE() 函数在数据库中查找相似名称以识别重复项。
我正在使用的代码的简短版本是:
SELECT *, DIFFERENCE(FirstName, LEAD(FirstName) OVER (ORDER BY SOUNDEX(FirstName))) d
WHERE d >= 3
问题是我的数据库有额外的列,其中包括中间名和昵称。因此,如果我的客户有多个名字,他们可能会多次出现在数据库中,我需要将各种列相互比较。
样本数据:
+----+--------+--------+--------+--------+
|ID |First |Middle |AKA1 |AKA2 |
+----+--------+--------+--------+--------+
|1 |Sally |Ann |NULL |NULL |
|2 |Ann |NULL |NULL |NULL |
|3 |Sue |NULL |NULL |NULL |
|4 |Suzy |NULL |NULL |NULL |
|5 |Patricia|NULL |Trish |Patty |
|6 |Patty |NULL |Patricia|Trish |
|7 |Trish |NULL |Patty |Patricia|
+----+--------+--------+--------+--------+
在上面,第 1+2 行是彼此重复的,3+4 和 5+6+7 也是如此。
所以我不确定获得我想要的东西的最佳方式。这是我实际使用的代码的较长版本:
WITH A AS (SELECT *,
SOUNDEX(FirstName) AS "FirstSoundex",
SOUNDEX(LastName) AS "LastSoundex",
LAG (SOUNDEX(FirstName)) OVER (ORDER BY SOUNDEX(FirstName)) AS "PreviousFirstSoundex",
LAG (SOUNDEX(LastName)) OVER (ORDER BY SOUNDEX(LastName)) AS "PreviousLastSoundex"
FROM Clients),
B AS (
SELECT *,
ISNULL(DIFFERENCE(FirstName, LEAD(FirstName) OVER (ORDER BY FirstSoundex)),0) AS "FirstScore",
ISNULL(DIFFERENCE(LastName, LEAD(LastName) OVER (ORDER BY LastSoundex)),0) AS "LastScore"
FROM A),
C AS (
SELECT *,
ISNULL(LAG (FirstScore) OVER (ORDER BY FirstSoundex),0) AS "PreviousFirstScore",
ISNULL(LAG (LastScore) OVER (ORDER BY LastSoundex),0) AS "PreviousLastScore"
FROM B
),
D AS (
SELECT *,
(CASE WHEN (PreviousFirstScore >=3 AND PreviousLastScore >=3) THEN (PreviousFirstSoundex + PreviousLastSoundex)
WHEN (FirstScore >= 3 AND LastScore >=3) THEN (FirstSoundex + LastSoundex)
END) AS "GroupName"
FROM C
WHERE ((PreviousFirstScore >=3 AND PreviousLastScore >=3) OR (FirstScore >= 3 AND LastScore >=3))
)
SELECT *,
LAG(GroupName) OVER (ORDER BY GroupName) AS "PreviousGroup",
LEAD(GroupName) OVER (ORDER BY GroupName) AS "NextGroup"
FROM D
WHERE (D.GroupName = D.PreviousGroup OR D.GroupName = D.NextGroup)
这让我可以将潜在重复的捆绑在一起,这对我来说效果很好。但是,我现在想添加一种检查多列的方法,但我不知道该怎么做。
我正在考虑创建一个工会,例如:
SELECT ClientID,
LastName,
FirstName AS "TempName"
FROM Clients
UNION
SELECT ClientID,
LastName,
MiddleName AS "TempName"
FROM Clients
WHERE MiddleName IS NOT NULL
...etc
但是我的 LAG() 和 LEAD() 将无法工作,因为我会有多行具有相同的 ClientID。我不想将单个客户端标识为自身的副本。
无论如何,有什么建议吗?提前致谢。
解决方案
推荐阅读
- c - 如果thread1多次获取锁,为什么thread2不能获取锁?
- java - 在 java11 上播种后,SHA1PRNG SecureRandom 行为不同
- java - Swift iOS 类中的方法和变量的数量
- java - 如何在 IntelliJ Idea 中添加远程 Maven 存储库?
- azure-devops - 如何使用命令行任务将 Azure Pipeline 变量打印到文件?
- debugging - 应用程序的发布版本未请求许可。调试版本正常工作
- sql - 空结果预期实际值
- php - codeigniter 中的分页 URL 问题
- webots - 为什么我不能在我的原型的 BallJoint 节点的“设备”字段中插入 RotationalMotor 节点?
- linux - 如何杀死已使用源执行的脚本?