sql - 减少多值连接 SQL 查询中的重复记录
问题描述
背景故事:我创建了一堆分析客户数据的存储过程。我正在查看供应商列表并尝试找出可能的重复项。它工作得很好,但每条记录都有 2 个可能的地址,当在两个地址中找到匹配项时,我会得到重复的结果。理想情况下,我只需要记录在结果中出现一次。
流程:我创建了一个“干净”版本的地址,我在其中删除了特殊字符并标准化为 USPS 标准。这有助于我匹配 West v W v W. 或 PO Box v PO Box v PO Box 等。然后我从两个地址 ([cleanAddress]
和[cleanRemit_Address]
) 中获取所有不同的地址值并放入主列表。然后,我用 a 与源表进行比较,HAVING COUNT(*) > 1
以确定哪些地址出现了多次。最后,我将出现多次的最终地址列表与源数据结合起来进行输出。
问题:如果您查看底部附近的结果,您会发现我有 2 组骗子,它们几乎相同,只是地址略有不同。Address
和本质上是相同的,因此它会在“SouthWestern Medical”和“NERO CO”的 BOTH和值Remit_Address
上找到匹配项,因此两组骗子在列表中出现两次而不是一次(请参阅底部的所需结果)。我需要匹配OR但我不知道如何限制每条记录在结果中出现一次。[cleanAddress]
[cleanRemit_Address]
[cleanAddress]
[cleanRemit_Address]
- SSMS 18
- SQL Server 2019
查询:
--SQL (Address List): Combines all addresses for a master list of all addresses in the table
SELECT * INTO [address_list] FROM (
SELECT DISTINCT [NewAdd] FROM
(
SELECT DISTINCT [cleanAddress] AS [NewAdd]
FROM [sample_data]
WHERE
( [cleanAddress] IS NOT NULL AND [cleanAddress] <> '' ) AND
( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' )
GROUP BY [cleanAddress]
UNION
SELECT DISTINCT [cleanRemit_Address] AS [NewAdd]
FROM [sample_data]
WHERE
( [cleanRemit_Address] IS NOT NULL AND [cleanRemit_Address] <> '' ) AND
( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' )
GROUP BY [cleanRemit_Address]
) q1
) q2
ORDER BY
[NewAdd]
--SQL (Address Dupes): Determines which addresses appear in the data more than once
SELECT * INTO [dupe_addresses] FROM (
SELECT [NewAdd]
FROM [address_list] n
LEFT JOIN [sample_data] pv ON
(
( n.[NewAdd] = pv.[cleanAddress] AND ( [Address] <> '' AND [Address] IS NOT NULL ) ) OR
( n.[NewAdd] = pv.[cleanRemit_Address] AND ( [Remit_Address] <> '' AND [Remit_Address] IS NOT NULL ) )
)
WHERE
( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' )
GROUP BY [NewAdd]
HAVING COUNT(*) > 1
) q1
ORDER BY [NewAdd]
--SQL (Address Query): Outputs the information of the matched addresses
SELECT
'Address Match' AS [Reason],
pv.[Supplier_No],
pv.[Name],
pv.[Address],
pv.[City],
pv.[State],
pv.[Zip],
pv.[Country],
pv.[Remit_Address],
pv.[Remit_City],
pv.[Remit_State],
pv.[Remit_Zip],
pv.[Remit_Country]
FROM
[dupe_addresses] n
LEFT JOIN [sample_data] pv
ON (
(n.[NewAdd] = pv.[cleanAddress] AND ( [Address] <> '' AND [Address] IS NOT NULL ) )
OR
(n.[NewAdd] = pv.[cleanRemit_Address] AND ( [Remit_Address] <> '' AND [Remit_Address] IS NOT NULL ) )
)
WHERE
( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' )
样本数据:
CREATE TABLE [sample_data] (
[Supplier_No] varchar(255),
[Name] varchar(255),
[Address] varchar(255),
[City] varchar(255),
[State] varchar(255),
[Zip] varchar(255),
[Country] varchar(255),
[Remit_Address] varchar(255),
[Remit_City] varchar(255),
[Remit_State] varchar(255),
[Remit_Zip] varchar(255),
[Remit_Country] varchar(255),
[cleanAddress] varchar(255),
[cleanRemit_Address] varchar(255),
CONSTRAINT [suppliers_pk] PRIMARY KEY ([Supplier_No])
)
INSERT INTO [sample_data] VALUES
('1039104','Geez Companies','100 Aero Hudson Rd','Streetsboro','OH','44241','','100 Aero Hudson Road','Streetsboro','OH','44241','USA','100 Aero Hudson Rd','100 Aero Hudson Rd'),
('1218409','SouthWestern Medical','100 West Balor Ave','Osceola','AR','72370','USA','SouthWestern Medical100 W Balor Ave','Osceola','AR','72370','USA','100 W Balor Ave','SouthWestern Medical100 W Balor Ave'),
('1243789','SouthWestern Medical','100 West Balor Ave','Osceola','AR','72370','USA','SouthWestern Medical100 West Balor Ave','Osceola','AR','72370','USA','100 W Balor Ave','SouthWestern Medical100 W Balor Ave'),
('1243636','SIRI SYSTEMS','15 BRAD ROAD','WEXFORD','PA','15090','','','','','','','15 BRAD RD',''),
('1152482','FLEETWOOD MACK','22 WINDSOCK CT','ADDISON','IL','60101','','PO BOX 951','CHICAGO','IL','60694-5124','','22 WINDSOCK CT','PO BOX 951'),
('1224483','Aerospace Junction','211500 Communicate Ave','Mingo Junction','OH','43939','USA','P O Box 99','Mingo Junction','OH','43939','USA','211500 Communicate Ave','PO Box 99'),
('1243397','Squeezy Felt','SCHREIBER DIST','NEW KENSINGTON','PA','15068','','','','','','','SCHREIBER DIST',''),
('1230895','NERO CO','28 North US State Highway 99','Osceola','AR','72370','USA','PO Box 204','Cape Girardeau','MO','63702-2045','USA','28 N US State Hwy 99','PO Box 204'),
('1243782','NERO CO','28 North US State Highway 99','Osceola','AR','72370','USA','PO Box 204','Cape Girardeau','MO','63702-2045','USA','28 N US State Hwy 99','PO Box 204'),
('1135880','RICHARD PRYOR SEMINARS','PO BOX 2194','KANSAS CITY','MO','64121-9468','USA','RICHARD PRYOR SEMINARS P O BOX 2194','KANSAS CITY','MO','64121-9468','USA','PO BOX 2194','RICHARD PRYOR SEMINARS PO BOX 2194'),
('1241328','INFINITY AND BEYOND','P.O. BOX 169','GASTONIA','NC','28053-0269','USA','','','','','','PO BOX 169',''),
('1259522','MILES STONES','PO BOX 169','GASSTONIA','NC','28053-0269','USA','','','','','','PO BOX 169',''),
('1255253','AT&T','PO Box 50221','Carol Stream','IL','60197','USA','','','','','','PO Box 50221',''),
('1135513','AT&T','PO Box 50221','Carol Stream','IL','60197-5080','USA','','','','','','PO Box 50221',''),
('1119161','Machine Co, Inc','3306 N Thorne Blvd','Chattanooga','TN','','','PO BOX 5301','CHATTANOOGA','TN','37406','USA','3306 N Thorne Blvd','PO BOX 5301'),
('1176587','Topsy Turvy','365 Welmington Road','Chicago','IL','60606','USA','','','','','','365 Welmington Rd',''),
('2156671','Topsy Turvvy, Inc.','P.O. Box 55217','Columbus','OH','43081','','365 Welmington Road','Chicago','IL','60606','USA','','365 Welmington Rd')
当前结果:
Reason Supplier_No Name Address City State Zip Country Remit_Address Remit_City Remit_State Remit_Zip Remit_Country
Address Match 1218409 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 W Balor Ave Osceola AR 72370 USA
Address Match 1243789 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 West Balor Ave Osceola AR 72370 USA
Address Match 1230895 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1243782 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1176587 Topsy Turvy 365 Welmington Road Chicago IL 60606 USA
Address Match 2156671 Topsy Turvvy, Inc. P.O. Box 55217 Columbus OH 43081 365 Welmington Road Chicago IL 60606 USA
Address Match 1241328 INFINITY AND BEYOND P.O. BOX 169 GASTONIA NC 28053-0269 USA
Address Match 1259522 MILES STONES PO BOX 169 GASSTONIA NC 28053-0269 USA
Address Match 1230895 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1243782 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1255253 AT&T PO Box 50221 Carol Stream IL 60197 USA
Address Match 1135513 AT&T PO Box 50221 Carol Stream IL 60197-5080 USA
Address Match 1218409 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA Southern Lawn Care1004 W Hale Ave Osceola AR 72370 USA
Address Match 1243789 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 West Balor Ave Osceola AR 72370 USA
期望的结果:
Reason Supplier_No Name Address City State Zip Country Remit_Address Remit_City Remit_State Remit_Zip Remit_Country
Address Match 1218409 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 W Balor Ave Osceola AR 72370 USA
Address Match 1243789 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 West Balor Ave Osceola AR 72370 USA
Address Match 1230895 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1243782 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1176587 Topsy Turvy 365 Welmington Road Chicago IL 60606 USA
Address Match 2156671 Topsy Turvvy, Inc. P.O. Box 55217 Columbus OH 43081 365 Welmington Road Chicago IL 60606 USA
Address Match 1241328 INFINITY AND BEYOND P.O. BOX 169 GASTONIA NC 28053-0269 USA
Address Match 1259522 MILES STONES PO BOX 169 GASSTONIA NC 28053-0269 USA
Address Match 1255253 AT&T PO Box 50221 Carol Stream IL 60197 USA
Address Match 1135513 AT&T PO Box 50221 Carol Stream IL 60197-5080 USA
解决方案
只需将row_number
每个供应商添加到最终结果集中并仅过滤掉第 1 行。
请注意,该row_number
函数需要一个order by
子句,用于确定您希望保留哪些重复行。改变它以适应你的情况。
WITH cte AS (
SELECT
'Address Match' AS [Reason],
pv.[Supplier_No],
pv.[Name],
pv.[Address],
pv.[City],
pv.[State],
pv.[Zip],
pv.[Country],
pv.[Remit_Address],
pv.[Remit_City],
pv.[Remit_State],
pv.[Remit_Zip],
pv.[Remit_Country]
, ROW_NUMBER() OVER (PARTITION BY pv.[Supplier_No] ORDER BY pv.[Name]) rn
FROM dupe_addresses n
LEFT JOIN sample_data pv
ON (
(n.[NewAdd] = pv.[cleanAddress] AND ( [Address] <> '' AND [Address] IS NOT NULL ))
OR (n.[NewAdd] = pv.[cleanRemit_Address] AND ( [Remit_Address] <> '' AND [Remit_Address] IS NOT NULL))
)
WHERE ([Supplier_No] IS NOT NULL AND [Supplier_No] <> '')
)
SELECT *
FROM cte
WHERE rn = 1
ORDER BY Supplier_No, [Name];
推荐阅读
- typescript - 要求语句不是导入语句的一部分。eslint@typescript-eslint/no-var-需要从存储中导入打字稿
- haskell - Haskell中的斐波那契冻结电脑?
- c - 检查两个无符号整数之间的差是否最多为 1
- java - 流畅的 try-catch 块
- python - 如何使用单个 python(.py) 脚本运行多个 python 脚本
- bash - 从 bash 返回值到 Ansible 任务
- r - 跨多列匹配行但忽略 Rstudio 中的 NA
- javascript - 使用 rxjs angular 从具有特定属性的对象数组中获取对象数组
- android - 如何从 Firebase Firestore 获取自定义对象?
- excel - 有没有办法转置某些列,同时保留其他列并在 Excel 中复制它们?