首页 > 解决方案 > 减少多值连接 SQL 查询中的重复记录

问题描述

背景故事:我创建了一堆分析客户数据的存储过程。我正在查看供应商列表并尝试找出可能的重复项。它工作得很好,但每条记录都有 2 个可能的地址,当在两个地址中找到匹配项时,我会得到重复的结果。理想情况下,我只需要记录在结果中出现一次。

流程:我创建了一个“干净”版本的地址,我在其中删除了特殊字符并标准化为 USPS 标准。这有助于我匹配 West v W v W. 或 PO Box v PO Box v PO Box 等。然后我从两个地址 ([cleanAddress][cleanRemit_Address]) 中获取所有不同的地址值并放入主列表。然后,我用 a 与源表进行比较,HAVING COUNT(*) > 1以确定哪些地址出现了多次。最后,我将出现多次的最终地址列表与源数据结合起来进行输出。

问题:如果您查看底部附近的结果,您会发现我有 2 组骗子,它们几乎相同,只是地址略有不同。Address和本质上是相同的,因此它会在“SouthWestern Medical”和“NERO CO”的 BOTH和值Remit_Address上找到匹配项,因此两组骗子在列表中出现两次而不是一次(请参阅底部的所需结果)。我需要匹配OR但我不知道如何限制每条记录在结果中出现一次。[cleanAddress][cleanRemit_Address][cleanAddress][cleanRemit_Address]

查询:

--SQL (Address List): Combines all addresses for a master list of all addresses in the table
SELECT * INTO [address_list] FROM (
    SELECT DISTINCT [NewAdd] FROM 
      (
        SELECT  DISTINCT [cleanAddress] AS [NewAdd]
          FROM [sample_data]
          WHERE 
            ( [cleanAddress] IS NOT NULL AND [cleanAddress] <> '' ) AND
            ( [Supplier_No]  IS NOT NULL  AND [Supplier_No] <> '' )
          GROUP BY [cleanAddress] 

        UNION 

        SELECT  DISTINCT [cleanRemit_Address] AS [NewAdd]
          FROM [sample_data]
          WHERE 
            ( [cleanRemit_Address] IS NOT NULL AND [cleanRemit_Address] <> '' ) AND
            ( [Supplier_No]  IS NOT NULL  AND [Supplier_No] <> '' )
          GROUP BY [cleanRemit_Address]
      ) q1
  ) q2 
ORDER BY 
  [NewAdd] 
  


--SQL (Address Dupes): Determines which addresses appear in the data more than once
SELECT * INTO [dupe_addresses] FROM (
    SELECT [NewAdd]
      FROM [address_list] n 
      LEFT JOIN [sample_data] pv ON 
      (
        ( n.[NewAdd] = pv.[cleanAddress]       AND ( [Address]       <> '' AND [Address]       IS NOT NULL ) )  OR
        ( n.[NewAdd] = pv.[cleanRemit_Address] AND ( [Remit_Address] <> '' AND [Remit_Address] IS NOT NULL ) )
      ) 
    WHERE 
      ( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' ) 
    GROUP BY [NewAdd] 
    HAVING COUNT(*) > 1
  ) q1 
ORDER BY [NewAdd] 

 

--SQL (Address Query): Outputs the information of the matched addresses
SELECT 
  'Address Match' AS [Reason], 
  pv.[Supplier_No], 
  pv.[Name], 
  pv.[Address], 
  pv.[City], 
  pv.[State], 
  pv.[Zip], 
  pv.[Country], 
  pv.[Remit_Address], 
  pv.[Remit_City], 
  pv.[Remit_State], 
  pv.[Remit_Zip], 
  pv.[Remit_Country]
FROM 
  [dupe_addresses] n 
  LEFT JOIN [sample_data] pv 
  ON (
    (n.[NewAdd] = pv.[cleanAddress] AND ( [Address] <> '' AND [Address] IS NOT NULL ) ) 
  OR 
    (n.[NewAdd] = pv.[cleanRemit_Address] AND ( [Remit_Address] <> '' AND [Remit_Address] IS NOT NULL ) )
  ) 
WHERE 
  ( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' ) 

样本数据:

CREATE TABLE [sample_data] (
    [Supplier_No]           varchar(255),
    [Name]                  varchar(255),
    [Address]               varchar(255),
    [City]                  varchar(255),
    [State]                 varchar(255),
    [Zip]                   varchar(255),
    [Country]               varchar(255),
    [Remit_Address]         varchar(255),
    [Remit_City]            varchar(255),
    [Remit_State]           varchar(255),
    [Remit_Zip]             varchar(255),
    [Remit_Country]         varchar(255),
    [cleanAddress]          varchar(255),
    [cleanRemit_Address]    varchar(255),
    CONSTRAINT [suppliers_pk] PRIMARY KEY ([Supplier_No])
)

INSERT INTO [sample_data] VALUES
    ('1039104','Geez Companies','100 Aero Hudson Rd','Streetsboro','OH','44241','','100 Aero Hudson Road','Streetsboro','OH','44241','USA','100 Aero Hudson Rd','100 Aero Hudson Rd'),
    ('1218409','SouthWestern Medical','100 West Balor Ave','Osceola','AR','72370','USA','SouthWestern Medical100 W Balor Ave','Osceola','AR','72370','USA','100 W Balor Ave','SouthWestern Medical100 W Balor Ave'),
    ('1243789','SouthWestern Medical','100 West Balor Ave','Osceola','AR','72370','USA','SouthWestern Medical100 West Balor Ave','Osceola','AR','72370','USA','100 W Balor Ave','SouthWestern Medical100 W Balor Ave'),
    ('1243636','SIRI SYSTEMS','15 BRAD ROAD','WEXFORD','PA','15090','','','','','','','15 BRAD RD',''),
    ('1152482','FLEETWOOD MACK','22 WINDSOCK CT','ADDISON','IL','60101','','PO BOX 951','CHICAGO','IL','60694-5124','','22 WINDSOCK CT','PO BOX 951'),
    ('1224483','Aerospace Junction','211500 Communicate Ave','Mingo Junction','OH','43939','USA','P O Box 99','Mingo Junction','OH','43939','USA','211500 Communicate Ave','PO Box 99'),
    ('1243397','Squeezy Felt','SCHREIBER DIST','NEW KENSINGTON','PA','15068','','','','','','','SCHREIBER DIST',''),
    ('1230895','NERO CO','28 North US State Highway 99','Osceola','AR','72370','USA','PO Box 204','Cape Girardeau','MO','63702-2045','USA','28 N US State Hwy 99','PO Box 204'),
    ('1243782','NERO CO','28 North US State Highway 99','Osceola','AR','72370','USA','PO Box 204','Cape Girardeau','MO','63702-2045','USA','28 N US State Hwy 99','PO Box 204'),
    ('1135880','RICHARD PRYOR SEMINARS','PO BOX 2194','KANSAS CITY','MO','64121-9468','USA','RICHARD PRYOR SEMINARS P O BOX 2194','KANSAS CITY','MO','64121-9468','USA','PO BOX 2194','RICHARD PRYOR SEMINARS PO BOX 2194'),
    ('1241328','INFINITY AND BEYOND','P.O. BOX 169','GASTONIA','NC','28053-0269','USA','','','','','','PO BOX 169',''),
    ('1259522','MILES STONES','PO BOX 169','GASSTONIA','NC','28053-0269','USA','','','','','','PO BOX 169',''),
    ('1255253','AT&T','PO Box 50221','Carol Stream','IL','60197','USA','','','','','','PO Box 50221',''),
    ('1135513','AT&T','PO Box 50221','Carol Stream','IL','60197-5080','USA','','','','','','PO Box 50221',''),
    ('1119161','Machine Co, Inc','3306 N Thorne Blvd','Chattanooga','TN','','','PO BOX 5301','CHATTANOOGA','TN','37406','USA','3306 N Thorne Blvd','PO BOX 5301'),
    ('1176587','Topsy Turvy','365 Welmington Road','Chicago','IL','60606','USA','','','','','','365 Welmington Rd',''),
    ('2156671','Topsy Turvvy, Inc.','P.O. Box 55217','Columbus','OH','43081','','365 Welmington Road','Chicago','IL','60606','USA','','365 Welmington Rd')

当前结果:

Reason  Supplier_No Name    Address City    State   Zip Country Remit_Address   Remit_City  Remit_State Remit_Zip   Remit_Country
Address Match   1218409 SouthWestern Medical    100 West Balor Ave  Osceola AR  72370   USA SouthWestern Medical100 W Balor Ave Osceola AR  72370   USA
Address Match   1243789 SouthWestern Medical    100 West Balor Ave  Osceola AR  72370   USA SouthWestern Medical100 West Balor Ave  Osceola AR  72370   USA
Address Match   1230895 NERO CO 28 North US State Highway 99    Osceola AR  72370   USA PO Box 204  Cape Girardeau  MO  63702-2045  USA
Address Match   1243782 NERO CO 28 North US State Highway 99    Osceola AR  72370   USA PO Box 204  Cape Girardeau  MO  63702-2045  USA
Address Match   1176587 Topsy Turvy 365 Welmington Road Chicago IL  60606   USA                 
Address Match   2156671 Topsy Turvvy, Inc.  P.O. Box 55217  Columbus    OH  43081       365 Welmington Road Chicago IL  60606   USA
Address Match   1241328 INFINITY AND BEYOND P.O. BOX 169    GASTONIA    NC  28053-0269  USA                 
Address Match   1259522 MILES STONES    PO BOX 169  GASSTONIA   NC  28053-0269  USA                 
Address Match   1230895 NERO CO 28 North US State Highway 99    Osceola AR  72370   USA PO Box 204  Cape Girardeau  MO  63702-2045  USA
Address Match   1243782 NERO CO 28 North US State Highway 99    Osceola AR  72370   USA PO Box 204  Cape Girardeau  MO  63702-2045  USA
Address Match   1255253 AT&T    PO Box 50221    Carol Stream    IL  60197   USA                 
Address Match   1135513 AT&T    PO Box 50221    Carol Stream    IL  60197-5080  USA                 
Address Match   1218409 SouthWestern Medical    100 West Balor Ave  Osceola AR  72370   USA Southern Lawn Care1004 W Hale Ave   Osceola AR  72370   USA
Address Match   1243789 SouthWestern Medical    100 West Balor Ave  Osceola AR  72370   USA SouthWestern Medical100 West Balor Ave  Osceola AR  72370   USA

期望的结果:

Reason  Supplier_No Name    Address City    State   Zip Country Remit_Address   Remit_City  Remit_State Remit_Zip   Remit_Country
Address Match   1218409 SouthWestern Medical    100 West Balor Ave  Osceola AR  72370   USA SouthWestern Medical100 W Balor Ave Osceola AR  72370   USA
Address Match   1243789 SouthWestern Medical    100 West Balor Ave  Osceola AR  72370   USA SouthWestern Medical100 West Balor Ave  Osceola AR  72370   USA
Address Match   1230895 NERO CO 28 North US State Highway 99    Osceola AR  72370   USA PO Box 204  Cape Girardeau  MO  63702-2045  USA
Address Match   1243782 NERO CO 28 North US State Highway 99    Osceola AR  72370   USA PO Box 204  Cape Girardeau  MO  63702-2045  USA
Address Match   1176587 Topsy Turvy 365 Welmington Road Chicago IL  60606   USA                 
Address Match   2156671 Topsy Turvvy, Inc.  P.O. Box 55217  Columbus    OH  43081       365 Welmington Road Chicago IL  60606   USA
Address Match   1241328 INFINITY AND BEYOND P.O. BOX 169    GASTONIA    NC  28053-0269  USA                 
Address Match   1259522 MILES STONES    PO BOX 169  GASSTONIA   NC  28053-0269  USA                 
Address Match   1255253 AT&T    PO Box 50221    Carol Stream    IL  60197   USA                 
Address Match   1135513 AT&T    PO Box 50221    Carol Stream    IL  60197-5080  USA                 

标签: sqlsql-servertsqljoin

解决方案


只需将row_number每个供应商添加到最终结果集中并仅过滤掉第 1 行。

请注意,该row_number函数需要一个order by子句,用于确定您希望保留哪些重复行。改变它以适应你的情况。

WITH cte AS (
    SELECT 
        'Address Match' AS [Reason], 
        pv.[Supplier_No], 
        pv.[Name], 
        pv.[Address], 
        pv.[City], 
        pv.[State], 
        pv.[Zip], 
        pv.[Country], 
        pv.[Remit_Address], 
        pv.[Remit_City], 
        pv.[Remit_State], 
        pv.[Remit_Zip], 
        pv.[Remit_Country]
        , ROW_NUMBER() OVER (PARTITION BY pv.[Supplier_No] ORDER BY pv.[Name]) rn
    FROM dupe_addresses n 
    LEFT JOIN sample_data pv 
        ON (
            (n.[NewAdd] = pv.[cleanAddress] AND ( [Address] <> '' AND [Address] IS NOT NULL ))
            OR (n.[NewAdd] = pv.[cleanRemit_Address] AND ( [Remit_Address] <> '' AND [Remit_Address] IS NOT NULL))
        ) 
    WHERE ([Supplier_No] IS NOT NULL AND [Supplier_No] <> '') 
)
SELECT *
FROM cte
WHERE rn = 1
ORDER BY Supplier_No, [Name];

推荐阅读