首页 > 解决方案 > 根据匹配的客户字段(电话、电子邮件、地址)为客户分配 home_id - 重复问题

问题描述

我有一个客户表,其中每个客户都有一个唯一id依赖于他们下订单的电子邮件。此外,还有用于phoneemail和的单独表格列addresshousehold_id如果他们的phone,emailaddress匹配客户表中的另一个“客户”,我正在尝试将客户 ID 分组在一起。

我遇到的问题是我可以将客户分组在一起并给他们一个household_id,但我正在努力在我的过滤中完全删除这些客户分组的重复出现。下面查询中的最后两条评论旨在帮助解释我当前的过滤逻辑并说明它失败的地方。此逻辑适用于成对的客户,但一旦需要将 3 个或更多客户绑定到相同的单个 home_id,就会开始失败。有没有更好的方法来过滤这些结果,或者我是否需要添加一些额外的 CTE 来利用 min()/max() 函数和其他类型的分组来在这里添加更多智能?除了 rank() 之外,还有什么其他聪明的窗口函数可以帮助我吗?

with household as (
  select
    c1.id as parent_id,
    c2.id as child_id,
    rank() over (partition by c1.id order by c2.id) as child_number
    -- order by clause is important here to ensure lowest c2.id is always rank 1 (referenced later on in household join onto customer table)
                    
  from customer c1
    left join customer c2 on (c1.phone = c2.phone) or (c1.email = c2.email) or (c1.address = c2.address)
                            
  order by c1.id, child_number
)
                
select
  'H-' || h.parent_id as household_id, -- effectively creates a unique household_id
  h.child_id
                    
from household h
  where h.parent_id < h.child_id or (h.parent_id = h.child_id and h.child_number = 1)
  -- ^this where clause is my attempt at removing the duplicate groupings of customers
  -- it works in the instance when there is a pair of customers tied to a household_id, but when there are 3 or more it starts to fail

请参阅链接图片以查看家庭 cte 的视图,其中包含 3 个 customer_id 的分组,因为它们具有匹配的电话、电子邮件或地址而连接在一起。突出显示的行是在上述查询的 where 子句中通过我的过滤器的内容 我的查询
如何失败

标签: sqlpostgresqlwindow-functions

解决方案


我使用与 Julius 评论中讨论的方法类似的方法解决了这个问题:

  1. 使用递归 cte 将所有类似的客户分组到数组中grp
  2. 将递归修剪grp为仅包含所有家庭成员使用的数组以及cte中select distinct oncardinality()函数household
  3. 扫描表格并通过取消嵌套数组并根据数组中的最小 id 将其所有元素绑定到单个元素来household删除任何重复项linked_customershousehold_id

这似乎工作得很好,但我确信这个查询可以进一步简化,我很乐意欢迎任何反馈!

with recursive grp as (
    select
        c1.id,
        c1.email,
        c1.phone,
        c1.address,
        array[c1.id] as linked_customers -- initializes an array based of the id of every customer
    from customer c1
    
    union all
    
    select
        c2.id,
        c2.email,
        c2.phone,
        c2.address,
        c2.id || linked_customers
    from grp g
        join customer c2 on (g.email = c2.email or g.phone = c2.phone or g.address = c2.address)
    where c2.id <> all(linked_customers) -- ensures the same customer id that already was used in the array initialization is not being looked at again
), -- creates several similar groups as well as intermediary groups of customers

household as (
    select
        distinct on (g.id) g.linked_customers
    from grp g
    order by g.id, cardinality(linked_customers) desc
) -- extracts largest array for each customer_id (still duplicate groupings here), but only the max length arrays are being pulled out (all household members)

select
    distinct on (p.parent_id) 'H-' || parent_id as household_id,
    unnest(p.linked_customers) as child_id
from (
    select
        min(parent_id) as parent_id, -- pulls out the minimum id of each linked customers group which will remove the creation of multiple household_ids for the same customer groups in the select clause above
        h.linked_customers
    from household h, unnest(h.linked_customers) parent_id
    group by h.linked_customers
) p
order by parent_id

推荐阅读