sql - 根据匹配的客户字段(电话、电子邮件、地址)为客户分配 home_id - 重复问题
问题描述
我有一个客户表,其中每个客户都有一个唯一id
依赖于他们下订单的电子邮件。此外,还有用于phone
、email
和的单独表格列address
。household_id
如果他们的phone
,email
或address
匹配客户表中的另一个“客户”,我正在尝试将客户 ID 分组在一起。
我遇到的问题是我可以将客户分组在一起并给他们一个household_id
,但我正在努力在我的过滤中完全删除这些客户分组的重复出现。下面查询中的最后两条评论旨在帮助解释我当前的过滤逻辑并说明它失败的地方。此逻辑适用于成对的客户,但一旦需要将 3 个或更多客户绑定到相同的单个 home_id,就会开始失败。有没有更好的方法来过滤这些结果,或者我是否需要添加一些额外的 CTE 来利用 min()/max() 函数和其他类型的分组来在这里添加更多智能?除了 rank() 之外,还有什么其他聪明的窗口函数可以帮助我吗?
with household as (
select
c1.id as parent_id,
c2.id as child_id,
rank() over (partition by c1.id order by c2.id) as child_number
-- order by clause is important here to ensure lowest c2.id is always rank 1 (referenced later on in household join onto customer table)
from customer c1
left join customer c2 on (c1.phone = c2.phone) or (c1.email = c2.email) or (c1.address = c2.address)
order by c1.id, child_number
)
select
'H-' || h.parent_id as household_id, -- effectively creates a unique household_id
h.child_id
from household h
where h.parent_id < h.child_id or (h.parent_id = h.child_id and h.child_number = 1)
-- ^this where clause is my attempt at removing the duplicate groupings of customers
-- it works in the instance when there is a pair of customers tied to a household_id, but when there are 3 or more it starts to fail
请参阅链接图片以查看家庭 cte 的视图,其中包含 3 个 customer_id 的分组,因为它们具有匹配的电话、电子邮件或地址而连接在一起。突出显示的行是在上述查询的 where 子句中通过我的过滤器的内容 我的查询
如何失败
解决方案
我使用与 Julius 评论中讨论的方法类似的方法解决了这个问题:
- 使用递归 cte 将所有类似的客户分组到数组中
grp
- 将递归修剪
grp
为仅包含所有家庭成员使用的数组以及cte中select distinct on
的cardinality()
函数household
- 扫描表格并通过取消嵌套数组并根据数组中的最小 id 将其所有元素绑定到单个元素来
household
删除任何重复项linked_customers
household_id
这似乎工作得很好,但我确信这个查询可以进一步简化,我很乐意欢迎任何反馈!
with recursive grp as (
select
c1.id,
c1.email,
c1.phone,
c1.address,
array[c1.id] as linked_customers -- initializes an array based of the id of every customer
from customer c1
union all
select
c2.id,
c2.email,
c2.phone,
c2.address,
c2.id || linked_customers
from grp g
join customer c2 on (g.email = c2.email or g.phone = c2.phone or g.address = c2.address)
where c2.id <> all(linked_customers) -- ensures the same customer id that already was used in the array initialization is not being looked at again
), -- creates several similar groups as well as intermediary groups of customers
household as (
select
distinct on (g.id) g.linked_customers
from grp g
order by g.id, cardinality(linked_customers) desc
) -- extracts largest array for each customer_id (still duplicate groupings here), but only the max length arrays are being pulled out (all household members)
select
distinct on (p.parent_id) 'H-' || parent_id as household_id,
unnest(p.linked_customers) as child_id
from (
select
min(parent_id) as parent_id, -- pulls out the minimum id of each linked customers group which will remove the creation of multiple household_ids for the same customer groups in the select clause above
h.linked_customers
from household h, unnest(h.linked_customers) parent_id
group by h.linked_customers
) p
order by parent_id
推荐阅读
- arrays - 如何通过遍历散列数组并一次考虑 2 个散列来删除键?
- ruby - 为什么 Ruby hash 不是 last 方法?
- c# - 通过N个字符到N个字符获取字符串的一部分
- listener - 带有 JButton 动作侦听器的 NullPointerException
- python - 扁平化 json 文档,多次出现堆叠
- node.js - 无法在基于 Node ExpressJS 服务器的服务器发送事件中检索 req.body 的值?
- javascript - 使用 eventpreventDefault 获取无法设置属性“innerHTML”为空
- javascript - 使用 JQuery 在列表元素中获取动态类名?
- c++ - 如何计算小数点后3位的差异
- r - 如何使用闪亮制作交互式图例标签?