sql - 按相似性对行进行分组
问题描述
我正在使用 SQL 服务器。
我有下表:
对于每个 BIGroup,我都有多个 VarianceName。对于每个 VarianceName,我有多个 PartNumber。我将每个零件编号与同一 BIGroup 和 VarianceName 中的其他零件编号进行比较,并在差异列中写入 PartNumber1 和 PartNumber2 之间的差异数量:
+---------+--------------+-------------+-------------+------------+-----------+
| BIGroup | VarianceName | PartNumber1 | PartNumber2 | Difference | Cluster |
+---------+--------------+-------------+-------------+------------+-----------+
| D934 | A | 11426777 | 11426777 | 0 | |
| D934 | A | 11426777 | 11426781 | 0 | |
| D934 | A | 11426777 | 12542804 | 2 | |
| D934 | A | 11426777 | 12554759 | 4 | |
| D934 | A | 11426777 | 12564258 | 0 | |
| D934 | A | 11426781 | 11426777 | 0 | |
| D934 | A | 11426781 | 11426781 | 0 | |
| D934 | A | 11426781 | 12542804 | 5 | |
| D934 | A | 11426781 | 12554759 | 1 | |
| D934 | A | 11426781 | 12564258 | 0 | |
| D934 | A | 12542804 | 11426777 | 2 | |
| D934 | A | 12542804 | 11426781 | 5 | |
| D934 | A | 12542804 | 12542804 | 0 | |
| D934 | A | 12542804 | 12554759 | 0 | |
| D934 | A | 12542804 | 12564258 | 8 | |
| D934 | A | 12554759 | 11426777 | 4 | |
| D934 | A | 12554759 | 11426781 | 1 | |
| D934 | A | 12554759 | 12542804 | 0 | |
| D934 | A | 12554759 | 12554759 | 0 | |
| D934 | A | 12554759 | 12564258 | 9 | |
| D934 | A | 12564258 | 11426777 | 0 | |
| D934 | A | 12564258 | 11426781 | 0 | |
| D934 | A | 12564258 | 12542804 | 8 | |
| D934 | A | 12564258 | 12554759 | 9 | |
| D934 | A | 12564258 | 12564258 | 0 | |
| D934 | AA | 11438878 | 11438878 | 0 | |
| D934 | AB | 11438924 | 11438924 | 0 | |
| D934 | AC | 12556213 | 12556213 | 0 | |
| D934 | AC | 12556213 | 12556214 | 5 | |
| D934 | AC | 12556214 | 12556213 | 5 | |
| D934 | AC | 12556214 | 12556214 | 0 | |
| D955 | A | 75346846 | 75346846 | 0 | |
| ... | ... | ... | ... | 0 | |
+---------+--------------+-------------+-------------+------------+-----------+
例如:对于 D934,对于 VarianceName A,PartNumbers 11426777、11426781 和 12564258 是相同的,因为:11426777 和 11426781、11426781 和 12564258、12564258 和 11426777 之间存在 0 个差异。
例如:对于 D934,对于 VarianceName A,PartNumbers 12542804 和 12554759 相同,因为:12542804 和 12554759 之间有 0 个差异。
我的目标是识别同一 BIGroup 和 VarianceName 中的所有相同 PartNumber 组。 为了标记这些组,我将使用名为Cluster的列。
所以 11426777、11426781 和 12564258 将属于集群 D934-A-C1。
所以 12542804 和 12554759 将属于集群 D934-A-C2。
更新 Cluster 列的查询/存储过程应该是什么,以获得以下结果:
+---------+--------------+-------------+-------------+------------+-----------+
| BIGroup | VarianceName | PartNumber1 | PartNumber2 | Difference | Cluster |
+---------+--------------+-------------+-------------+------------+-----------+
| D934 | A | 11426777 | 11426777 | 0 | D934-A-C1 |
| D934 | A | 11426777 | 11426781 | 0 | D934-A-C1 |
| D934 | A | 11426777 | 12542804 | 2 | |
| D934 | A | 11426777 | 12554759 | 4 | |
| D934 | A | 11426777 | 12564258 | 0 | D934-A-C1 |
| D934 | A | 11426781 | 11426777 | 0 | D934-A-C1 |
| D934 | A | 11426781 | 11426781 | 0 | D934-A-C1 |
| D934 | A | 11426781 | 12542804 | 5 | |
| D934 | A | 11426781 | 12554759 | 1 | |
| D934 | A | 11426781 | 12564258 | 0 | D934-A-C1 |
| D934 | A | 12542804 | 11426777 | 2 | |
| D934 | A | 12542804 | 11426781 | 5 | |
| D934 | A | 12542804 | 12542804 | 0 | D934-A-C2 |
| D934 | A | 12542804 | 12554759 | 0 | D934-A-C2 |
| D934 | A | 12542804 | 12564258 | 8 | |
| D934 | A | 12554759 | 11426777 | 4 | |
| D934 | A | 12554759 | 11426781 | 1 | |
| D934 | A | 12554759 | 12542804 | 0 | D934-A-C2 |
| D934 | A | 12554759 | 12554759 | 0 | D934-A-C2 |
| D934 | A | 12554759 | 12564258 | 9 | |
| D934 | A | 12564258 | 11426777 | 0 | D934-A-C1 |
| D934 | A | 12564258 | 11426781 | 0 | D934-A-C1 |
| D934 | A | 12564258 | 12542804 | 8 | |
| D934 | A | 12564258 | 12554759 | 9 | |
| D934 | A | 12564258 | 12564258 | 0 | D934-A-C1 |
其他 VarianceName 依此类推
| D934 | AA | 11438878 | 11438878 | 0 | D934-AA-C1
| D934 | AB | 11438924 | 11438924 | 0 | D934-AB-C1
| D934 | AC | 12556213 | 12556213 | 0 | D934-AC-C1
| D934 | AC | 12556213 | 12556214 | 5 |
| D934 | AC | 12556214 | 12556213 | 5 |
| D934 | AC | 12556214 | 12556214 | 0 | D934-AC-C1
其他 BiGroup 以此类推
| D955 | A | 75346846 | 75346846 | 0 | D955-A-C1
| ... | ... | ... | ... | ... |
+---------+--------------+-------------+-------------+------------+-----------+
如果差异 > 0,则该列应保留为 NULL
这是将数据作为 cte 的脚本:
with t1 as
(
select 'D934' as BIGroup ,'A' as VarianceName , 11426777 as PartNumber1, 11426777 as PartNumber2, 0 as Difference, null as Cluster
union select 'D934' ,'A' , 11426777 , 11426781 , 0 , null
union select 'D934' ,'A' , 11426777 , 12542804 , 2 , null
union select 'D934' ,'A' , 11426777 , 12554759 , 4 , null
union select 'D934' ,'A' , 11426777 , 12564258 , 0 , null
union select 'D934' ,'A' , 11426781 , 11426777 , 0 , null
union select 'D934' ,'A' , 11426781 , 11426781 , 0 , null
union select 'D934' ,'A' , 11426781 , 12542804 , 5 , null
union select 'D934' ,'A' , 11426781 , 12554759 , 1 , null
union select 'D934' ,'A' , 11426781 , 12564258 , 0 , null
union select 'D934' ,'A' , 12542804 , 11426777 , 2 , null
union select 'D934' ,'A' , 12542804 , 11426781 , 5 , null
union select 'D934' ,'A' , 12542804 , 12542804 , 0 , null
union select 'D934' ,'A' , 12542804 , 12554759 , 0 , null
union select 'D934' ,'A' , 12542804 , 12564258 , 8 , null
union select 'D934' ,'A' , 12554759 , 11426777 , 4 , null
union select 'D934' ,'A' , 12554759 , 11426781 , 1 , null
union select 'D934' ,'A' , 12554759 , 12542804 , 0 , null
union select 'D934' ,'A' , 12554759 , 12554759 , 0 , null
union select 'D934' ,'A' , 12554759 , 12564258 , 9 , null
union select 'D934' ,'A' , 12564258 , 11426777 , 0 , null
union select 'D934' ,'A' , 12564258 , 11426781 , 0 , null
union select 'D934' ,'A' , 12564258 , 12542804 , 8 , null
union select 'D934' ,'A' , 12564258 , 12554759 , 9 , null
union select 'D934' ,'A' , 12564258 , 12564258 , 0 , null
union select 'D934' ,'AA' , 11438878 , 11438878 , 0 , null
union select 'D934' ,'AB' , 11438924 , 11438924 , 0 , null
union select 'D934' ,'AC' , 12556213 , 12556213 , 0 , null
union select 'D934' ,'AC' , 12556213 , 12556214 , 5 , null
union select 'D934' ,'AC' , 12556214 , 12556213 , 5 , null
union select 'D934' ,'AC' , 12556214 , 12556214 , 0 , null
union select 'D955' ,'A' , 75346846 , 75346846 , 0 , null
)
编辑:
为了更好地理解问题,我画了 5 个部件号D934
A
、它们的链接和两个集群。
我们感兴趣的链接是黑色的(因为这意味着partNumbers之间的差异为0)。
橙色链接表示零件编号之间的差异>0。
绘制链接后,我们可以识别 2 个集群,我用红色圆圈绘制了这些集群。
解决方案
我设法用一个存储过程解决了这个问题:
DECLARE @BiGroup [nvarchar](30);
DECLARE @VarianceName [nvarchar](30);
DECLARE @NewBiGroup [nvarchar](30);
DECLARE @NewVarianceName [nvarchar](30);
DECLARE @PartNumber [nvarchar](30);
DECLARE @ClusterName [nvarchar](30);
DECLARE @IncrementClusterName [nvarchar](30);
set @BiGroup = 'first_BiGroup';
set @VarianceName = 'first_VarianceName';
set @IncrementClusterName = 1;
set @ClusterName = null;
-- Declare cursor
DECLARE cur CURSOR READ_ONLY FOR
Select [PartNumber1] FROM t1
order by [BIGroup] ,[VarianceName] ,[PartNumber1];
--clean cluster column
update t1 set [Cluster]=null;
OPEN cur
FETCH NEXT FROM cur INTO @PartNumber
-- Loop on every PartNumber
WHILE @@FETCH_STATUS = 0
BEGIN
--set NewBiGroup and NewPartNumber
set @NewBiGroup = (select Top(1) [BIGroup] from t1 where partnumber1 = @PartNumber);
set @NewVarianceName = (select Top(1) [VarianceName] from t1 where partnumber1 = @PartNumber);
--check if we are still in the same BIGroup and Variance, otherwise, reset the cluster increment
if @NewBiGroup <> @BiGroup or @NewVarianceName <> @VarianceName
BEGIN
set @IncrementClusterName = 1;
END
--get the clusterName of this partNumber, if it exists
set @ClusterName = (select Top(1) [Cluster] from t1 where partnumber2 = @PartNumber and [Cluster] is not null);
--if ClusterName is NULL, put a clustername and then increment the @IncrementClusterName,
--otherwise set the cluster to @ClusterName
if @ClusterName is null
BEGIN
update t1 set [Cluster] = @NewBiGroup+'-'+@NewVarianceName+'-'+@IncrementClusterName
where partnumber1 = @PartNumber
and Difference= 0 ;
set @IncrementClusterName = @IncrementClusterName +1;
END
else
BEGIN
update t1 set [Cluster] = @NewBiGroup+'-'+@NewVarianceName+'-'+@ClusterName
where partnumber1 = @PartNumber
and Difference= 0 ;
END
-- setting the BiGroup and VarianceName
set @BiGroup = @NewBiGroup;
set @VarianceName = @NewVarianceName;
FETCH NEXT FROM cur INTO @PartNumber
END
CLOSE cur
DEALLOCATE cur
存储过程的算法是这样工作的:
- 对于每个
PartNumber
差异=0- 如果
BiGroup
或VarianceName
已更改- 我重置
@clusterIncrement
为 1
- 我重置
- 如果他还不是一个
Cluster
- 我将他设置
Cluster
为@clusterIncrement
@clusterIncrement = @clusterIncrement +1
- 我将他设置
- 如果他已经是一个
Cluster
- 我将他设置
Cluster
为现有的Cluster
- 我将他设置
- 如果
推荐阅读
- c - 如何在 LLVM C 中将 int 转换为浮点数?
- google-sheets - 谷歌表格公式根据条件添加列名作为列条目
- reactjs - Redux mapStateToProps 到类反应组件
- java - 在加入操作中使用实体 ID 时调用 JPA PreUpdate
- azure-active-directory - 有没有办法在不使用 AAD 作为身份验证的情况下为客户嵌入 Power Bi 报告?
- groovy - UnetStack:如何在两个调制解调器之间自动传输数据
- regression - 我应该如何格式化 logit 回归来测试公司的动机?(右)
- regex - VBA 正则表达式 REGEX : N.NN% 或 N.NNNN
- powershell - 使用 powershell 删除 Skype for business
- javascript - 将数据发送到 php 以从数据库中重新运行数据