首页 > 解决方案 > 按相似性对行进行分组

问题描述

我正在使用 SQL 服务器。

我有下表:

对于每个 BIGroup,我都有多个 VarianceName。对于每个 VarianceName,我有多个 PartNumber。我将每个零件编号与同一 BIGroup 和 VarianceName 中的其他零件编号进行比较,并在差异列中写入 PartNumber1 和 PartNumber2 之间的差异数量:

+---------+--------------+-------------+-------------+------------+-----------+
| BIGroup | VarianceName | PartNumber1 | PartNumber2 | Difference |  Cluster  |
+---------+--------------+-------------+-------------+------------+-----------+
| D934    | A            |    11426777 |    11426777 |          0 |           |
| D934    | A            |    11426777 |    11426781 |          0 |           |
| D934    | A            |    11426777 |    12542804 |          2 |           |
| D934    | A            |    11426777 |    12554759 |          4 |           |
| D934    | A            |    11426777 |    12564258 |          0 |           |
| D934    | A            |    11426781 |    11426777 |          0 |           |
| D934    | A            |    11426781 |    11426781 |          0 |           |
| D934    | A            |    11426781 |    12542804 |          5 |           |
| D934    | A            |    11426781 |    12554759 |          1 |           |
| D934    | A            |    11426781 |    12564258 |          0 |           |
| D934    | A            |    12542804 |    11426777 |          2 |           |
| D934    | A            |    12542804 |    11426781 |          5 |           |
| D934    | A            |    12542804 |    12542804 |          0 |           |
| D934    | A            |    12542804 |    12554759 |          0 |           |
| D934    | A            |    12542804 |    12564258 |          8 |           |
| D934    | A            |    12554759 |    11426777 |          4 |           |
| D934    | A            |    12554759 |    11426781 |          1 |           |
| D934    | A            |    12554759 |    12542804 |          0 |           |
| D934    | A            |    12554759 |    12554759 |          0 |           |
| D934    | A            |    12554759 |    12564258 |          9 |           |
| D934    | A            |    12564258 |    11426777 |          0 |           |
| D934    | A            |    12564258 |    11426781 |          0 |           |
| D934    | A            |    12564258 |    12542804 |          8 |           |
| D934    | A            |    12564258 |    12554759 |          9 |           |
| D934    | A            |    12564258 |    12564258 |          0 |           |
| D934    | AA           |    11438878 |    11438878 |          0 |           |
| D934    | AB           |    11438924 |    11438924 |          0 |           |
| D934    | AC           |    12556213 |    12556213 |          0 |           |
| D934    | AC           |    12556213 |    12556214 |          5 |           |
| D934    | AC           |    12556214 |    12556213 |          5 |           |
| D934    | AC           |    12556214 |    12556214 |          0 |           |
| D955    | A            |    75346846 |    75346846 |          0 |           |
| ...     | ...          |    ...      |    ...      |          0 |           |
+---------+--------------+-------------+-------------+------------+-----------+

例如:对于 D934,对于 VarianceName A,PartNumbers 11426777、11426781 和 12564258 是相同的,因为:11426777 和 11426781、11426781 和 12564258、12564258 和 11426777 之间存在 0 个差异。

例如:对于 D934,对于 VarianceName A,PartNumbers 12542804 和 12554759 相同,因为:12542804 和 12554759 之间有 0 个差异。

我的目标是识别同一 BIGroup 和 VarianceName 中的所有相同 PartNumber 组。 为了标记这些组,我将使用名为Cluster的列。

所以 11426777、11426781 和 12564258 将属于集群 D934-A-C1。

所以 12542804 和 12554759 将属于集群 D934-A-C2。

更新 Cluster 列的查询/存储过程应该是什么,以获得以下结果:

+---------+--------------+-------------+-------------+------------+-----------+
| BIGroup | VarianceName | PartNumber1 | PartNumber2 | Difference |  Cluster  |
+---------+--------------+-------------+-------------+------------+-----------+
| D934    | A            |    11426777 |    11426777 |          0 | D934-A-C1 |
| D934    | A            |    11426777 |    11426781 |          0 | D934-A-C1 |
| D934    | A            |    11426777 |    12542804 |          2 |           |
| D934    | A            |    11426777 |    12554759 |          4 |           |
| D934    | A            |    11426777 |    12564258 |          0 | D934-A-C1 |
| D934    | A            |    11426781 |    11426777 |          0 | D934-A-C1 |
| D934    | A            |    11426781 |    11426781 |          0 | D934-A-C1 |
| D934    | A            |    11426781 |    12542804 |          5 |           |
| D934    | A            |    11426781 |    12554759 |          1 |           |
| D934    | A            |    11426781 |    12564258 |          0 | D934-A-C1 |
| D934    | A            |    12542804 |    11426777 |          2 |           |
| D934    | A            |    12542804 |    11426781 |          5 |           |
| D934    | A            |    12542804 |    12542804 |          0 | D934-A-C2 |
| D934    | A            |    12542804 |    12554759 |          0 | D934-A-C2 |
| D934    | A            |    12542804 |    12564258 |          8 |           |
| D934    | A            |    12554759 |    11426777 |          4 |           |
| D934    | A            |    12554759 |    11426781 |          1 |           |
| D934    | A            |    12554759 |    12542804 |          0 | D934-A-C2 |
| D934    | A            |    12554759 |    12554759 |          0 | D934-A-C2 |
| D934    | A            |    12554759 |    12564258 |          9 |           |
| D934    | A            |    12564258 |    11426777 |          0 | D934-A-C1 |
| D934    | A            |    12564258 |    11426781 |          0 | D934-A-C1 |
| D934    | A            |    12564258 |    12542804 |          8 |           |
| D934    | A            |    12564258 |    12554759 |          9 |           |
| D934    | A            |    12564258 |    12564258 |          0 | D934-A-C1 |

其他 VarianceName 依此类推

| D934    | AA           |    11438878 |    11438878 |          0 | D934-AA-C1  
| D934    | AB           |    11438924 |    11438924 |          0 | D934-AB-C1
| D934    | AC           |    12556213 |    12556213 |          0 | D934-AC-C1
| D934    | AC           |    12556213 |    12556214 |          5 |
| D934    | AC           |    12556214 |    12556213 |          5 |
| D934    | AC           |    12556214 |    12556214 |          0 | D934-AC-C1

其他 BiGroup 以此类推

| D955    | A            |    75346846 |    75346846 |          0 | D955-A-C1
| ...     | ...          |    ...      |    ...      |        ... |
+---------+--------------+-------------+-------------+------------+-----------+

如果差异 > 0,则该列应保留为 NULL

这是将数据作为 cte 的脚本:

with t1 as
( 
select 'D934'  as BIGroup  ,'A'    as VarianceName  ,   11426777 as PartNumber1,   11426777 as PartNumber2,         0 as Difference,  null as Cluster        
union select 'D934'    ,'A'            ,   11426777 ,   11426781 ,         0 , null
union select 'D934'    ,'A'            ,   11426777 ,   12542804 ,         2 , null
union select 'D934'    ,'A'            ,   11426777 ,   12554759 ,         4 , null
union select 'D934'    ,'A'            ,   11426777 ,   12564258 ,         0 , null
union select 'D934'    ,'A'            ,   11426781 ,   11426777 ,         0 , null
union select 'D934'    ,'A'            ,   11426781 ,   11426781 ,         0 , null
union select 'D934'    ,'A'            ,   11426781 ,   12542804 ,         5 , null
union select 'D934'    ,'A'            ,   11426781 ,   12554759 ,         1 , null
union select 'D934'    ,'A'            ,   11426781 ,   12564258 ,         0 , null
union select 'D934'    ,'A'            ,   12542804 ,   11426777 ,         2 , null
union select 'D934'    ,'A'            ,   12542804 ,   11426781 ,         5 , null
union select 'D934'    ,'A'            ,   12542804 ,   12542804 ,         0 , null
union select 'D934'    ,'A'            ,   12542804 ,   12554759 ,         0 , null
union select 'D934'    ,'A'            ,   12542804 ,   12564258 ,         8 , null
union select 'D934'    ,'A'            ,   12554759 ,   11426777 ,         4 , null
union select 'D934'    ,'A'            ,   12554759 ,   11426781 ,         1 , null
union select 'D934'    ,'A'            ,   12554759 ,   12542804 ,         0 , null
union select 'D934'    ,'A'            ,   12554759 ,   12554759 ,         0 , null
union select 'D934'    ,'A'            ,   12554759 ,   12564258 ,         9 , null
union select 'D934'    ,'A'            ,   12564258 ,   11426777 ,         0 , null
union select 'D934'    ,'A'            ,   12564258 ,   11426781 ,         0 , null
union select 'D934'    ,'A'            ,   12564258 ,   12542804 ,         8 , null
union select 'D934'    ,'A'            ,   12564258 ,   12554759 ,         9 , null
union select 'D934'    ,'A'            ,   12564258 ,   12564258 ,         0 , null
union select 'D934'    ,'AA'           ,   11438878 ,   11438878 ,         0 , null
union select 'D934'    ,'AB'           ,   11438924 ,   11438924 ,         0 , null
union select 'D934'    ,'AC'           ,   12556213 ,   12556213 ,         0 , null
union select 'D934'    ,'AC'           ,   12556213 ,   12556214 ,         5 , null
union select 'D934'    ,'AC'           ,   12556214 ,   12556213 ,         5 , null
union select 'D934'    ,'AC'           ,   12556214 ,   12556214 ,         0 , null
union select 'D955'    ,'A'            ,   75346846 ,   75346846 ,         0 , null
)

编辑: 为了更好地理解问题,我画了 5 个部件号D934 A、它们的链接和两个集群。

我们感兴趣的链接是黑色的(因为这意味着partNumbers之间的差异为0)。

橙色链接表示零件编号之间的差异>0。

绘制链接后,我们可以识别 2 个集群,我用红色圆圈绘制了这些集群。

在此处输入图像描述

标签: sqlsql-serveralgorithmcluster-analysisdata-mining

解决方案


我设法用一个存储过程解决了这个问题:

DECLARE @BiGroup                    [nvarchar](30);
DECLARE @VarianceName               [nvarchar](30);
DECLARE @NewBiGroup                 [nvarchar](30);
DECLARE @NewVarianceName            [nvarchar](30);
DECLARE @PartNumber                 [nvarchar](30);
DECLARE @ClusterName                [nvarchar](30);
DECLARE @IncrementClusterName       [nvarchar](30);

set @BiGroup = 'first_BiGroup';
set @VarianceName = 'first_VarianceName';
set @IncrementClusterName = 1;
set @ClusterName = null;

-- Declare cursor
DECLARE cur CURSOR READ_ONLY FOR
Select [PartNumber1] FROM t1
order by [BIGroup] ,[VarianceName] ,[PartNumber1];

--clean cluster column
update t1 set [Cluster]=null;

OPEN cur
FETCH NEXT FROM cur INTO @PartNumber

-- Loop on every PartNumber
WHILE @@FETCH_STATUS = 0
BEGIN

  --set NewBiGroup and NewPartNumber
  set @NewBiGroup = (select Top(1) [BIGroup] from t1 where partnumber1 = @PartNumber);
  set @NewVarianceName = (select Top(1) [VarianceName] from t1 where partnumber1 = @PartNumber);

  --check if we are still in the same BIGroup and Variance, otherwise, reset the cluster increment
  if @NewBiGroup <> @BiGroup or @NewVarianceName <> @VarianceName 
  BEGIN
    set @IncrementClusterName = 1;
  END

  --get the clusterName of this partNumber, if it exists
  set @ClusterName = (select Top(1) [Cluster] from t1 where partnumber2 = @PartNumber and [Cluster] is not null);

  --if ClusterName is NULL, put a clustername and then increment the @IncrementClusterName, 
  --otherwise set the cluster to @ClusterName
  if @ClusterName is null
  BEGIN
    update t1 set [Cluster] = @NewBiGroup+'-'+@NewVarianceName+'-'+@IncrementClusterName
    where partnumber1 = @PartNumber  
    and Difference= 0 ;

    set @IncrementClusterName = @IncrementClusterName +1;
  END
  else
  BEGIN
    update t1 set [Cluster] = @NewBiGroup+'-'+@NewVarianceName+'-'+@ClusterName
    where partnumber1 = @PartNumber  
    and Difference= 0 ;
  END

  -- setting the BiGroup and VarianceName
  set @BiGroup = @NewBiGroup;
  set @VarianceName = @NewVarianceName;

 FETCH NEXT FROM cur INTO @PartNumber
END

CLOSE cur
DEALLOCATE cur

存储过程的算法是这样工作的:

  • 对于每个PartNumber差异=0
    • 如果BiGroupVarianceName已更改
      • 我重置@clusterIncrement为 1
    • 如果他还不是一个Cluster
      • 我将他设置Cluster@clusterIncrement
      • @clusterIncrement = @clusterIncrement +1
    • 如果他已经是一个Cluster
      • 我将他设置Cluster为现有的Cluster

推荐阅读