首页 > 解决方案 > 如何将连接子图 ID 分配给 Oracle SQL 中无向图中的每个节点?

问题描述

如何在由 and 定义的 link_tbl 中划分一组from_node链接to_node。有 10M 个节点和 25M 个链接(每个节点不超过 20 个链接)。

例如
在此处输入图像描述

该图由三个不相交的子图组成。

  create table link_tbl as (
  select 'AY' as linkid, 'A' as from_node, 'Y' as to_node from dual union all
  select 'AB', 'A', 'B' from dual union all
  select 'CA', 'C', 'A' from dual union all      
  select 'GE', 'G', 'E' from dual union all
  select 'CB', 'C', 'B' from dual union all
  select 'EF', 'E', 'F' from dual union all
  select 'NM', 'N', 'M' from dual
  );
  --compute subnetid
  select * from link_tbl order by subnetid;

要获得带有subnetid值的结果集?

在此处输入图像描述

我可以使用 Java 中的洪水填充变体将子图 id 分配给图的每个节点。但这可以在 SQL 中完成吗?

伪代码:
- 使用连续整数对节点进行排名。- 将链接表示为(node1*100M + node2) as number(19,0) - 将节点 1、节点 2 与节点 2、节点 1 和排序联合 - 获取第一个节点作为锚点node并添加到subgraph_nodes' - Iterate over link table -添加节点 2to subgraph_nodes if node1 is in (subgraph_nodes) AND node2 not in subgraph_nodes

这应该将所有连接的节点添加到subgraph_nodes

这已经足够了,因为现在我可以将 subgraph_id' 添加到节点表并选择所有没有的节点subgraph id并重复子图分析。

有 10M 个节点表示为连续整数和 25M 个链接(每个节点不超过 20 个链接),表示为(from_node*100M + to_node) as ID, from_node, to_node.

标签: sqlgraphoracle11g

解决方案


前段时间,我写了一个问题的答案How to find all connected subgraphs of an undirected graph。它是为 SQL Server 编写的,但 Oracle 支持标准递归查询,因此很容易将其转换为 Oracle。使用特定于 Oracle 的结构可以更有效地编写它。

样本数据

create table link_tbl as (
  select 'AY' as linkid, 'A' as from_node, 'Y' as to_node from dual union all
  select 'AB', 'A', 'B' from dual union all
  select 'CA', 'C', 'A' from dual union all      
  select 'GE', 'G', 'E' from dual union all
  select 'CB', 'C', 'B' from dual union all
  select 'EF', 'E', 'F' from dual union all
  select 'NM', 'N', 'M' from dual
);

询问

WITH
CTE_Nodes
AS
(
    SELECT from_node AS Node
    FROM link_tbl

    UNION

    SELECT to_node AS Node
    FROM link_tbl
)
,CTE_Pairs
AS
(
    SELECT from_node AS Node1, to_node AS Node2
    FROM link_tbl
    WHERE from_node <> to_node

    UNION

    SELECT to_node AS Node1, from_node AS Node2
    FROM link_tbl
    WHERE from_node <> to_node
)
,CTE_Recursive (AnchorNode, Node1, Node2, NodePath, Lvl)
AS
(
    SELECT
        CAST(CTE_Nodes.Node AS varchar(2000)) AS AnchorNode
        , Node1
        , Node2
        , CAST(',' || Node1 || ',' || Node2 || ',' AS varchar(2000)) AS NodePath
        , 1 AS Lvl
    FROM 
        CTE_Pairs
        INNER JOIN CTE_Nodes ON CTE_Nodes.Node = CTE_Pairs.Node1

    UNION ALL

    SELECT 
        CTE_Recursive.AnchorNode
        , CTE_Pairs.Node1
        , CTE_Pairs.Node2
        , CAST(CTE_Recursive.NodePath || CTE_Pairs.Node2 || ',' AS varchar(2000)) AS NodePath
        , CTE_Recursive.Lvl + 1 AS Lvl
    FROM
        CTE_Pairs
        INNER JOIN CTE_Recursive ON CTE_Recursive.Node2 = CTE_Pairs.Node1
    WHERE
        CTE_Recursive.NodePath NOT LIKE CAST('%,' || CTE_Pairs.Node2 || ',%' AS varchar(2000))
)
,CTE_RecursionResult
AS
(
    SELECT AnchorNode, Node1, Node2
    FROM CTE_Recursive
)
,CTE_CleanResult
AS
(
    SELECT AnchorNode, Node1 AS Node
    FROM CTE_RecursionResult

    UNION

    SELECT AnchorNode, Node2 AS Node
    FROM CTE_RecursionResult
)
SELECT
    CTE_Nodes.Node
    ,LISTAGG(CTE_CleanResult.Node, ',') WITHIN GROUP (ORDER BY CTE_CleanResult.Node) AS GroupMembers
    ,DENSE_RANK() OVER (ORDER BY LISTAGG(CTE_CleanResult.Node, ',') WITHIN GROUP (ORDER BY CTE_CleanResult.Node)) AS GroupID
FROM
    CTE_Nodes
    INNER JOIN CTE_CleanResult ON CTE_CleanResult.AnchorNode = CTE_Nodes.Node
GROUP BY
    CTE_Nodes.Node
ORDER BY
    GroupID
    ,CTE_Nodes.Node
;

结果

+------+--------------+---------+
| NODE | GROUPMEMBERS | GROUPID |
+------+--------------+---------+
| A    | A,B,C,Y      |       1 |
| B    | A,B,C,Y      |       1 |
| C    | A,B,C,Y      |       1 |
| Y    | A,B,C,Y      |       1 |
| E    | E,F,G        |       2 |
| F    | E,F,G        |       2 |
| G    | E,F,G        |       2 |
| M    | M,N          |       3 |
| N    | M,N          |       3 |
+------+--------------+---------+

https://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=e61cf73824e7718a4686430ccd7398e7

这个怎么运作

CTE_Nodes

CTE_Nodes给出出现在from_nodeto_node列中的所有节点的列表. 由于它们可以以任何顺序出现,我们UNION将两列放在一起。UNION还会删除任何重复项。

+------+
| NODE |
+------+
| A    |
| B    |
| C    |
| E    |
| F    |
| G    |
| M    |
| N    |
| Y    |
+------+

CTE_Pairs

CTE_Pairs给出图在两个方向上的所有边的列表. 同样,UNION用于删除任何重复项。

+-------+-------+
| NODE1 | NODE2 |
+-------+-------+
| A     | B     |
| A     | C     |
| A     | Y     |
| B     | A     |
| B     | C     |
| C     | A     |
| C     | B     |
| E     | F     |
| E     | G     |
| F     | E     |
| G     | E     |
| M     | N     |
| N     | M     |
| Y     | A     |
+-------+-------+

CTE_Recursive

CTE_Recursive是从每个唯一节点开始递归遍历图的查询的主要部分。这些起始行由 的第一部分生成UNION ALLUNION ALL递归连接到自身的第二部分链接Node2Node1. 由于我们预先制作CTE_Pairs了双向写入的所有边,因此我们始终只能链接Node2Node1并且我们将获得图中的所有路径。同时查询构建NodePath- 到目前为止已经遍历的一串逗号分隔的节点。它用于WHERE过滤器:

CTE_Recursive.NodePath NOT LIKE CAST('%,' || CTE_Pairs.Node2 || ',%' AS varchar(2000))

一旦我们遇到之前已经包含在路径中的节点,递归就会停止,因为连接的节点列表已用尽。 AnchorNode是递归的起始节点,稍后将用于对结果进行分组。 Lvl并没有真正使用,我将其包括在内是为了更好地了解正在发生的事情。

+------------+-------+-------+-----------+-----+
| ANCHORNODE | NODE1 | NODE2 | NODEPATH  | LVL |
+------------+-------+-------+-----------+-----+
| A          | A     | Y     | ,A,Y,     |   1 |
| A          | A     | C     | ,A,C,     |   1 |
| A          | A     | B     | ,A,B,     |   1 |
| B          | B     | C     | ,B,C,     |   1 |
| B          | B     | A     | ,B,A,     |   1 |
| C          | C     | B     | ,C,B,     |   1 |
| C          | C     | A     | ,C,A,     |   1 |
| E          | E     | G     | ,E,G,     |   1 |
| E          | E     | F     | ,E,F,     |   1 |
| F          | F     | E     | ,F,E,     |   1 |
| G          | G     | E     | ,G,E,     |   1 |
| M          | M     | N     | ,M,N,     |   1 |
| N          | N     | M     | ,N,M,     |   1 |
| Y          | Y     | A     | ,Y,A,     |   1 |
| Y          | A     | B     | ,Y,A,B,   |   2 |
| C          | A     | B     | ,C,A,B,   |   2 |
| Y          | A     | C     | ,Y,A,C,   |   2 |
| B          | A     | C     | ,B,A,C,   |   2 |
| C          | A     | Y     | ,C,A,Y,   |   2 |
| B          | A     | Y     | ,B,A,Y,   |   2 |
| C          | B     | A     | ,C,B,A,   |   2 |
| A          | B     | C     | ,A,B,C,   |   2 |
| B          | C     | A     | ,B,C,A,   |   2 |
| A          | C     | B     | ,A,C,B,   |   2 |
| G          | E     | F     | ,G,E,F,   |   2 |
| F          | E     | G     | ,F,E,G,   |   2 |
| B          | A     | Y     | ,B,C,A,Y, |   3 |
| C          | A     | Y     | ,C,B,A,Y, |   3 |
| Y          | B     | C     | ,Y,A,B,C, |   3 |
| Y          | C     | B     | ,Y,A,C,B, |   3 |
+------------+-------+-------+-----------+-----+

CTE_CleanResult

CTE_CleanResult只留下相关部分 fromCTE_Recursive并再次合并Node1Node2using UNION

+------------+------+
| ANCHORNODE | NODE |
+------------+------+
| A          | A    |
| A          | B    |
| A          | C    |
| A          | Y    |
| B          | A    |
| B          | B    |
| B          | C    |
| B          | Y    |
| C          | A    |
| C          | B    |
| C          | C    |
| C          | Y    |
| E          | E    |
| E          | F    |
| E          | G    |
| F          | E    |
| F          | F    |
| F          | G    |
| G          | E    |
| G          | F    |
| G          | G    |
| M          | M    |
| M          | N    |
| N          | M    |
| N          | N    |
| Y          | A    |
| Y          | B    |
| Y          | C    |
| Y          | Y    |
+------------+------+

最终选择

现在我们需要Node为每个AnchorNode. LISTAGG可以。 DENSE_RANK()计算GroupID每个 的数字AnchorNode


效率

您有相当大的表,因此尝试在上面的单个查询中一次查找所有组可能效率很低。

提高效率的一种方法是不对整个数据集使用单个查询。不要尝试同时查找所有子网/组。将起点限制为一个节点。添加WHERE CTE_Nodes.Node = 'some node'到的第一部分CTE_Recursive。该查询将找到一个子网的所有节点。从大表中删除这些找到的节点,选择另一个起始节点,循环重复,直到大表为空。如果您实现CTE_NodesCTE_Pairs作为带有索引的临时表,它也可能会有所帮助。

我从未使用过 Oracle,也不知道它的过程代码语法,所以我将在下面写一个伪代码。

准备临时表

CREATE TABLE Nodes AS
(
    SELECT from_node AS Node
    FROM link_tbl

    UNION

    SELECT to_node AS Node
    FROM link_tbl
);
CREATE INDEX IX_Node ON Nodes (Node);

CREATE TABLE Pairs AS
(
    SELECT from_node AS Node1, to_node AS Node2
    FROM link_tbl
    WHERE from_node <> to_node

    UNION

    SELECT to_node AS Node1, from_node AS Node2
    FROM link_tbl
    WHERE from_node <> to_node
);
CREATE INDEX IX_Node1 ON Pairs (Node1);
CREATE INDEX IX_Node2 ON Pairs (Node2);

CREATE TABLE Subgraph AS
(
    SELECT Node FROM Nodes WHERE 1=0
);

CREATE TABLE Result
(
    GroupID int NOT NULL,
    Node varchar(10) NOT NULL
);

SET :GroupID = 0;

主循环开始

Node从中选择一个Nodes。任何行都可以。保存Nodeinto 变量。同样,我不知道正确的 Oracle 语法。

SELECT :N = Node FROM Nodes WHERE rownum=1;

如果Nodes为空,则停止循环。

SET :GroupID = :GroupID + 1;

运行递归查询,从Node上面选择的一个特定开始递归。

INSERT INTO Subgraph (Node)
WITH
CTE_Recursive (AnchorNode, Node1, Node2, NodePath, Lvl)
AS
(
    SELECT
        CAST(Nodes.Node AS varchar(2000)) AS AnchorNode
        , Node1
        , Node2
        , CAST(',' || Node1 || ',' || Node2 || ',' AS varchar(2000)) AS NodePath
        , 1 AS Lvl
    FROM 
        Pairs
        INNER JOIN Nodes ON Nodes.Node = Pairs.Node1
    WHERE
        Nodes.Node = :N  -- 'A'
        -- use variable here, don't know what the syntax is

    UNION ALL

    SELECT 
        CTE_Recursive.AnchorNode
        , Pairs.Node1
        , Pairs.Node2
        , CAST(CTE_Recursive.NodePath || Pairs.Node2 || ',' AS varchar(2000)) AS NodePath
        , CTE_Recursive.Lvl + 1 AS Lvl
    FROM
        Pairs
        INNER JOIN CTE_Recursive ON CTE_Recursive.Node2 = Pairs.Node1
    WHERE
        CTE_Recursive.NodePath NOT LIKE CAST('%,' || Pairs.Node2 || ',%' AS varchar(2000))
)
,CTE_Result
AS
(
    SELECT Node1 AS Node
    FROM CTE_Recursive

    UNION

    SELECT Node2 AS Node
    FROM CTE_Recursive
)
SELECT Node FROM CTE_Result
;

此查询将返回连接到给定起始节点的所有节点,即形成子图的节点。将其结果集保存到Subgraph临时表中。

将结果追加到最终Result表中,并为找到的子图分配一个 ID。

INSERT INTO Result (GroupID, Node)
SELECT :GroupID, Node
FROM Subgraph;

Nodes从和中删除已处理的节点Pairs

DELETE FROM Nodes
WHERE Node IN (SELECT Node FROM Subgraph)
;

DELETE FROM Pairs
WHERE 
    Node1 IN (SELECT Node FROM Subgraph)
    OR
    Node2 IN (SELECT Node FROM Subgraph)
;

清洁Subgraph桌子

DELETE FROM Subgraph; 

返回到循环的开头。

在最终Result表中将具有具有相应子图 ID 的所有节点。


实际上,您可以进一步简化它。你不需要Nodes桌子,Pairs就足够了。


推荐阅读