首页 > 解决方案 > 查找从叶子到森林中每个节点的所有路径

问题描述

我部分地问了这个问题,因为我没有足够的信息,但现在我有了,我可以问完整的问题。所以我在文本文件中有两列数据。第一列是前任,第二列是后继。我使用以下代码加载数据:

[line.split() for line in open('data.txt', encoding ='utf-8')]

假设我们的数据在文件中如下所示:

ANALYTICAL_BALANCE BFG_DEPOSIT 
CUSTOMER_DETAIL BALANCE 
BFG_2056 FFD_15 
BALANCE BFG_16 
BFG_16 STAT_HIST 
ANALYTICAL_BALANCE BFG_2056 
CUSTOM_DATA AND_11 
AND_11 DICT_DEAL 
DICT_DEAL BFG_2056 

并在加载后

[[ANALYTICAL_BALANCE,BFG_DEPOSIT],
[CUSTOMER_DETAIL,BALANCE],
[BFG_2056, FFD_15],
[BALANCE,BFG_16],
[BFG_16,STAT_HIST],
[ANALYTICAL_BALANCE,BFG_2056],
[CUSTOM_DATA,AND_11],
[AND_11,DICT_DEAL],
[DICT_DEAL,BFG_2056]]

然后我想连接这些数据。我创建了邻接列表:

def create_adj(edges):
    adj = {}   # or use defaultdict(list) to avoid `if` in the loop below
    for a, b in edges:
        if not a in adj:
            adj[a] = []
        if not b in adj:
            adj[b] = []
        adj[a].append(b)
    return adj

并获取所有路径:

def all_paths(adj):
    def recur(path):
        node = path[-1]
        neighbors = [neighbor for neighbor in adj[node] if not neighbor in path]
        if not neighbors:
            yield path
        for neighbor in neighbors:
            yield from recur(path + [neighbor])

    for node in adj:
        yield from recur([node])

所以输出看起来像这样:

data = [
    ["ANALYTICAL_BALANCE","BFG_DEPOSIT"],
    ["CUSTOMER_DETAIL","BALANCE"],
    ["BFG_2056", "FFD_15"],
    ["BALANCE","BFG_16"],
    ["BFG_16","STAT_HIST"],
    ["ANALYTICAL_BALANCE","BFG_2056"],
    ["CUSTOM_DATA","AND_11"],
    ["AND_11","DICT_DEAL"],
    ["DICT_DEAL","BFG_2056"]
]

adj = create_adj(data)

print([path for path in all_paths(adj) if len(path) > 1])


[ANALYTICAL_BALANCE,BFG_DEPOSIT]
[CUSTOMER_DETAIL,BALANCE,BFG_16,STAT_HIST]
[BFG_2056,FFD_15]
[BALANCE,BFG_16,STAT_HIST]
[ANALYTICAL_BALANCE,BFG_2056,FFD_15]
[CUSTOM_DATA,AND_11,DICT_DEAL,BFG_2056,FFD_15]
[AND_11,DICT_DEAL,BFG_2056,FFD_15]
[DICT_DEAL,BFG_2056,FFD_15]

我们可以将连接可视化为创建森林的单独树。由于输入数据的性质,这些树不会有任何循环。 连接可视化

现在我的问题是。如何获得从叶子到每棵树的每个节点的每个连接?我的意思是什么。我们有 3 棵树,所以我将从顶部的那棵开始。

Tree1:
ANALYTICAL_BALANCE BFG_DEPOSIT

Tree2:
ANALYTICAL_BALANCE BFG_2056
ANALYTICAL_BALANCE FFD_15
CUSTOM_DATA AND_11
CUSTOM_DATA DICT_DEAL
CUSTOM_DATA BFG_2056
CUSTOM_DATA FFD_15

Tree3:
CUSTOMER_DETAIL BALANCE
CUSTOMER_DETAIL BFG_16
CUSTOMER_DETAIL STAT_HIST

如您所见,我的第一次尝试是创建邻接列表并查找所有路径。然后我会删除非叶子节点之间的连接并过滤数据。输入 150 行很好,但是当我输入 13k 行的完整文件时,代码编译了 2 天,没有任何结束的迹象。因此,我正在为我的工作寻找最有效的代码或算法以及最佳数据类型(列表、数据框等)。任何帮助将不胜感激,因为我已经与它斗争了几天,我找不到任何关于如何解决这个问题的想法。如果有不清楚的地方,我会编辑帖子。
数据将使用 openpyxl 保存到 excel 文件中,因此当我按后继过滤时,我可以看到前任列中连接到此后继的每一片叶子
这是我的整个代码。

# create adjacencies
def create_adj(edges):
    adj = {}
    for a, b in edges:
        if not a in adj:
            adj[a] = []
        if not b in adj:
            adj[b] = []
        adj[a].append(b)
    return adj
 
# find all paths
def all_paths(adj):
    def recur(path):
        node = path[-1]
        neighbors = [neighbor for neighbor in adj[node] if not neighbor in path]
        if not neighbors:
            yield path
        for neighbor in neighbors:
            yield from recur(path + [neighbor])
 
    for node in adj:
        yield from recur([node])
 
# delete the  connections from list 
def conn_deletion(list, list_deletion):
    after_del = [x for x in list if x[0] not in list_deletion]
    return after_del
 
 # get paths where len of path is > 2 and save them as a leaf to node. Also save connections to deletion.
def unpack_paths(my_list):
    list_of_more_succ = []
    to_deletion = []
    for item in my_list:
        if len(item) == 1:
            print("len 1",item)
            to_deletion.append(item[0])
        elif len(item) > 2:
            for i in range(1, len(item) - 1):
                to_deletion.append(item[i])
                print("len > 2", item[i])
                if [item[0], item[i]] in list_of_more_succ:
                    pass
                else:
                    list_of_more_succ.append([item[0], item[i]])
    list_concat = my_list + list_of_more_succ
    sorted_list = list(k for k, _ in itertools.groupby(list_concat))
    final = conn_deletion(sorted_list, list(dict.fromkeys(to_deletion)))
    return final
 
 
data = [line.split() for line in open('data.txt', encoding='utf-8')]
 
adj = create_adj(data)
print(adj)
 
workbook = Workbook()
sheet = workbook.active
sheet["A1"] = "Source"
sheet["B1"] = "Child"
 
loaded = list(all_paths(adj))
 
final_edited = unpack_paths(loaded)
 
# Save data to excel file. We don't want paths with len == 1 or > 2.
for row, item in enumerate(final_edited, start=2):
    if len(item) > 2:
        pass
    elif len(item) == 1:
        pass
    else:
        sheet[f"A{row}"] = item[0]
        sheet[f"B{row}"] = item[1]
workbook.save("DataMap.xlsx")

标签: pythonalgorithmoptimizationtree

解决方案


我建议更改all_pathsleaf_paths,这意味着它只会产生从叶子开始的那些路径。

然后使用这些路径:

  • 识别它通向的根(路径中的最后一个元素)
  • 识别叶子(路径中的第一个元素)
  • 迭代该路径中的所有非叶子并将它们中的每一个与叶子组合成一对。
  • 将这些对存储在以根为键的字典中

以下是您将如何all_paths在标有注释的两个位置进行更改:

def leaf_paths(adj):
    def recur(path):
        node = path[-1]
        neighbors = [neighbor for neighbor in adj[node] if not neighbor in path]
        if not neighbors:
            yield path
        for neighbor in neighbors:
            yield from recur(path + [neighbor])

    # identify the internal nodes (not leaves)
    internals = set(parent for parents in adj.values() for parent in parents) 
    for node in adj:
        if not node in internals:  # require that the starting node is a leaf
            yield from recur([node])

然后添加这个函数:

def all_leaf_pairs(paths):
    trees = {}

    for path in paths:
        if len(path) > 1:
            root = path[-1]
            if not root in trees:
                trees[root] = []
            it = iter(path)
            leaf = next(it)
            trees[root].extend((leaf, node) for node in it)
    return trees

你的主程序会做:

data = [
    ["ANALYTICAL_BALANCE","BFG_DEPOSIT"],
    ["CUSTOMER_DETAIL","BALANCE"],
    ["BFG_2056", "FFD_15"],
    ["BALANCE","BFG_16"],
    ["BFG_16","STAT_HIST"],
    ["ANALYTICAL_BALANCE","BFG_2056"],
    ["CUSTOM_DATA","AND_11"],
    ["AND_11","DICT_DEAL"],
    ["DICT_DEAL","BFG_2056"]
]

adj = create_adj(data)
paths = leaf_paths(adj)

import pprint
pprint.pprint(all_leaf_pairs(paths))

这将输出:

{'BFG_DEPOSIT': [('ANALYTICAL_BALANCE', 'BFG_DEPOSIT')],
 'FFD_15': [('ANALYTICAL_BALANCE', 'BFG_2056'),
            ('ANALYTICAL_BALANCE', 'FFD_15'),
            ('CUSTOM_DATA', 'AND_11'),
            ('CUSTOM_DATA', 'DICT_DEAL'),
            ('CUSTOM_DATA', 'BFG_2056'),
            ('CUSTOM_DATA', 'FFD_15')],
 'STAT_HIST': [('CUSTOMER_DETAIL', 'BALANCE'),
               ('CUSTOMER_DETAIL', 'BFG_16'),
               ('CUSTOMER_DETAIL', 'STAT_HIST')]}

的解释leaf_paths

该函数使用递归。它recur在其范围内定义。

但主要代码从识别内部节点开始(即至少有一个孩子的节点)。由于adj为给定节点提供了父节点,我们只需要收集所有这些父节点。

我们使用这组内部节点来确保仅在叶节点上开始递归,因为在输出中我们希望路径始终以叶节点开始。

recur函数将从给定的叶子走向它可以找到的任何根。它用它可以找到的下一个父级(neighbor)扩展路径并执行递归,直到不再有父级(即,它是一个根)。当这种情况发生时,累积的路径(以叶子开始并以根结束)被产生

leaf_paths本身产生任何产生的路径recur


推荐阅读