python - 查找从叶子到森林中每个节点的所有路径
问题描述
我部分地问了这个问题,因为我没有足够的信息,但现在我有了,我可以问完整的问题。所以我在文本文件中有两列数据。第一列是前任,第二列是后继。我使用以下代码加载数据:
[line.split() for line in open('data.txt', encoding ='utf-8')]
假设我们的数据在文件中如下所示:
ANALYTICAL_BALANCE BFG_DEPOSIT
CUSTOMER_DETAIL BALANCE
BFG_2056 FFD_15
BALANCE BFG_16
BFG_16 STAT_HIST
ANALYTICAL_BALANCE BFG_2056
CUSTOM_DATA AND_11
AND_11 DICT_DEAL
DICT_DEAL BFG_2056
并在加载后
[[ANALYTICAL_BALANCE,BFG_DEPOSIT],
[CUSTOMER_DETAIL,BALANCE],
[BFG_2056, FFD_15],
[BALANCE,BFG_16],
[BFG_16,STAT_HIST],
[ANALYTICAL_BALANCE,BFG_2056],
[CUSTOM_DATA,AND_11],
[AND_11,DICT_DEAL],
[DICT_DEAL,BFG_2056]]
然后我想连接这些数据。我创建了邻接列表:
def create_adj(edges):
adj = {} # or use defaultdict(list) to avoid `if` in the loop below
for a, b in edges:
if not a in adj:
adj[a] = []
if not b in adj:
adj[b] = []
adj[a].append(b)
return adj
并获取所有路径:
def all_paths(adj):
def recur(path):
node = path[-1]
neighbors = [neighbor for neighbor in adj[node] if not neighbor in path]
if not neighbors:
yield path
for neighbor in neighbors:
yield from recur(path + [neighbor])
for node in adj:
yield from recur([node])
所以输出看起来像这样:
data = [
["ANALYTICAL_BALANCE","BFG_DEPOSIT"],
["CUSTOMER_DETAIL","BALANCE"],
["BFG_2056", "FFD_15"],
["BALANCE","BFG_16"],
["BFG_16","STAT_HIST"],
["ANALYTICAL_BALANCE","BFG_2056"],
["CUSTOM_DATA","AND_11"],
["AND_11","DICT_DEAL"],
["DICT_DEAL","BFG_2056"]
]
adj = create_adj(data)
print([path for path in all_paths(adj) if len(path) > 1])
[ANALYTICAL_BALANCE,BFG_DEPOSIT]
[CUSTOMER_DETAIL,BALANCE,BFG_16,STAT_HIST]
[BFG_2056,FFD_15]
[BALANCE,BFG_16,STAT_HIST]
[ANALYTICAL_BALANCE,BFG_2056,FFD_15]
[CUSTOM_DATA,AND_11,DICT_DEAL,BFG_2056,FFD_15]
[AND_11,DICT_DEAL,BFG_2056,FFD_15]
[DICT_DEAL,BFG_2056,FFD_15]
我们可以将连接可视化为创建森林的单独树。由于输入数据的性质,这些树不会有任何循环。
现在我的问题是。如何获得从叶子到每棵树的每个节点的每个连接?我的意思是什么。我们有 3 棵树,所以我将从顶部的那棵开始。
Tree1:
ANALYTICAL_BALANCE BFG_DEPOSIT
Tree2:
ANALYTICAL_BALANCE BFG_2056
ANALYTICAL_BALANCE FFD_15
CUSTOM_DATA AND_11
CUSTOM_DATA DICT_DEAL
CUSTOM_DATA BFG_2056
CUSTOM_DATA FFD_15
Tree3:
CUSTOMER_DETAIL BALANCE
CUSTOMER_DETAIL BFG_16
CUSTOMER_DETAIL STAT_HIST
如您所见,我的第一次尝试是创建邻接列表并查找所有路径。然后我会删除非叶子节点之间的连接并过滤数据。输入 150 行很好,但是当我输入 13k 行的完整文件时,代码编译了 2 天,没有任何结束的迹象。因此,我正在为我的工作寻找最有效的代码或算法以及最佳数据类型(列表、数据框等)。任何帮助将不胜感激,因为我已经与它斗争了几天,我找不到任何关于如何解决这个问题的想法。如果有不清楚的地方,我会编辑帖子。
数据将使用 openpyxl 保存到 excel 文件中,因此当我按后继过滤时,我可以看到前任列中连接到此后继的每一片叶子。
这是我的整个代码。
# create adjacencies
def create_adj(edges):
adj = {}
for a, b in edges:
if not a in adj:
adj[a] = []
if not b in adj:
adj[b] = []
adj[a].append(b)
return adj
# find all paths
def all_paths(adj):
def recur(path):
node = path[-1]
neighbors = [neighbor for neighbor in adj[node] if not neighbor in path]
if not neighbors:
yield path
for neighbor in neighbors:
yield from recur(path + [neighbor])
for node in adj:
yield from recur([node])
# delete the connections from list
def conn_deletion(list, list_deletion):
after_del = [x for x in list if x[0] not in list_deletion]
return after_del
# get paths where len of path is > 2 and save them as a leaf to node. Also save connections to deletion.
def unpack_paths(my_list):
list_of_more_succ = []
to_deletion = []
for item in my_list:
if len(item) == 1:
print("len 1",item)
to_deletion.append(item[0])
elif len(item) > 2:
for i in range(1, len(item) - 1):
to_deletion.append(item[i])
print("len > 2", item[i])
if [item[0], item[i]] in list_of_more_succ:
pass
else:
list_of_more_succ.append([item[0], item[i]])
list_concat = my_list + list_of_more_succ
sorted_list = list(k for k, _ in itertools.groupby(list_concat))
final = conn_deletion(sorted_list, list(dict.fromkeys(to_deletion)))
return final
data = [line.split() for line in open('data.txt', encoding='utf-8')]
adj = create_adj(data)
print(adj)
workbook = Workbook()
sheet = workbook.active
sheet["A1"] = "Source"
sheet["B1"] = "Child"
loaded = list(all_paths(adj))
final_edited = unpack_paths(loaded)
# Save data to excel file. We don't want paths with len == 1 or > 2.
for row, item in enumerate(final_edited, start=2):
if len(item) > 2:
pass
elif len(item) == 1:
pass
else:
sheet[f"A{row}"] = item[0]
sheet[f"B{row}"] = item[1]
workbook.save("DataMap.xlsx")
解决方案
我建议更改all_paths
为leaf_paths
,这意味着它只会产生从叶子开始的那些路径。
然后使用这些路径:
- 识别它通向的根(路径中的最后一个元素)
- 识别叶子(路径中的第一个元素)
- 迭代该路径中的所有非叶子并将它们中的每一个与叶子组合成一对。
- 将这些对存储在以根为键的字典中
以下是您将如何all_paths
在标有注释的两个位置进行更改:
def leaf_paths(adj):
def recur(path):
node = path[-1]
neighbors = [neighbor for neighbor in adj[node] if not neighbor in path]
if not neighbors:
yield path
for neighbor in neighbors:
yield from recur(path + [neighbor])
# identify the internal nodes (not leaves)
internals = set(parent for parents in adj.values() for parent in parents)
for node in adj:
if not node in internals: # require that the starting node is a leaf
yield from recur([node])
然后添加这个函数:
def all_leaf_pairs(paths):
trees = {}
for path in paths:
if len(path) > 1:
root = path[-1]
if not root in trees:
trees[root] = []
it = iter(path)
leaf = next(it)
trees[root].extend((leaf, node) for node in it)
return trees
你的主程序会做:
data = [
["ANALYTICAL_BALANCE","BFG_DEPOSIT"],
["CUSTOMER_DETAIL","BALANCE"],
["BFG_2056", "FFD_15"],
["BALANCE","BFG_16"],
["BFG_16","STAT_HIST"],
["ANALYTICAL_BALANCE","BFG_2056"],
["CUSTOM_DATA","AND_11"],
["AND_11","DICT_DEAL"],
["DICT_DEAL","BFG_2056"]
]
adj = create_adj(data)
paths = leaf_paths(adj)
import pprint
pprint.pprint(all_leaf_pairs(paths))
这将输出:
{'BFG_DEPOSIT': [('ANALYTICAL_BALANCE', 'BFG_DEPOSIT')],
'FFD_15': [('ANALYTICAL_BALANCE', 'BFG_2056'),
('ANALYTICAL_BALANCE', 'FFD_15'),
('CUSTOM_DATA', 'AND_11'),
('CUSTOM_DATA', 'DICT_DEAL'),
('CUSTOM_DATA', 'BFG_2056'),
('CUSTOM_DATA', 'FFD_15')],
'STAT_HIST': [('CUSTOMER_DETAIL', 'BALANCE'),
('CUSTOMER_DETAIL', 'BFG_16'),
('CUSTOMER_DETAIL', 'STAT_HIST')]}
的解释leaf_paths
该函数使用递归。它recur
在其范围内定义。
但主要代码从识别内部节点开始(即至少有一个孩子的节点)。由于adj
为给定节点提供了父节点,我们只需要收集所有这些父节点。
我们使用这组内部节点来确保仅在叶节点上开始递归,因为在输出中我们希望路径始终以叶节点开始。
该recur
函数将从给定的叶子走向它可以找到的任何根。它用它可以找到的下一个父级(neighbor
)扩展路径并执行递归,直到不再有父级(即,它是一个根)。当这种情况发生时,累积的路径(以叶子开始并以根结束)被产生。
leaf_paths
本身产生任何产生的路径recur
。
推荐阅读
- python - 在不加载到内存的情况下估计 pandas 数据框的大小
- raku - 有没有办法安全地重新声明符号?
- sql - Oracle SQL 逗号格式?
- tensorflow - external/local_config_mlir/include/mlir/IR/Attributes.h:783:20:内部编译器错误:在assign_temp中,在function.c:968
- django - 递归遍历模型
- angular - Angular没有将表单数据推送到数组
- javascript - 为 JS 变量提取 JSON 键
- c# - .net 慢 SqlDataReader
- python - Django REST Framework - 如何禁用非员工用户的可浏览 API (is_staff=False)
- excel - 为什么在powershell中调用quit()后excel.exe没有关闭?