首页 > 解决方案 > Efficiently remove partial duplicates in a list of tuples

问题描述

I have a list of tuples, the list can vary in length between ~8 - 1000 depending on the length of the tuples. Each tuple in the list is unique. A tuple is of length N where each entry is a generic word.

An example tuple can be of length N (Word 1, Word 2, Word 3, ..., Word N)

For any tuple in the list, element j in said tuple will either be '' or Word j

A very simplified example with alphabetic letters would be

l = [('A', 'B', '', ''), ('A', 'B', 'C', ''), 
     ('', '', '', 'D'), ('A', '', '', 'D'), 
     ('', 'B', '', '')]

Every position at each tuple will either have the same value or be empty. I want to remove all the tuples which have all their non '' values in another tuple at the same position. As an example, (A,B,'','') has all its non '' values in (A,B,C,'') and should therefore be removed.

filtered_l = [(A,B,C,''),(A,'','',D)]

The length of the tuples is always of the same length (not necessarily 4). The length of the tuples would be between 2-10.

What is the fastest way to do this?

标签: pythonlistperformancetuples

解决方案


让我们将每个元组概念化为一个二进制数组,其中 1 是“包含某些东西”,而 2 是“包含一个空字符串”。由于每个位置的项目都是相同的,我们不需要关心每个位置是什么,只关心某物是什么。

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
l_bin = [sum(2**i if k else 0 for i,k in enumerate(tup)) for tup in l]
# [3, 7, 8, 9, 2]
# [0b0011, 0b0111, 0b1000, 0b1001, 0b0010]
# that it's backwards doesn't really matter, since it's consistent

现在,我们可以遍历该列表并构建一个没有“重复”的新数据结构。由于我们将元组编码为二进制,因此我们可以通过按位运算确定重复的,被另一个“包含” - 给定a并且b,如果a | b == a,则a必须包含b

codes = {}
for tup, b in zip(l, l_bin):
    # check if any existing code contains the potential new one
    # in this case, skip adding the new one
    if any(a | b == a for a in codes):
        continue
    # check if the new code contains a potential existing one or more
    # in which case, replace the existing code(s) with the new code
    for a in list(codes):
        if b | a == b:
            codes.pop(a)
    # and finally, add this code to our datastructure
    codes[b] = tup

现在我们可以撤回我们的“过滤”元组列表:

output = list(codes.values())
# [('A', 'B', 'C', ''), ('A', '', '', 'D')]

请注意(A, B, C, '')包含(A, B, '', '')('', B, '', ''),并且(A, '', '', D')包含('', '', '', D),所以这应该是正确的。

从 python 3.8 开始,dict保留插入顺序,因此输出应该与元组最初出现在列表中的顺序相同。

这个解决方案不会非常有效,因为代码的数量可能会堆积起来,但它应该在 O(n) 和 O(n^2) 之间,具体取决于最后剩下的唯一代码的数量(并且因为每个元组的长度明显小于 的长度l,它应该更接近 O(n) 而不是 O(n^2)。


推荐阅读