
Problem description

I have a list of 16,000 lists. Each sublist contains a tuple and a score, like this:

mylist = [
[('only','one','time'),10.54],
[('one','time'),3.21],
[('red','hot','chili','peppers'),0.223],
[('red','hot','chili'),1.98]
]  

My goal is to iterate through the combinations in mylist and, whenever a superset/subset pair is detected, remove one of the two elements. The element to remove is the one with the lower score of the pair. So in this example I want to remove

[('one','time'),3.21],
[('red','hot','chili','peppers'),0.223]

because ('one','time') is a subset of ('only','one','time') and, of the two, it has the lower score (10.54 > 3.21),
and ('red','hot','chili','peppers') is a superset of ('red','hot','chili') and, of the two, it has the lower score (0.223 < 1.98).
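
In set terms, both relations reduce to a single subset test; a minimal sketch (variable names are mine):

a = set(('one', 'time'))
b = set(('only', 'one', 'time'))

print(a <= b)        # True: a is a subset of b, so the pair qualifies
# the lower-scoring entry of the pair is the one removed
print(min([3.21, 10.54]))  # 3.21, the score of ('one','time')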

My initial solution was brute force: take every possible 2-combination from the list, compare the tuples for the subset relation using the all() function, then remove the item with the min() score.
Because of the number of combinations to search, this performs poorly:

    from itertools import combinations
    from operator import itemgetter

    def process_list_bruteforce(mylist):
        removelist = []
        for x, y in combinations(mylist, 2):
            # One tuple's words all appear in the other's -> subset/superset pair
            if (all(word in x[0] for word in y[0])
                    or all(word in y[0] for word in x[0])):
                smallest = min([x, y], key=itemgetter(1))
                removelist.append(smallest)
        # The sublists themselves are not hashable, so deduplicate by word tuple
        removed = set(tuple(item[0]) for item in removelist)
        outlist = [x for x in mylist if tuple(x[0]) not in removed]
        return outlist

which returns:

outlist = [
[('only','one','time'),10.54],
[('red','hot','chili'),1.98]
]

So, for a list of ~16,000 sublists, that works out to roughly:

combinations = n! / (r! * (n-r)!)
combinations = 16,000! / (2! * 15,998!)
combinations = 16,000 * 15,999 / 2
combinations = 127,992,000
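
As a quick sanity check, the count can be confirmed with the standard library (math.comb is available from Python 3.8):

from math import comb

print(comb(16000, 2))  # 127992000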

Is there a smarter way to do this that reduces the 127 million combinations I need to check?

Tags: python, algorithm, sorting, math, combinations

Solution


This is probably about a thousand times faster than yours. First, like @Alexander, I convert the word tuples to sets for simpler and faster subset checks. Then I sort by set size (large to small), so I never have to check for supersets (because if |A| ≤ |B|, the only way A can be a superset of B is if A equals B, in which case it is also a subset of B).

Then comes my main trick. Say we have the word set {'red','hot','chili'} and we want to find the word sets it is a subset of. Do we need to check all other (larger or equal-size) sets? No, it suffices to check only the sets containing the word 'red'. Or only those containing 'hot'. Or only those containing 'chili'. Best to use the rarest word, i.e., the one appearing in the fewest sets (in this case, I'd guess 'chili').
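
To see the trick in isolation, here is a toy sketch of such an inverted index (data and names are mine, not from the answer's code below):

from collections import defaultdict

# Inverted index: word -> already-seen (equal-or-larger) sets containing it
sets_with_word = defaultdict(list)
for s in [{'red', 'hot', 'chili', 'peppers'}, {'red', 'hot', 'sun'}, {'only', 'one', 'time'}]:
    for word in s:
        sets_with_word[word].append(s)

query = {'red', 'hot', 'chili'}
rarest = min(query, key=lambda w: len(sets_with_word[w]))  # 'chili': in only one set
for candidate in sets_with_word[rarest]:
    print(candidate, query <= candidate)  # only this one candidate needs a subset check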

I decided to call your lists "songs" so that it's easier to talk about them.

from collections import defaultdict

def process_list_stefan(mylist):

    # Change to sets, attach the index, and sort by number of words (many to few)
    songs = [(i, set(words), score) for i, (words, score) in enumerate(mylist)]
    songs.sort(key=lambda song: -len(song[1]))

    # Check songs against others, identify the ones to remove
    remove = set()
    songs_with_word = defaultdict(list)
    for song in songs:
        i, words1, score1 = song
        # Pick the song's rarest word
        word = min(words1, key=lambda word: len(songs_with_word[word]))
        # Go through songs containing that word
        for j, words2, score2 in songs_with_word[word]:
            if words1 <= words2:
                # Lower score loses. In case of tie, lower index loses.
                remove.add(min((score1, i), (score2, j))[1])
        # Make this song available as superset candidate
        for word in words1:
            songs_with_word[word].append(song)

    # Apply the removals
    return [song for i, song in enumerate(mylist) if i not in remove]
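
Applied to the question's sample data, this should reproduce the brute-force outlist:

mylist = [
    [('only', 'one', 'time'), 10.54],
    [('one', 'time'), 3.21],
    [('red', 'hot', 'chili', 'peppers'), 0.223],
    [('red', 'hot', 'chili'), 1.98],
]
print(process_list_stefan(mylist))
# [[('only', 'one', 'time'), 10.54], [('red', 'hot', 'chili'), 1.98]]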

Update: Actually, instead of just using the song's rarest word and going through all of its "supersets" (the sets containing that word), consider all of the current song's words and use the intersection of their "supersets". In my tests on made-up data this is even about 1.6 times faster:

from collections import defaultdict

def process_list_stefan(mylist):

    # Change to sets, attach the index, and sort by number of words (many to few)
    songs = [(i, set(words), score) for i, (words, score) in enumerate(mylist)]
    songs.sort(key=lambda song: -len(song[1]))

    # Helper: Intersection of sets
    def intersect(sets):
        s = next(sets).copy()
        for t in sets:
            s &= t
        return s

    # Check songs against others, identify the ones to remove
    remove = set()
    songs_with_word = defaultdict(set)
    for song in songs:
        i, words1, score1 = song
        for j in intersect(songs_with_word[word] for word in words1):
            # Lower score loses. In case of tie, lower index loses.
            remove.add(min((score1, i), (mylist[j][1], j))[1])
        # Make this song available as superset candidate
        for word in words1:
            songs_with_word[word].add(i)

    # Apply the removals
    return [song for i, song in enumerate(mylist) if i not in remove]
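
Since both versions above share the same function name, a quick equivalence check on random data requires loading them under distinct names; process_list_rarest and process_list_intersect below are hypothetical names for the first and second version:

import random

random.seed(0)
vocab = ['w%d' % n for n in range(30)]
data = [[tuple(random.sample(vocab, random.randint(1, 5))), round(random.random() * 10, 2)]
        for _ in range(500)]

# Both variants should keep exactly the same entries
assert process_list_rarest(data) == process_list_intersect(data)  # hypothetical names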
