python - 如何优化python列表中的比较组合
问题描述
我有一个包含 16,000 个列表的列表。每个子列表包含一个元组和一个分数,如下所示:
mylist = [
[('only','one','time'),10.54],
[('one','time'),3.21],
[('red','hot','chili','peppers'),0.223],
[('red','hot','chili'),1.98]
]
我的目标是遍历 mylist 中的组合,并在检测到超集或子集时删除任何元素。要删除的元素基于两者之间的最低分数。所以在这个例子中,我想删除
[('one','time'),3.21],
[('red','hot','chili','peppers'),0.223]
因为 ('one','time') 是 ('only','one','time') 的子集,并且在两者之间,('one','time') 得分最低 10.54>3.21。
('red','hot','chili','peppers') 是 ('red','hot','chili') 的超集,介于两者之间,0.223<1.98
我最初的解决方案是蛮力 - 从列表中选择 2 获取每个可能的组合,然后使用 all() 函数比较子集的元组,然后删除具有 min() 分数的项目。
由于要搜索的组合数量,这表现不佳:
from itertools import combinations
removelist = []
for x,y in combinations(mylist,2):
if (all(word in x[0] for word in y[0]) or all(word in y[0] for word in x[0])):
smallest = min([x,y],key=itemgetter(1))
removelist.append(smallest)
removelist = set(removelist)
outlist = [x for x in mylist if x not in removelist]
return outlist
返回:
outlist = [
[('only','one','time'),10.54],
[('red','hot','chili'),1.98]
]
因此,对于约 16,000 个子列表的列表,大致为:
combinations = n! / (r! * (n-r)!)
combinations = 16,000! / (2! * (15998)!)
combinations = 16,000 * 15999 / 2
combinations = 127,992,000
有没有更聪明的方法来做到这一点,减少我需要检查的 1.27 亿个项目?
解决方案
这可能比你的快一千倍。首先,我将单词元组转换为集合以进行更简单和更快的子集检查,例如@Alexander。然后我按集合大小排序,所以我不必检查超集。(因为如果 |A| ≤ |B|,那么 A 是 B 的超集的唯一方法是如果它是B,在这种情况下它也是 B 的子集)。
然后是我的主要技巧。假设我们有一个词集{'red','hot','chili'},我们想找到它是一个子集的词集。我们是否需要检查所有其他(更大或相同大小)的集合?不,只检查那些包含“红色”这个词的集合就足够了。或者只有那些“热”的。或者只有那些有“辣椒”的人。让我们用最稀有的词,即最少集合中的词(在这种情况下,我猜是“辣椒”)。
我决定将您的列表称为“歌曲”,以便谈论它们。
from collections import defaultdict
def process_list_stefan(mylist):
# Change to sets, attach the index, and sort by number of words (many to few)
songs = [(i, set(words), score) for i, (words, score) in enumerate(mylist)]
songs.sort(key=lambda song: -len(song[1]))
# Check songs against others, identify the ones to remove
remove = set()
songs_with_word = defaultdict(list)
for song in songs:
i, words1, score1 = song
# Pick the song's rarest word
word = min(words1, key=lambda word: len(songs_with_word[word]))
# Go through songs containing that word
for j, words2, score2 in songs_with_word[word]:
if words1 <= words2:
# Lower score loses. In case of tie, lower index loses.
remove.add(min((score1, i), (score2, j))[1])
# Make this song available as superset candidate
for word in words1:
songs_with_word[word].append(song)
# Apply the removals
return [song for i, song in enumerate(mylist) if i not in remove]
更新:实际上,不只是使用歌曲中最稀有的单词并遍历其所有“超集”(包含该单词的集合),而是考虑当前歌曲中的所有单词并使用它们的“超集”的交集。在我对虚构数据的测试中,它甚至快了大约 1.6 倍:
from collections import defaultdict
def process_list_stefan(mylist):
# Change to sets, attach the index, and sort by number of words (many to few)
songs = [(i, set(words), score) for i, (words, score) in enumerate(mylist)]
songs.sort(key=lambda song: -len(song[1]))
# Helper: Intersection of sets
def intersect(sets):
s = next(sets).copy()
for t in sets:
s &= t
return s
# Check songs against others, identify the ones to remove
remove = set()
songs_with_word = defaultdict(set)
for song in songs:
i, words1, score1 = song
for j in intersect(songs_with_word[word] for word in words1):
# Lower score loses. In case of tie, lower index loses.
remove.add(min((score1, i), (mylist[j][1], j))[1])
# Make this song available as superset candidate
for word in words1:
songs_with_word[word].add(i)
# Apply the removals
return [song for i, song in enumerate(mylist) if i not in remove]
推荐阅读
- javascript - Javascript-Selenium WebDriverJS (Not Java using ashot)- 整页截图
- java - 在java中使用use和this关键字的正确方法是什么?
- android - 如何自定义 Floating-ArcMenu(卫星菜单选项)
- c# - 来自 ... tls 的 TLS 握手错误:客户端未提供证书
- dialogflow-es - 如何在 Google 上的操作中为每个 webhook 请求获取设备粗略位置
- ruby-on-rails - 在 Rails 迁移中为数据类型数组添加默认值
- celery - '*reply.celery.pidbox' 代表什么
- flutter - Flutter FutureBuilder返回Null错误触发
- javascript - 在谷歌地图中如何限制印度单个城市的自动搜索
- html - 使用 html 和 css 的粗体圆形和彩色列出的项目