Deduplicating a list of objects in Python

Question

How can I deduplicate a list of objects in Python so that list_of_objects[i] is list_of_objects[j] returns True if and only if i == j?

Example:

I have two clusters of numbers, and I build a dictionary with the numbers as keys and the clusters as values.

a = {1,2,3}
b = {4,5,6}
cur_dict = {1:a, 2:a, 3:a, 4:b, 5:b, 6:b}
duplicated_clusters = list(cur_dict.values())
duplicated_clusters
# [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {4, 5, 6}, {4, 5, 6}, {4, 5, 6}]
# How to process duplicated_clusters to get [{1, 2, 3}, {4, 5, 6}]?

# set(duplicated_clusters) does not work: sets are mutable and therefore unhashable, so this raises TypeError.

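Since Python has no pointers, how can I get a list of deduplicated objects (or is this impossible)? I can think of some workarounds, but none of them feels straightforward to me, for example using extra identifiers or wrapping each object in a wrapper class.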

# An example workaround, but I would like a more straightforward way.
# Note: iterating over a set of names does not guarantee the output order.
a = {1,2,3}
b = {4,5,6}
cluster_dict = {"clusterA": a, "clusterB": b}
cur_dict = {1:"clusterA", 2:"clusterA", 3:"clusterA", 4:"clusterB", 5:"clusterB", 6:"clusterB"}
duplicated_cluster_names = list(cur_dict.values())
deduplicated_clusters = [cluster_dict[name] for name in set(duplicated_cluster_names)]
deduplicated_clusters
# [{1, 2, 3}, {4, 5, 6}]

Example 2:

Thanks to @wjandrea's comment, I have added an example for clarity.

a = {1,2,3}
b = {4,5,6}
c = {1,2,3}
duplicated_clusters = [a,a,b,b,c,c]
duplicated_clusters
# [{1, 2, 3}, {1, 2, 3}, {4, 5, 6}, {4, 5, 6}, {1, 2, 3}, {1, 2, 3}]
# Deduplicated clusters I want to obtain: [{1, 2, 3}, {4, 5, 6}, {1, 2, 3}], equivalent to [a,b,c]

Tags: python

Answer


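The id function returns a value that is guaranteed to be unique and constant for an object during its lifetime. You can use it as a key to identify duplicate objects.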

a = {1,2,3}
b = {4,5,6}
cur_dict = {1:a, 2:a, 3:a, 4:b, 5:b, 6:b}
duplicated_clusters = list(cur_dict.values())
result = list({id(x): x for x in duplicated_clusters}.values())
print(result)

Result:

[{1, 2, 3}, {4, 5, 6}]
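
The same idea can be wrapped in a small helper. The sketch below is an illustration rather than part of the original answer, and the name dedupe_by_identity is hypothetical. It assumes Python 3.7 or later, where dicts preserve insertion order, so the first occurrence of each object keeps its position. It is applied here to Example 2 from the question.

```python
def dedupe_by_identity(items):
    """Return a list with one entry per distinct object, in first-seen order."""
    seen = {}
    for obj in items:
        # setdefault keeps the first object stored under a given id;
        # holding the object in the dict also keeps it alive, so its id
        # cannot be reused by another object while we iterate.
        seen.setdefault(id(obj), obj)
    return list(seen.values())


a = {1, 2, 3}
b = {4, 5, 6}
c = {1, 2, 3}  # equal to a, but a different object

result = dedupe_by_identity([a, a, b, b, c, c])
print(result)  # [{1, 2, 3}, {4, 5, 6}, {1, 2, 3}]
print([x is y for x, y in zip(result, [a, b, c])])  # [True, True, True]
```

Because the comparison is by identity rather than by value, a and c both survive even though a == c.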

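"Python has no pointers" is only mostly true. In CPython, id returns the address of the object in memory, so it is effectively a pointer to that object. But this approach also works on more exotic implementations where id has nothing to do with memory addresses. As long as a is b implies id(a) == id(b) and vice versa, this approach should eliminate referential duplicates.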
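
To make the distinction concrete, here is a minimal sketch contrasting equality with identity. The variable names alias and clone are illustrative only. For two objects that are alive at the same time, x is y and id(x) == id(y) always agree.

```python
a = {1, 2, 3}
alias = a        # a second name bound to the same object
clone = set(a)   # a new object with equal contents

print(a == clone, a is clone)  # True False
print(a == alias, a is alias)  # True True

# For simultaneously live objects, identity and id() agree.
checks = [(x is y) == (id(x) == id(y)) for x, y in [(a, alias), (a, clone)]]
print(checks)  # [True, True]
```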


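...That said, keep in mind that Python often "interns" certain kinds of built-in values, so objects that you might expect to be referentially unique may actually be the same object. Consider this example: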

a = {1,2,3}
b = {1,2,3}
c = (4,5,6)
d = (4,5,6)
e = int("23")           #the parser doesn't know what value this will be until runtime
f = 23
g = int("456789101112") #the parser doesn't know what value this will be until runtime
h = 456789101112
i = 456789101111+1      #the parser knows at compile time that this evaluates to 456789101112
cur_dict = {1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i}
duplicated_clusters = list(cur_dict.values())
result = list({id(x): x for x in duplicated_clusters}.values())
print(result)

Result (in CPython):

[{1, 2, 3}, {1, 2, 3}, (4, 5, 6), 23, 456789101112, 456789101112]

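Sets are mutable, so they are never interned. Tuples are immutable, so equal tuples may end up as a single shared object. Small ints are interned (CPython caches the range -5 to 256), even if you go out of your way to create them in a way that prevents the parser from guessing their value at compile time. Large ints are usually not interned, although two large int values created by arithmetic expressions that can be optimized into a single constant at compile time may still be referentially identical.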
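
If in doubt, you can check the identity behavior of a given type directly. The sketch below is a minimal illustration, not part of the original answer. Only the set result is guaranteed by the language; the int results are CPython implementation details and may differ on other interpreters or versions, so no expected output is shown for them.

```python
# Mutable containers: every set display creates a new object.
s1, s2 = {1, 2, 3}, {1, 2, 3}
print(s1 is s2)  # False

# Small ints built at runtime: CPython caches a range of small integers,
# so these are typically the same object (implementation detail).
n1, n2 = int("23"), int("23")
print(n1 is n2)

# Large ints built at runtime are typically distinct objects
# (also an implementation detail).
m1, m2 = int("456789101112"), int("456789101112")
print(m1 is m2)
```

When the list to be deduplicated may contain such values, identity-based deduplication can merge entries that were created independently.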

