How to remove duplicate values from a huge dataset using Python

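Problem description

I want to remove duplicate values from a large dataset: any value that has already appeared, in the same sublist or in an earlier one, should be dropped. Please help me remove them.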

names = [ ["john","is","good","is"], ["shawn","is","bad"],...,
 ["john","shawn","is","are"] ]

expected output : [ ["john","is","good"],["shawn","bad"],...,["are"] ]

Tags: python

Solution


You can use a dictionary to keep track of the values already seen and keep only the unique ones:

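names = [ ["john","is","good","is"], ["shawn","is","bad"]]
dct = {}          # values already seen, across all sublists
uniqueNames = []
for n in names:
    temp = []     # values appearing for the first time in this sublist
    for k in n:
        if k not in dct:
            temp.append(k)
            dct[k] = 1   # mark the value as seen
    uniqueNames.append(temp)
print(uniqueNames)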

Output:

[['john', 'is', 'good'], ['shawn', 'bad']]

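The code runs in O(n*m) time, where n is the number of sublists and m is the number of elements in each sublist. A dictionary lookup takes O(1) time on average, so the membership test adds no extra factor to the complexity.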
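Because only the keys of the dictionary are ever used, the same idea can be sketched with a set, which has the same average O(1) membership test. The version below is one possible variant and not part of the original answer: the function name iter_unique is hypothetical, and it is written as a generator on the assumption that, for a very large dataset, you may prefer to process one sublist at a time instead of building the whole result in memory.

```python
def iter_unique(sublists):
    """Yield each sublist with values already seen anywhere removed."""
    seen = set()  # values encountered so far, across all sublists
    for sublist in sublists:
        fresh = []  # values appearing for the first time
        for value in sublist:
            if value not in seen:  # average O(1) membership test
                seen.add(value)
                fresh.append(value)
        yield fresh


names = [
    ["john", "is", "good", "is"],
    ["shawn", "is", "bad"],
    ["john", "shawn", "is", "are"],
]
unique_names = list(iter_unique(names))
print(unique_names)
```

Note that the set of seen values still grows with the number of distinct values, so memory use is proportional to that count even when the sublists themselves are consumed lazily.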

