Dropping duplicates when some rows contain lists and others integers/strings

Problem description

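I have a data frame in which I want to drop rows with duplicate IDs. In most cases the IDs are integers and strings. However, some ID entries are lists of multiple IDs. I cannot split these lists, but I get an error when I try to drop duplicates. For reference, I tried df = df['ID'].astype(str), and it made no difference to the errors shown below.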

Code for df:

import pandas as pd

# String IDs are quoted; some entries are lists of several IDs
d = {'ID': [999,
123,
'F41',
'99W21',
662,
123,
[552, 'F430', 'R111'],
44482,
'F41',
['M192', 5527, 7890, 111120]
]}

df = pd.DataFrame(data=d)

The ID column of the input df looks like this:

Index    ID
-------------
0         999
1         123
2         F41
3        99W21
4         662
5         123
6       [552, F430, R111]
7        44482
8         F41
9       [M192, 5527, 7890, 111120]

I want to drop the duplicates so that the output is:

Index    ID
-------------
0         999
1         123
2         F41
3        99W21
4         662
5       [552, F430, R111]
6        44482
7       [M192, 5527, 7890, 111120]

I tried df.drop_duplicates(subset=['ID'], inplace=True), which gave me this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-0186aa1e1043> in <module>
      3 # Reset index and drop CID duplicates
----> 4 df.drop_duplicates(subset=['ID'], inplace=True)
      5 df.reset_index(drop=True, inplace=True)

/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in drop_duplicates(self, subset, keep, inplace)
   4907 
   4908         inplace = validate_bool_kwarg(inplace, "inplace")
-> 4909         duplicated = self.duplicated(subset, keep=keep)
   4910 
   4911         if inplace:

/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in duplicated(self, subset, keep)
   4967 
   4968         vals = (col.values for name, col in self.items() if name in subset)
-> 4969         labels, shape = map(list, zip(*map(f, vals)))
   4970 
   4971         ids = get_group_index(labels, shape, sort=False, xnull=False)

/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in f(vals)
   4945         def f(vals):
   4946             labels, shape = algorithms.factorize(
-> 4947                 vals, size_hint=min(len(self), _SIZE_HINT_LIMIT)
   4948             )
   4949             return labels.astype("i8", copy=False), len(shape)

/usr/local/lib/python3.6/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    206                 else:
    207                     kwargs[new_arg_name] = new_arg_value
--> 208             return func(*args, **kwargs)
    209 
    210         return wrapper

/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
    670 
    671         labels, uniques = _factorize_array(
--> 672             values, na_sentinel=na_sentinel, size_hint=size_hint, na_value=na_value
    673         )
    674 

/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py in _factorize_array(values, na_sentinel, size_hint, na_value)
    506     table = hash_klass(size_hint or len(values))
    507     uniques, labels = table.factorize(
--> 508         values, na_sentinel=na_sentinel, na_value=na_value
    509     )
    510 

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()

TypeError: unhashable type: 'list'

I also tried df = pd.DataFrame(np.unique(df), columns=df.columns), which gave this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-5b335a526fd5> in <module>
      3 # Reset index and drop CID duplicates
----> 4 df = pd.DataFrame(np.unique(df), columns=df.columns)
      5 df.reset_index(drop=True, inplace=True)

<__array_function__ internals> in unique(*args, **kwargs)

/usr/local/lib/python3.6/dist-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
    260     ar = np.asanyarray(ar)
    261     if axis is None:
--> 262         ret = _unique1d(ar, return_index, return_inverse, return_counts)
    263         return _unpack_tuple(ret)
    264 

/usr/local/lib/python3.6/dist-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
    308         aux = ar[perm]
    309     else:
--> 310         ar.sort()
    311         aux = ar
    312     mask = np.empty(aux.shape, dtype=np.bool_)

TypeError: '<' not supported between instances of 'float' and 'str'

If there is a way around this, I don't know what it is, so any help would be appreciated.

Tags: python, python-3.x, pandas, dataframe

Solution


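The unhashable type: 'list' error means that pandas tried to hash a list. To find duplicates, drop_duplicates hashes the values in the column, and a list cannot be hashed.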

Most of Python's immutable built-in objects are hashable, while mutable containers (such as lists or dictionaries) are not.
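As a quick illustration of that rule, the sketch below calls hash() on a few values like the ones in the ID column. The helper name is_hashable is made up for this example and is not part of Python or pandas.

```python
# Hypothetical helper (not a built-in): report whether hash() accepts a value.
def is_hashable(value):
    """Return True if hash(value) succeeds, False if it raises TypeError."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


print(is_hashable(123))              # int -> True
print(is_hashable("F41"))            # str -> True
print(is_hashable([552, "F430"]))    # list -> False
print(is_hashable((552, "F430")))    # tuple of hashable items -> True
print(is_hashable((552, ["F430"])))  # tuple containing a list -> False
```

The last case shows that an immutable container is only hashable when everything inside it is hashable too.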

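Try converting the column to strings, dropping the duplicates, and converting the result back to a data frame. Note that every ID in the result is then a string: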

df = df['ID'].astype(str).drop_duplicates().to_frame()
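If the original values (integers, strings, and lists) need to be kept rather than turned into strings, one possible variation is to use the string form only to build a duplicate mask and then filter the original frame with it. This is a sketch, not part of the original answer; the names is_dup and deduped are chosen for illustration, and it assumes the sample data from the question.

```python
import pandas as pd

# Sample data from the question, with string IDs quoted.
df = pd.DataFrame({"ID": [999, 123, "F41", "99W21", 662, 123,
                          [552, "F430", "R111"], 44482, "F41",
                          ["M192", 5527, 7890, 111120]]})

# Use the string form of each ID only to decide which rows are repeats.
is_dup = df["ID"].astype(str).duplicated()

# Filter the original frame, so the kept values are not converted to strings.
deduped = df[~is_dup].reset_index(drop=True)
print(deduped)
```

One caveat with any string-based comparison: the integer 123 and the string "123" have the same string form, so they would be treated as duplicates of each other.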
