python - Dropping duplicates when some rows contain lists and others contain integers/strings
Problem description
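I have a DataFrame from which I want to drop rows with duplicate IDs. In most cases the ID is an integer or a string. However, some ID entries are lists of multiple IDs. I can't split these lists up, but I get an error when I try to drop the duplicates. For reference, I tried df = df['ID'].astype(str), and it made no difference to the error shown below.
Code for df: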
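import pandas as pd

# String IDs need quotes; two of the entries are lists of several IDs.
d = {'ID': [999,
            123,
            'F41',
            '99W21',
            662,
            123,
            [552, 'F430', 'R111'],
            44482,
            'F41',
            ['M192', 5527, 7890, 111120]
            ]}
df = pd.DataFrame(data=d)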
The ID column of the input df looks like this:
Index ID
-------------
0 999
1 123
2 F41
3 99W21
4 662
5 123
6 [552, F430, R111]
7 44482
8 F41
9 [M192, 5527, 7890, 111120]
I want to drop the duplicates so that the output is:
Index ID
-------------
0 999
1 123
2 F41
3 99W21
4 662
5 [552, F430, R111]
6 44482
7 [M192, 5527, 7890, 111120]
I tried df.drop_duplicates(subset=['ID'], inplace=True), which gave me this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-13-0186aa1e1043> in <module>
3 # Reset index and drop CID duplicates
----> 4 df.drop_duplicates(subset=['ID'], inplace=True)
5 df.reset_index(drop=True, inplace=True)
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in drop_duplicates(self, subset, keep, inplace)
4907
4908 inplace = validate_bool_kwarg(inplace, "inplace")
-> 4909 duplicated = self.duplicated(subset, keep=keep)
4910
4911 if inplace:
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in duplicated(self, subset, keep)
4967
4968 vals = (col.values for name, col in self.items() if name in subset)
-> 4969 labels, shape = map(list, zip(*map(f, vals)))
4970
4971 ids = get_group_index(labels, shape, sort=False, xnull=False)
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in f(vals)
4945 def f(vals):
4946 labels, shape = algorithms.factorize(
-> 4947 vals, size_hint=min(len(self), _SIZE_HINT_LIMIT)
4948 )
4949 return labels.astype("i8", copy=False), len(shape)
/usr/local/lib/python3.6/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
206 else:
207 kwargs[new_arg_name] = new_arg_value
--> 208 return func(*args, **kwargs)
209
210 return wrapper
/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
670
671 labels, uniques = _factorize_array(
--> 672 values, na_sentinel=na_sentinel, size_hint=size_hint, na_value=na_value
673 )
674
/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py in _factorize_array(values, na_sentinel, size_hint, na_value)
506 table = hash_klass(size_hint or len(values))
507 uniques, labels = table.factorize(
--> 508 values, na_sentinel=na_sentinel, na_value=na_value
509 )
510
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()
TypeError: unhashable type: 'list'
I also tried df = pd.DataFrame(np.unique(df), columns=df.columns), which gave this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-5b335a526fd5> in <module>
3 # Reset index and drop CID duplicates
----> 4 df = pd.DataFrame(np.unique(df), columns=df.columns)
5 df.reset_index(drop=True, inplace=True)
<__array_function__ internals> in unique(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
260 ar = np.asanyarray(ar)
261 if axis is None:
--> 262 ret = _unique1d(ar, return_index, return_inverse, return_counts)
263 return _unpack_tuple(ret)
264
/usr/local/lib/python3.6/dist-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
308 aux = ar[perm]
309 else:
--> 310 ar.sort()
311 aux = ar
312 mask = np.empty(aux.shape, dtype=np.bool_)
TypeError: '<' not supported between instances of 'float' and 'str'
If there is a way to solve this, I'm not sure what it is, so any help would be appreciated.
Solution
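The unhashable type: 'list' error means that pandas tried to hash a list, that is, to use it as a key in a hash table, which Python does not allow.
Most of Python's immutable built-in objects are hashable, while mutable containers (such as lists or dictionaries) are not.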
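To make the hashability rule concrete, here is a small sketch. The helper name is_hashable is made up for this illustration and is not part of pandas or the standard library; it simply calls the built-in hash() and reports whether that raises TypeError.

```python
# is_hashable is a hypothetical helper written for this example only.
def is_hashable(value):
    """Return True if the built-in hash() accepts the value."""
    try:
        hash(value)
    except TypeError:
        return False
    return True

# Integers and strings are immutable, so they can be hashed.
print(is_hashable(123))
print(is_hashable("F41"))

# A list is a mutable container, so hashing it raises TypeError.
print(is_hashable([552, "F430", "R111"]))

# A tuple holding the same hashable items can be hashed.
print(is_hashable((552, "F430", "R111")))
```

The last two calls are the relevant ones here: the same IDs are rejected as a list but accepted as a tuple.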
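Try converting the column to strings, dropping the duplicates, and converting the result back to a DataFrame. Note that every ID in the result is then a string: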
df = df['ID'].astype(str).drop_duplicates().to_frame()
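If the original values (integers, strings and lists) need to survive rather than being turned into strings, one possible alternative, not part of the answer above, is to build a temporary hashable key and filter on that. The sketch below assumes the list entries themselves contain only hashable items; the names key and deduped are chosen for this illustration.

```python
import pandas as pd

# Same data as in the question, with the string IDs quoted.
df = pd.DataFrame({"ID": [999, 123, "F41", "99W21", 662, 123,
                          [552, "F430", "R111"], 44482, "F41",
                          ["M192", 5527, 7890, 111120]]})

# Build a hashable key: lists become tuples, everything else is left alone.
key = df["ID"].map(lambda v: tuple(v) if isinstance(v, list) else v)

# duplicated() flags every repeat after the first occurrence;
# keep the rows that are not flagged and renumber the index.
deduped = df[~key.duplicated()].reset_index(drop=True)
print(deduped)
```

The key is only used to decide which rows to keep, so the ID column in deduped still holds the original objects, including the lists.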