pandas - Querying a column that contains lists
Question
I have a DataFrame in which some columns contain lists. How can I query these columns?
>>> df1.shape
(1812871, 7)
>>> df1.dtypes
CHROM object
POS int32
ID object
REF object
ALT object
QUAL int8
FILTER object
dtype: object
>>> df1.head()
CHROM POS ID REF ALT QUAL FILTER
0 20 60343 rs527639301 G [A] 100 [PASS]
1 20 60419 rs538242240 A [G] 100 [PASS]
2 20 60479 rs149529999 C [T] 100 [PASS]
3 20 60522 rs150241001 T [TC] 100 [PASS]
4 20 60568 rs533509214 A [C] 100 [PASS]
>>> df2 = df1.head(30)
>>> df3 = df1.head(3000)
I found an earlier question on this, but its solutions did not quite work for me. The accepted solution fails:
>>> df2[df2.ALT.apply(lambda x: x == ['TC'])]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2682, in __getitem__
return self._getitem_array(key)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
indexer = self.loc._convert_to_indexer(key, axis=1)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1314, in _convert_to_indexer
indexer = check = labels.get_indexer(objarr)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3259, in get_indexer
indexer = self._engine.get_indexer(target._ndarray_values)
File "pandas/_libs/index.pyx", line 301, in pandas._libs.index.IndexEngine.get_indexer
File "pandas/_libs/hashtable_class_helper.pxi", line 1544, in pandas._libs.hashtable.PyObjectHashTable.lookup
TypeError: unhashable type: 'numpy.ndarray'
The reason is that the booleans come back nested:
>>> df2.ALT.apply(lambda x: x == ['TC']).head()
0 [False]
1 [False]
2 [False]
3 [True]
4 [False]
Name: ALT, dtype: object
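The nested result, together with the `numpy.ndarray` in the traceback, suggests that each ALT cell is array-like rather than a plain list, so `x == ['TC']` compares element by element and returns an array instead of a single bool. The following is a minimal sketch under that assumption; the toy frame and the names `df_demo`, `nested` and `mask` are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each ALT cell is a NumPy array (an assumption
# based on the 'numpy.ndarray' mentioned in the traceback).
df_demo = pd.DataFrame({"POS": [60343, 60522, 60568]})
df_demo["ALT"] = pd.Series([np.array(["A"]), np.array(["TC"]), np.array(["C"])])

# Comparing an array with a list is element-wise, so every row yields an array.
nested = df_demo["ALT"].apply(lambda x: x == ["TC"])

# Converting each cell to a plain list first makes == return one bool per row,
# which gives a mask that pandas accepts for row selection.
mask = df_demo["ALT"].apply(lambda x: list(x) == ["TC"])
print(df_demo[mask])
```

Because the mask holds plain booleans, indexing with it selects rows instead of being interpreted as a list of column labels.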
So I tried the second answer, which seemed to work:
>>> c = np.empty(1, object)
>>> c[0] = ['TC']
>>> df2[df2.ALT.values == c]
CHROM POS ID REF ALT QUAL FILTER
3 20 60522 rs150241001 T [TC] 100 [PASS]
Strangely, though, it does not work when I try it on a larger DataFrame:
>>> df3[df3.ALT.values == c]
Traceback (most recent call last):
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: False
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2688, in __getitem__
return self._getitem_column(key)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
return self._get_item_cache(key)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
values = self._data.get(item)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
loc = self.items.get_loc(item)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: False
This is probably because the boolean comparison gives a different kind of result!
>>> df3.ALT.values == c
False
>>> df2.ALT.values == c
array([False, False, False, True, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False])
This is completely baffling to me.
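One plausible explanation, which the source does not confirm: the first 3000 rows include at least one ALT cell holding more than one allele. NumPy compares object arrays cell by cell and then needs a single truth value per cell, which is ambiguous for a multi-element result. Older NumPy versions are known to give up in that case and return a scalar `False`, and pandas would then look up that scalar as a column label, which matches `KeyError: False`. The sketch below checks cell lengths and builds a mask that does not depend on them; the data and the names `alt`, `lengths` and `mask` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one multi-allelic row, standing in for df3.
alt = pd.Series(
    [np.array(["A"]), np.array(["TC"]), np.array(["C", "G"])],
    name="ALT",
)

# Count the alleles in each cell; any value above 1 would make the
# whole-array comparison from the question ambiguous.
lengths = alt.apply(len)
print(lengths.value_counts())

# A row-wise comparison always yields exactly one bool per row,
# regardless of how many alleles a cell holds.
mask = alt.apply(lambda x: list(x) == ["TC"])
print(alt[mask])
```

Running the length check on the real `df3` would show whether this explanation applies.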
Solution
I found a hacky solution that works for me: convert the lists to tuples.
import pandas as pd

df = pd.DataFrame({'CHROM': [20] * 5,
                   'POS': [60343, 60419, 60479, 60522, 60568],
                   'ID': ['rs527639301', 'rs538242240', 'rs149529999', 'rs150241001', 'rs533509214'],
                   'REF': ['G', 'A', 'C', 'T', 'A'],
                   'ALT': [['A'], ['G'], ['T'], ['TC'], ['C']],
                   'QUAL': [100] * 5,
                   'FILTER': [['PASS']] * 5})
# Convert each list to a tuple so that a cell is compared as a single value
df['ALT'] = df['ALT'].apply(tuple)
# Select the rows whose ALT is exactly ('C',)
df[df['ALT'] == ('C',)]
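This works because tuples are immutable and hashable, whereas lists are not. With tuples, pandas checks whether each cell as a whole equals `('C',)` and returns one boolean per row. With lists, you instead get the comparison applied to the elements inside each list, which is where the nested boolean Series comes from.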
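If the goal is to find rows whose ALT contains a given allele, rather than rows whose ALT equals one exact value, a per-row membership test is one option. This is a minimal sketch with made-up data, not part of the original answer; `df_demo` and `contains_tc` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data: ALT cells already converted to tuples,
# with one row holding two alleles.
df_demo = pd.DataFrame({
    "POS": [60343, 60522, 60568],
    "ALT": [("A",), ("TC",), ("C", "TC")],
})

# 'in' checks membership inside each cell and returns one bool per row.
contains_tc = df_demo["ALT"].apply(lambda alleles: "TC" in alleles)
print(df_demo[contains_tc])
```

This calls a Python function once per row, so on a frame with around 1.8 million rows it may be noticeably slower than a vectorised comparison.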