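python - Pandas Pivot Table: error when filtering by a condition
Question
I have a DataFrame that I pivot, and I am trying to create an updated DataFrame that keeps only the rows whose values meet certain conditions. The problem is that each value in the column spans two lines, and the comparison must use only the first line. For example, if a col7 value is '100.2\n11', I need to compare 100.2 against the condition, and if it passes, the final DataFrame should contain the full value ('100.2\n11'), not just 100.2.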
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
'col2': ['test1', 'test1', 'test1', 'test1', 'test2', 'test2', 'test2',
'test2', 'test3', 'test3', 'test3', 'test3', 'test4', 'test5',
'test1', 'test1'],
'col3': ['t1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1',
't1', 't1', 't1', 't1', 't1'],
'col4': ['input1', 'input2', 'input3', 'input4', 'input1', 'input2',
'input3', 'input4', 'input1', 'input2', 'input3', 'input5',
'input2', 'input6', 'input1', 'input1'],
'col5': ['result1', 'result2', 'result3', 'result4', 'result1', 'result2',
'result3', 'result4', 'result1', 'result2', 'result3', 'result4',
'result2', 'result1', 'result2', 'result6'],
'col6': [10, 20, 30, 40, 10, 20, 30, 40, 10, 20, 30, 50, 20, 100, 10, 10],
'col7': ['100.2\n11','101.2\n21','102.3\n34','101.4\n41','100.0\n10','103.0\n20.6','104.0\n31.2','105.0\n42','102.0\n10.2',
'87.0\n15','107.0\n32.1','110.2\n61.2','120.0\n22.4','88.0\n90','106.2\n16.2','101.1\n10.1']})
df1=df.pivot_table(values = 'col7', index = ['col4', 'col5', 'col6'], columns = ['col2'], aggfunc = 'max')
df2 = df1[((df1.groupby(level='col4').rank(ascending=False) == 1.).any(axis=1)) & (df1 >= 105).any(axis=1)]
print(df2)
I get the following error:
File "pandas\_libs\ops.pyx", line 107, in pandas._libs.ops.scalar_compare
TypeError: '>=' not supported between instances of 'str' and 'int'
After the conditions are applied, the final pivot table output should look like this:
col2 test1 test2 test3 test4 test5
col4 col5 col6
input1 result2 10 106.2\n16.2 NaN NaN NaN NaN
input2 result2 20 101.2\n21 103.0\n20.6 87.0\n15 120.0\n22.4 NaN
input3 result3 30 102.3\n34 104.0\n31.2 107.0\n32.1 NaN NaN
input4 result4 40 101.4\n41 105.0\n42 NaN NaN NaN
input5 result4 50 NaN NaN 110.2\n61.2 NaN NaN
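Any guidance is greatly appreciated. Thanks in advance.
Answer
The comparison fails because the pivoted cells are strings, and a string cannot be compared with an integer using >=.
You can use pandas applymap to create a helper DataFrame that contains only the first-line value of each cell in df1, converted to a float,
and then apply the filter conditions to that helper while selecting rows from the original df1.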
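To see the idea in isolation, here is a minimal sketch on a single made-up column. The names `first_line_as_float`, `cells`, `first_values`, and `mask` are illustrative and do not come from the original post. It parses only the first line of each cell for the comparison, then uses the resulting boolean mask to select the original, unmodified strings.

```python
import numpy as np
import pandas as pd


def first_line_as_float(cell):
    """Return the first line of a multi-line cell as a float (hypothetical helper)."""
    # str(np.nan) is 'nan' and float('nan') is NaN, so missing cells stay missing.
    return float(str(cell).split('\n')[0])


# Illustrative data: two-line string cells plus one missing value.
cells = pd.Series(['100.2\n11', '106.2\n16.2', np.nan, '105.0\n42'], dtype=object)

# Compare the parsed floats, not the raw strings.
first_values = cells.map(first_line_as_float)
mask = first_values >= 105  # NaN compares as False

# The mask indexes the original Series, so the full two-line strings are kept.
selected = cells[mask]
print(selected.tolist())
```

Because the mask is built from the numeric helper but applied to the original data, the second line of each cell never has to be reassembled.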
...
...
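# Pivot as before; the cells are still the full two-line strings.
df1 = df.pivot_table(values='col7', index=['col4', 'col5', 'col6'], columns=['col2'], aggfunc='max')
# Helper frame: the first line of each cell as a float (NaN cells stay NaN).
df_tmp = df1.applymap(lambda x: float(str(x).split('\n')[0]))
# Build the mask from the numeric helper, but index the original df1 with it.
df2 = df1[
    ((df_tmp.groupby(level='col4').rank(ascending=False) == 1.).any(axis=1)) &
    (df_tmp >= 105).any(axis=1)
]
print(df2)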
col2 test1 test2 test3 test4 test5
col4 col5 col6
input1 result2 10 106.2\n16.2 NaN NaN NaN NaN
input2 result2 20 101.2\n21 103.0\n20.6 87.0\n15 120.0\n22.4 NaN
input3 result3 30 102.3\n34 104.0\n31.2 107.0\n32.1 NaN NaN
input4 result4 40 101.4\n41 105.0\n42 NaN NaN NaN
input5 result4 50 NaN NaN 110.2\n61.2 NaN NaN
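One caveat, hedged because the answer does not say which pandas version it targets: `DataFrame.applymap` was deprecated in pandas 2.1 in favor of `DataFrame.map`. The sketch below is one way to write the helper step so that it runs with either spelling. It uses a small made-up frame rather than the question's data, and `pivoted`, `numeric`, `kept`, `elementwise`, and `first_line_as_float` are illustrative names.

```python
import numpy as np
import pandas as pd


def first_line_as_float(cell):
    """Parse the first line of a cell as a float (hypothetical helper)."""
    return float(str(cell).split('\n')[0])


# Use DataFrame.map when it exists (pandas >= 2.1), otherwise fall back to applymap.
elementwise = getattr(pd.DataFrame, 'map', None) or pd.DataFrame.applymap

# Illustrative stand-in for a pivoted frame with two-line string cells.
pivoted = pd.DataFrame(
    {'test1': ['101.4\n41', np.nan], 'test2': ['105.0\n42', '88.0\n90']},
    index=['input4', 'input6'],
    dtype=object,
)

# Numeric helper frame, then filter the original frame with the mask.
numeric = elementwise(pivoted, first_line_as_float)
kept = pivoted[(numeric >= 105).any(axis=1)]
print(kept)
```

Only the `input4` row survives here, since 105.0 meets the threshold while the `input6` row has only NaN and 88.0.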