python - 将多个 pandas 列(第 1 和第 2、第 3 和第 4 之后、之后等)与矢量化(更好)或其他方法进行比较
问题描述
此代码基于and进行比较condition
,并基于(此代码运行良好)创建:var1
var2
Results1
choices
# from: https://stackoverflow.com/questions/27474921/compare-two-columns-using-pandas?answertab=oldest#tab-top
# from: https://stackoverflow.com/questions/60099141/negation-in-np-select-condition
import pandas as pd
import numpy as np
# Creating one column from two columns. We asume that in every row there is one NaN and one value and that value fills new column.
df = pd.DataFrame({ 'var1': ['a', 'b', 'c',np.nan, np.nan],
'var2': [1, 2, np.nan , 4, np.nan],
'var3': [np.nan , "x", np.nan, "y", "z"],
'var4': [np.nan , 4, np.nan, 5, 6],
'var5': ["a", np.nan , "b", np.nan, "c"],
'var6': [1, np.nan , 2, np.nan, 3]
})
#all conditions that are connected with logical operators (&, |, etc) should be in ().
conditions = [
(df["var1"].notna()) & (df['var2'].notna()),
(pd.isna(df["var1"])) & (pd.isna(df["var2"])),
(df["var1"].notna()) & (pd.isna(df["var2"])),
(pd.isna(df["var1"])) & (df['var2'].notna())]
choices = ["Both values", np.nan, df["var1"], df["var2"]]
df['Result1'] = np.select(conditions, choices, default=np.nan)
df
看起来应该:
| | var1 | var2 | var3 | var4 | var5 | var6 | Result1 |
|---:|:-------|-------:|:-------|-------:|:-------|-------:|:------------|
| 0 | a | 1 | nan | nan | a | 1 | Both values |
| 1 | b | 2 | x | 4 | nan | nan | Both values |
| 2 | c | nan | nan | nan | b | 2 | c |
| 3 | nan | 4 | y | 5 | nan | nan | 4 |
| 4 | nan | nan | z | 6 | c | 3 | nan |
现在我想比较多个 pandas 列(在我的示例中var1
andvar2
之后var3
andvar4
之后var5
and之后var6
)并基于condition
并choices
创建相应Results
的列(在我的示例Result1
中 , Result2
, Result3
)。我认为最好的方法应该是使用矢量化(因为性能更好)。df
我想得到的应该是这样的:
| | var1 | var2 | var3 | var4 | var5 | var6 | Result1 | Result2 | Result3 |
|---:|:-------|-------:|:-------|-------:|:-------|-------:|:------------|:------------|:------------|
| 0 | a | 1 | nan | nan | a | 1 | Both values | nan | Both values |
| 1 | b | 2 | x | 4 | nan | nan | Both values | Both values | nan |
| 2 | c | nan | nan | nan | b | 2 | c | nan | Both values |
| 3 | nan | 4 | y | 5 | nan | nan | 4 | Both values | nan |
| 4 | nan | nan | z | 6 | c | 3 | nan | Both values | Both values |
我试过这个:
import pandas as pd
import numpy as np
# Creating one column from two columns. We asume that in every row there is one NaN and one value and that value fills new column.
df = pd.DataFrame({ 'var1': ['a', 'b', 'c',np.nan, np.nan],
'var2': [1, 2, np.nan , 4, np.nan],
'var3': [np.nan , "x", np.nan, "y", "z"],
'var4': [np.nan , 4, np.nan, 5, 6],
'var5': ["a", np.nan , "b", np.nan, "c"],
'var6': [1, np.nan , 2, np.nan, 3]
})
col1 = ["var1", "var3", "var5"]
col2 = ["var2", "var4", "var6"]
colR = ["Result1", "Result2", "Result3"]
#all conditions that are connected with logical operators (&, |, etc) should be in ().
conditions = [
(df[col1].notna()) & (df[col2].notna()),
(pd.isna(df[col1])) & (pd.isna(df[col2])),
(df[col1].notna()) & (pd.isna(df[col2])),
(pd.isna(df[col1])) & (df[col2].notna())]
choices = ["Both values", np.nan, df[col1], df[col2]]
df[colR] = np.select(conditions, choices, default=np.nan)
买它给了我错误:
ValueError: shape mismatch: objects cannot be broadcast to a single shape
问题:如何通过矢量化(由于性能更好)或其他方法来实现我的目标?
解决方案
问题是pandas
DataFrame 强制索引对齐,但df[col1]
没有df[col2]
重叠列。
在这种情况下,您真的想使用底层的 numpy 数组。也因为.isnull()
是你的对立面notnull
可以简化很多。我们将继续添加新列。
col1 = ["var1", "var3", "var5"]
col2 = ["var2", "var4", "var6"]
colR = ["Result1", "Result2", "Result3"]
s1 = df[col1].isnull().to_numpy()
s2 = df[col2].isnull().to_numpy()
conditions = [~s1 & ~s2, s1 & s2, ~s1 & s2, s1 & ~s2]
choices = ["Both values", np.nan, df[col1], df[col2]]
df = pd.concat([df, pd.DataFrame(np.select(conditions, choices), columns=colR, index=df.index)], axis=1)
var1 var2 var3 var4 var5 var6 Result1 Result2 Result3
0 a 1.0 NaN NaN a 1.0 Both values NaN Both values
1 b 2.0 x 4.0 NaN NaN Both values Both values NaN
2 c NaN NaN NaN b 2.0 c NaN Both values
3 NaN 4.0 y 5.0 NaN NaN 4 Both values NaN
4 NaN NaN z 6.0 c 3.0 NaN Both values Both values
推荐阅读
- mongodb - 如何在 mongodb 查询或通过 mongoose 检索 n 级嵌套类别树
- python -
我的分页器生病了吗?我在做什么错? - python - nginx-gunicorn Flask 应用程序未指向我在 CentoOS 7 上的应用程序目录
- excel - 努力在网页上单击带有 VBA 代码的继续按钮
- python - 使用函数从二维数组转换到不同坐标系的最有效方法,然后对结果孔进行插值
- python - 在 Python 3 中给定两个值,加速分配概率密度
- ruby-on-rails - 从不同的 Active Record 模型/表中检索一条数据
- flutter - FormBuilderCheckboxList 内部值已重置,但如果使用 INITIAL VALUE,复选框仍处于选中状态(未清除)
- sql-server-data-tools - 如何在 DACPAC 部署中仅部署一组选定的存储过程?
- scala - Scala & json4s:如何过滤 json 数组