pandas - Creating a derived variable when a specific order of operations is required
Problem description
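I am working with a typical dataset (rows are observations, columns are variables).
I need to create a new variable from two original variables in the dataset. The logic has to respect the correct order of operations, i.e. if (a = 1 and b >= 10) or (a = 2 and b >= 20), and so on. I can do this easily in SAS (posted below), but I am translating some of my work to Python. My attempt is listed here. I also don't know how to handle NaN in the logic: if either original variable is NaN, the new variable should be NaN as well. Thanks for your help.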
def OLDER4GRADE (row) :
    if (row['H1GI20'] == 7 and row['AGE'] >= 14)
    or (row['H1GI20'] == 8 and row['AGE'] >= 15)
    or (row['H1GI20'] == 9 and row['AGE'] >= 16)
    or (row['H1GI20'] == 10 and row['AGE'] >= 17)
    or (row['H1GI20'] == 11 and row['AGE'] >= 18)
    or (row['H1GI20'] == 12 and row['AGE'] >= 19:
        return 1
    else :
        return 0
data['OLDER4GRADE'] = data.apply(lambda row: OLDER4GRADE (row), axis = 1)
Here is what it looks like in SAS:
if H1GI20 EQ . or AGE1 eq . then OLDER4GRADE=.;
else if (H1GI20=7 and AGE1 GE 14) or (H1GI20=8 and AGE1 GE 15) or (H1GI20=9 and AGE1 GE 16) or
(H1GI20=10 and AGE1 GE 17) or (H1GI20=11 and AGE1 GE 18) or (H1GI20=12 and AGE1 GE 19)
then OLDER4GRADE=1;
else OLDER4GRADE=0;
Solution
Let's first fix your code:
import numpy as np

def OLDER4GRADE(row):
    # Handle NaN first, as the SAS code does:
    # if either input is missing, the result is missing too.
    if np.isnan(row['H1GI20']) or np.isnan(row['AGE']):
        return np.nan
    # A condition that spans several lines must be wrapped in one
    # outer pair of parentheses, and every inner pair must be closed.
    if ((row['H1GI20'] == 7 and row['AGE'] >= 14)
            or (row['H1GI20'] == 8 and row['AGE'] >= 15)
            or (row['H1GI20'] == 9 and row['AGE'] >= 16)
            or (row['H1GI20'] == 10 and row['AGE'] >= 17)
            or (row['H1GI20'] == 11 and row['AGE'] >= 18)
            or (row['H1GI20'] == 12 and row['AGE'] >= 19)):
        return 1
    else:
        return 0

# Passing the function directly is enough; no `lambda` is needed.
data['OLDER4GRADE'] = data.apply(OLDER4GRADE, axis=1)
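As a quick sanity check of the row-wise approach, here is a minimal sketch on a made-up frame. The sample values and the names `older4grade` and `sample` are assumptions for illustration, not your data; `pd.isna` is used here in place of `np.isnan` only because it also tolerates non-float values.

```python
import numpy as np
import pandas as pd

def older4grade(row):
    # Missing input -> missing output, mirroring the SAS check.
    if pd.isna(row['H1GI20']) or pd.isna(row['AGE']):
        return np.nan
    # One outer pair of parentheses lets the condition span lines.
    if ((row['H1GI20'] == 7 and row['AGE'] >= 14)
            or (row['H1GI20'] == 8 and row['AGE'] >= 15)
            or (row['H1GI20'] == 9 and row['AGE'] >= 16)
            or (row['H1GI20'] == 10 and row['AGE'] >= 17)
            or (row['H1GI20'] == 11 and row['AGE'] >= 18)
            or (row['H1GI20'] == 12 and row['AGE'] >= 19)):
        return 1
    return 0

# Hypothetical sample: a hit, a miss, another hit, and both NaN cases.
sample = pd.DataFrame({
    'H1GI20': [7, 7, 12, np.nan, 9],
    'AGE':    [14, 13, 19, 15, np.nan],
})
sample['OLDER4GRADE'] = sample.apply(older4grade, axis=1)
print(sample)
```

The first three rows should come out as 1, 0, 1 and the last two as NaN, since each of them has one missing input.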
Now, with pandas it is recommended to avoid `apply` along rows whenever possible. Your logic can be translated to:
# rows with `nan` in either column
invalid = data[['H1GI20', 'AGE']].isna().any(axis=1)
# the threshold for each category
thresholds = {
    7: 14,
    8: 15,
    9: 16,
    10: 17,
    11: 18,
    12: 19
}
# use `map` to turn `H1GI20` into its respective threshold,
# then check that `AGE` is at or above that threshold
above_thresh = data['AGE'] >= data['H1GI20'].map(thresholds)
data['OLDER4GRADE'] = np.where(invalid, np.nan, above_thresh.astype(int))
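To see the vectorized logic end to end, here is a minimal self-contained sketch. The sample frame is made up for illustration (it is an assumption, not your data), and it includes a grade of 6 to show what happens to a category that has no threshold. The comparison is written as `AGE >= threshold`, matching the `AGE1 GE ...` conditions in the SAS code.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; grade 6 has no entry in `thresholds`.
data = pd.DataFrame({
    'H1GI20': [7, 7, 12, 6, np.nan, 9],
    'AGE':    [14, 13, 20, 30, 15, np.nan],
})

thresholds = {7: 14, 8: 15, 9: 16, 10: 17, 11: 18, 12: 19}

# Rows with NaN in either input column.
invalid = data[['H1GI20', 'AGE']].isna().any(axis=1)

# Unmapped grades become NaN, and any comparison with NaN is False,
# so those rows fall through to 0, like the final `else` in SAS.
above_thresh = data['AGE'] >= data['H1GI20'].map(thresholds)

data['OLDER4GRADE'] = np.where(invalid, np.nan, above_thresh.astype(int))
print(data)
```

Because the result has to hold NaN, the new column ends up as floats (1.0, 0.0, NaN) rather than integers.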