How do I build a classifier from encoded categorical features?

Problem description

I am working with a data frame, part of which looks like this:

age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
25, Private,226802, 11th,7, Never-married, Machine-op-inspct, Own-child, Black, Male,0,0,40, United-States, <=50K
38, Private,89814, HS-grad,9, Married-civ-spouse, Farming-fishing, Husband, White, Male,0,0,50, United-States, <=50K
28, Local-gov,336951, Assoc-acdm,12, Married-civ-spouse, Protective-serv, Husband, White, Male,0,0,40, United-States, >50K
44, Private,160323, Some-college,10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male,7688,0,40, United-States, >50K
18, ?,103497, Some-college,10, Never-married, ?, Own-child, White, Female,0,0,30, United-States, <=50K
34, Private,198693, 10th,6, Never-married, Other-service, Not-in-family, White, Male,0,0,30, United-States, <=50K
29, ?,227026, HS-grad,9, Never-married, ?, Unmarried, Black, Male,0,0,40, United-States, <=50K

After dropping the rows that contain "?", I encode the values in the data frame:

import numpy as np
import pandas as pd

cat = [
    'workclass', 'education', 'marital-status', 'occupation', 'relationship',
    'race', 'sex', 'native-country', 'class'
]

# Encode the sex column as 0/1; strip() guards against the leading space
# that the raw values carry (e.g. ' Female')
df["Value"] = np.where(df["sex"].str.strip() == 'Female', 0, 1)

# One-hot encode the categorical columns; each new column is named <col>_<value>
data = pd.get_dummies(df, columns=cat, prefix=cat)

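Now I have a data frame that is ready for logistic regression, so I can classify sex based on the other features. I want to do this step by step, though. For example, I first plan to build a classifier for 'sex' based only on 'workclass'. But workclass has been encoded into several new columns whose full names I don't know, so how do I make a logistic regression model classify sex using only the workclass-encoded columns? And then using combinations of other features? Also, how do I find the best classifier?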

Thanks

Tags: python, pandas

Solution


Pandas adds a prefix to every dummy column it creates. Based on that, you can build X and y and change the column name at each step:

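# Match on the prefix plus underscore so that, e.g., 'education' does not also pick up 'education-num'
X = data[[c for c in data.columns if c.startswith('workclass_')]]  # change 'workclass' here
# The raw values keep their leading space, hence 'sex_ Male'
y = data['sex_ Male']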
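
To go from selecting columns to comparing feature combinations, one possible approach is sketched below. It rests on several assumptions that are not in the original post: scikit-learn is installed, the helper names dummy_columns and score_features are hypothetical, the small sample frame is made up purely so the snippet runs, and mean cross-validated accuracy is used as just one reasonable criterion for the "best" classifier.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def dummy_columns(data, features):
    """Return the dummy columns that get_dummies created for the given original features."""
    # Hypothetical helper: matches '<feature>_' so 'education' skips 'education-num'.
    prefixes = tuple(f"{name}_" for name in features)
    return [col for col in data.columns if col.startswith(prefixes)]


def score_features(data, features, target):
    """Mean cross-validated accuracy of a logistic regression on the chosen features."""
    # Hypothetical helper: accuracy is an assumed metric, not one named in the post.
    X = data[dummy_columns(data, features)]
    y = data[target]
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()


# Made-up sample data, only so the sketch runs; substitute the real data frame.
df = pd.DataFrame({
    "workclass": ["Private", "Local-gov", "Private", "Self-emp", "Private", "Local-gov"] * 5,
    "education": ["11th", "HS-grad", "HS-grad", "11th", "Bachelors", "Bachelors"] * 5,
    "sex": ["Male", "Female", "Male", "Male", "Female", "Female"] * 5,
})
data = pd.get_dummies(df, columns=["workclass", "education"])
data["is_male"] = (df["sex"] == "Male").astype(int)

# Score each candidate feature set and keep the one with the highest mean accuracy.
candidates = [["workclass"], ["education"], ["workclass", "education"]]
scores = {tuple(f): score_features(data, f, "is_male") for f in candidates}
best = max(scores, key=scores.get)
print(scores)
print("best feature set:", best)
```

Because the helper only picks up columns for the features you name, the dummy columns derived from the target itself (such as 'sex_ Male' or 'Value' in the question's frame) stay out of X as long as 'sex' is not in the feature list. With many features, itertools.combinations could generate the candidate list, though the number of combinations grows quickly.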
