Confused about how to use label-encoded values together with the original values

Problem description

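Hi, I am working on an ML project where the dataset contains both numeric and string values. I used sklearn's LabelEncoder() to convert the string values to numbers successfully, but I can't get all of the required values into the "X" and "y" variables. Here is my code: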

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
data = pd.read_csv('data-set.csv')

num_val = preprocessing.LabelEncoder()
gender = num_val.fit_transform(list(data['gender']))
ever_married = num_val.fit_transform(list(data['ever_married']))
work_type = num_val.fit_transform(list(data['work_type']))
Residence_type = num_val.fit_transform(list(data['Residence_type']))
smoking_status = num_val.fit_transform(list(data['smoking_status']))

predict = "stroke"

X = list(zip(gender,ever_married,work_type,Residence_type,smoking_status))
y = data['stroke']

X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.1)

model = SVC()

model.fit(X_train, y_train)

pred = model.predict(X_test)

acc = accuracy_score(y_test, pred)
print(acc)

The dataset I am using is here.

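How can I put all of the values into the "X" variable, combining the encoded columns with the other columns of the dataset that were left unchanged? Please help.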

Tags: python, pandas, machine-learning, scikit-learn, deep-learning

Solution


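Use pandas apply with a function (transform in the example below) that contains the same encoding code, and run it on the list of columns you want to convert, directly on the original DataFrame (data). Next, drop the target column (stroke in this particular dataset) from the DataFrame to create the X variable. You must also fill the NaN values in bmi with something relevant to your analysis, otherwise the fit function will raise a ValueError.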

...
data = pd.read_csv('healthcare-dataset-stroke-data.csv')
print(data.head())

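def transform(series):
    # Fit a fresh LabelEncoder per column so each column gets its own mapping
    num_val = preprocessing.LabelEncoder()
    np_array = num_val.fit_transform(list(series))
    # Keep the original index so the result aligns with data on assignment
    return pd.Series(np_array, index=series.index)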

t_list = ["gender","ever_married","work_type","Residence_type","smoking_status"]

data[t_list] = data[t_list].apply(transform)
print(data.head())

predict = "stroke"

X = data.drop(columns=['stroke'])
# fill "bmi" NaN values with something relevant to your analysis
X = X.fillna(X.median())
y = data['stroke']

X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.1)
...
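
For anyone who wants to try the approach without downloading the CSV, here is a minimal end-to-end sketch on a small made-up DataFrame. The toy rows, the cat_cols name, and the random_state and stratify arguments are assumptions added for illustration; they are not part of the original answer or the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical toy rows standing in for the stroke dataset (values invented)
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Female", "Male", "Female", "Male"],
    "age": [67.0, 61.0, 80.0, 49.0, 79.0, 35.0, 52.0, 44.0],
    "smoking_status": ["smokes", "never smoked", "never smoked", "smokes",
                       "never smoked", "smokes", "never smoked", "smokes"],
    "bmi": [36.6, np.nan, 32.5, 34.4, 24.0, 28.1, np.nan, 30.2],
    "stroke": [1, 1, 1, 0, 1, 0, 0, 0],
})

def transform(series):
    # One encoder per column; return a Series aligned with the input index
    encoder = preprocessing.LabelEncoder()
    return pd.Series(encoder.fit_transform(series), index=series.index)

# Encode only the string columns; numeric columns stay untouched
cat_cols = ["gender", "smoking_status"]
toy[cat_cols] = toy[cat_cols].apply(transform)

# Features are every column except the target
X = toy.drop(columns=["stroke"])
# Median imputation is just one possible choice for the missing bmi values
X = X.fillna(X.median(numeric_only=True))
y = toy["stroke"]

# stratify keeps both classes in the tiny training split (an added assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = SVC()
model.fit(X_train, y_train)
pred = model.predict(X_test)
acc = accuracy_score(y_test, pred)
print(acc)
```

With only eight rows the accuracy value means nothing; the point is that X now holds the encoded and the untouched numeric columns side by side.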

Original DataFrame

      id  gender   age  ...     work_type Residence_type  avg_glucose_level   bmi   smoking_status  stroke
0   9046    Male  67.0  ...       Private          Urban             228.69  36.6  formerly smoked       1
1  51676  Female  61.0  ... Self-employed          Rural             202.21   NaN     never smoked       1
2  31112    Male  80.0  ...       Private          Rural             105.92  32.5     never smoked       1
3  60182  Female  49.0  ...       Private          Urban             171.23  34.4           smokes       1
4   1665  Female  79.0  ... Self-employed          Rural             174.12  24.0     never smoked       1

Transformed DataFrame

      id  gender   age  ... work_type  Residence_type  avg_glucose_level   bmi  smoking_status  stroke
0   9046       1  67.0  ...         2               1             228.69  36.6               1       1
1  51676       0  61.0  ...         3               0             202.21   NaN               2       1
2  31112       1  80.0  ...         2               0             105.92  32.5               2       1
3  60182       0  49.0  ...         2               1             171.23  34.4               3       1
4   1665       0  79.0  ...         3               0             174.12  24.0               2       1
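
To see which integer stands for which original label, it can help to keep the fitted encoder around. LabelEncoder stores the labels in sorted order in its classes_ attribute, and the position of each label is its code. The sketch below uses a hypothetical sample of smoking_status values; the "Unknown" entry is an assumption, chosen because it would explain why "formerly smoked" gets code 1 rather than 0 in the table above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of one categorical column
smoking = pd.Series(
    ["formerly smoked", "never smoked", "never smoked", "smokes", "Unknown"]
)

encoder = LabelEncoder()
codes = encoder.fit_transform(smoking)

# classes_ is sorted; the index of each label is the integer it was mapped to
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Because transform in the answer creates a new encoder on every call and discards it, you would need to store each column's encoder (for example in a dict keyed by column name) to decode values later.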
