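python - How to prevent the label encoder from adding the y column to the X numpy.array
Question
I have the code below, but the last line, which applies the label encoder,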
X = MultiColumnLabelEncoder(columns = ['newlyConst','balcony', 'cellar', 'lift', 'garden', ]).fit_transform(df)
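adds the y column (the rent) to the X numpy.array.
I'm not sure how to specify the columns to encode in some other way that avoids this, for example by passing the X np array and specific columns instead of df, because when I try that I get an index error.
Any help would be great, thanks!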
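A minimal sketch of one way to keep the target out of X, assuming the cause is that fit_transform receives the whole df (which still contains totalRent) and its result overwrites the earlier X. The small DataFrame below is hypothetical and only mirrors a few of the real columns; the names target, features and encoded are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of the data set; values are illustrative only.
df = pd.DataFrame({
    "regio1": ["Sachsen", "Bremen", "Sachsen"],
    "newlyConst": [False, True, False],
    "balcony": [True, True, False],
    "livingSpace": [62.0, 84.97, 40.2],
    "totalRent": [380.0, 903.0, 307.0],
})

target = "totalRent"
features = df.drop(columns=[target])  # every column except the target
y = df[target].to_numpy()

# Encode only the listed feature columns, working on a copy of the features.
encoded = features.copy()
for col in ["newlyConst", "balcony"]:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

X = encoded.to_numpy()
print(X.shape)  # 3 rows, 4 feature columns; totalRent is not among them
```

Because the encoder only ever sees features, the target column cannot end up in X, and the columns can still be addressed by name rather than by position.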
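Update: I replaced the long label encoder with a more elegant solution, as suggested by @Corralien. More in-depth information can be found under "Converting Pandas Types".
Replacement: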
df = df.astype({"newlyConst" :int, "balcony" : int, "cellar" : int, "lift" : int, "garden":int})
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import linear_model
df = pd.read_csv('immo_data.csv')
df.drop(columns=['telekomTvOffer', 'telekomHybridUploadSpeed', 'pricetrend',
'telekomUploadSpeed', 'scoutId', 'noParkSpaces', 'yearConstructedRange',
'houseNumber', 'interiorQual', 'petsAllowed', 'street', 'streetPlain', 'baseRentRange',
'geo_plz','geo_bln', 'geo_krs','thermalChar', 'floor','numberOfFloors', 'noRoomsRange', 'livingSpaceRange',
'regio3', 'description', 'facilities', 'hasKitchen','heatingCosts', 'energyEfficiencyClass',
'lastRefurbish', 'electricityBasePrice', 'electricityKwhPrice','date','condition', 'typeOfFlat','serviceCharge'
,'heatingType','firingTypes', 'yearConstructed'], axis=1, inplace = True)
df_head=df.head(250)
df_nan_count=df.isna().sum()
#With 'firingTypes', 'yearConstructed', 'condition', 'typeOfFlat' number of NaN values exceeding 40-50%, those will be dropped
df.dropna(inplace=True)
df3=df.count()
df=df[['regio1', 'newlyConst', 'balcony', 'picturecount', 'cellar', 'livingSpace',
'lift','noRooms', 'garden', 'baseRent', 'totalRent']]
dfcount = df.nunique()
##Regression
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
#Encoding Categorical Data
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
le = LabelEncoder()
class MultiColumnLabelEncoder:
    def __init__(self, columns=None):
        self.columns = columns  # array of column names to encode

    def fit(self, X, y=None):
        return self  # not relevant here

    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            # DataFrame.iteritems() was removed in pandas 2.0; items() is the replacement
            for colname, col in output.items():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
X = MultiColumnLabelEncoder(columns = ['newlyConst','balcony', 'cellar', 'lift', 'garden', ]).fit_transform(df)
# Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
df
regio1 newlyConst balcony picturecount cellar livingSpace lift noRooms garden totalRent
0 Nordrhein_Westfalen 0 0 6 1 86.00 0 4.0 1 840.00
2 Sachsen 1 1 8 1 83.80 1 3.0 0 1300.00
4 Bremen 0 1 19 0 84.97 0 3.0 0 903.00
6 Sachsen 0 0 9 1 62.00 0 2.0 1 380.00
7 Bremen 0 1 5 1 60.30 0 3.0 0 584.25
8 Baden_Württemberg 0 0 5 1 53.00 0 2.0 0 690.00
10 Sachsen 0 1 11 1 40.20 0 2.0 0 307.00
11 Sachsen 0 0 9 1 80.00 0 3.0 1 555.00
12 Rheinland_Pfalz 0 0 4 0 100.00 0 4.0 1 920.00
13 Nordrhein_Westfalen 0 0 3 0 123.44 0 4.0 0 1150.00
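For the one-hot step in the code above, a hedged sketch that selects the region column by name on a feature-only DataFrame instead of by position [0] on an array. The data is hypothetical, and sparse_threshold=0 is an addition not present in the original code; it asks ColumnTransformer to return a dense array so the result can be used directly as a NumPy matrix.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the data set; values are illustrative only.
df = pd.DataFrame({
    "regio1": ["Sachsen", "Bremen", "Sachsen"],
    "newlyConst": [0, 1, 0],
    "livingSpace": [62.0, 84.97, 40.2],
    "totalRent": [380.0, 903.0, 307.0],
})

features = df.drop(columns=["totalRent"])  # the target stays out of X
y = df["totalRent"].to_numpy()

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), ["regio1"])],
    remainder="passthrough",   # keep the other feature columns unchanged
    sparse_threshold=0,        # always return a dense array
)
X = np.asarray(ct.fit_transform(features))
print(X.shape)  # one column per region, plus the two passthrough columns
```

The one-hot columns come first, in sorted category order, followed by the passthrough columns in their original order.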
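Answer
The answer is to drop the label encoder and instead use the code below to convert the True/False values to integers. More information can be found under "Changing Pandas Data Types".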
df = df.astype({"newlyConst" :int, "balcony" : int, "cellar" : int, "lift" : int, "garden":int})
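As a small illustration of how this fits with the X/y split from the question, here is a sketch on a hypothetical three-row DataFrame (the values and the bool_cols name are illustrative). It assumes the boolean columns contain no missing values, which holds in the question's code because dropna is called first.

```python
import pandas as pd

# Hypothetical miniature of the data set; values are illustrative only.
df = pd.DataFrame({
    "newlyConst": [False, True, False],
    "balcony": [False, True, True],
    "livingSpace": [86.0, 83.8, 84.97],
    "totalRent": [840.0, 1300.0, 903.0],
})

# Cast the True/False columns to 0/1 in a single call.
bool_cols = ["newlyConst", "balcony"]
df = df.astype({col: int for col in bool_cols})

# Split afterwards, so X is built from the converted frame without the target.
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
print(X.shape, y.shape)
```

Doing the conversion before the split means X never has to be rebuilt from df, so the last column stays only in y.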