Error in code: "could not convert string to float"

Question

I am learning linear regression from this GitHub link: "https://github.com/Anubhav1107/Machine_Learning_A-Z/blob/master/Part%202%20-%20Regression/Section%205%20-%20Multiple%20Linear%20Regression/multiple_linear_regression.py"

But when I try to run it, this happens:

ValueError                                Traceback (most recent call last)
<ipython-input-26-860be404cdc9> in <module>()
      1 sc_y = StandardScaler()
----> 2 y_train = sc_y.fit_transform(y_train)

4 frames
/usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    536 
    537     """
--> 538     return array(a, dtype, copy=False, order=order)
    539 
    540 

ValueError: could not convert string to float: 'Florida'

I am running it on Google Colab, and I have already encoded the categorical feature, so I don't understand what the problem is.

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()


# Splitting the dataset into the Training set and Test set

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)

Tags: python, machine-learning, scikit-learn

Answer


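There is a reason why, in How to create a Minimal, Reproducible Example, we ask you to:

Make sure all information necessary to reproduce the problem is included in the question itself

and not in some external file, parts of which you may or may not have executed correctly.

I say this because I cannot reproduce your error; executing the relevant part of the linked code works fine here: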

import numpy as np
import pandas as pd
import sklearn
sklearn.__version__
# '0.21.3'

# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()

# Avoiding the Dummy Variable Trap
X = X[:, 1:]

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # model_selection here, due to newer version of scikit_learn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# FutureWarning here, irrelevant to the issue
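An aside on the FutureWarning noted above: the `categorical_features` argument of `OneHotEncoder` was deprecated in scikit-learn 0.20 and removed in 0.22, so the encoding step as written only runs on older releases such as the 0.21.3 used here. On newer versions, one possible equivalent is a `ColumnTransformer`. The following is a minimal sketch under that assumption; the `demo_X` rows and the variable names are made up for illustration and are not taken from 50_Startups.csv.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows shaped like the features of the dataset:
# three numeric columns followed by a state name in column 3.
demo_X = np.array(
    [
        [1000.0, 200.0, 300.0, "New York"],
        [1100.0, 210.0, 310.0, "California"],
        [1200.0, 220.0, 320.0, "Florida"],
        [1300.0, 230.0, 330.0, "New York"],
    ],
    dtype=object,
)

# One-hot encode column 3 and pass the numeric columns through unchanged.
# sparse_threshold=0 asks ColumnTransformer for a dense result.
encoder = ColumnTransformer(
    transformers=[("state", OneHotEncoder(), [3])],
    remainder="passthrough",
    sparse_threshold=0,
)
demo_encoded = np.asarray(encoder.fit_transform(demo_X), dtype=float)

# The one-hot columns come first (one per state, in sorted order),
# followed by the three numeric columns.
print(demo_encoded.shape)
```

No separate `LabelEncoder` step is needed in this form, because `OneHotEncoder` accepts string categories directly. The rest of this answer continues with the 0.21.3 code as posted.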

At this stage, we have:

y_train
# result:
array([ 96778.92,  96479.51, 105733.54,  96712.8 , 124266.9 , 155752.6 ,
       132602.65,  64926.08,  35673.41, 101004.64, 129917.04,  99937.59,
        97427.84, 126992.93,  71498.49, 118474.03,  69758.98, 152211.77,
       134307.35, 107404.34, 156991.12, 125370.37,  78239.91,  14681.4 ,
       191792.06, 141585.52,  89949.14, 108552.04, 156122.51, 108733.99,
        90708.19, 111313.02, 122776.86, 149759.96,  81005.76,  49490.75,
       182901.99, 192261.83,  42559.73,  65200.33])

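I would bet that this is not the case with your (not shown) full code.

Slightly modifying the last line below to use y_train.reshape(-1,1) (again, irrelevant to the issue; without it we get a different error, asking for exactly this reshape), we have: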

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.reshape(-1,1))  # reshape here

This works fine, giving:

y_train
# result
array([[-0.31304376],
       [-0.32044287],
       [-0.09175449],
       [-0.31467774],
       [ 0.3662475 ],
       [ 1.14433163],
       [ 0.57224308],
       [-1.10020076],
       [-1.82310158],
       [-0.20861649],
       [ 0.50587547],
       [-0.23498575],
       [-0.29700745],
       [ 0.43361398],
       [-0.93778138],
       [ 0.22309235],
       [-0.98076868],
       [ 1.05682957],
       [ 0.61437014],
       [-0.05046517],
       [ 1.17493831],
       [ 0.39351679],
       [-0.77118537],
       [-2.34186247],
       [ 2.03494965],
       [ 0.79423047],
       [-0.48182335],
       [-0.02210286],
       [ 1.15347296],
       [-0.01760646],
       [-0.46306547],
       [ 0.04612731],
       [ 0.32942519],
       [ 0.9962397 ],
       [-0.70283485],
       [-1.4816433 ],
       [ 1.81525556],
       [ 2.04655875],
       [-1.65292476],
       [-1.09342341]])
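The reshape matters because `StandardScaler` works on 2-D input of shape (n_samples, n_features), while a single target column selected with `.values` is 1-D. A minimal sketch of the shapes involved, using hypothetical values in place of the real `y_train`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical 1-D target values, standing in for y_train.
demo_y = np.array([10.0, 20.0, 30.0, 40.0])

# Turn the 1-D target into a single column: shape (4,) becomes (4, 1).
demo_y_2d = demo_y.reshape(-1, 1)
print(demo_y.shape, demo_y_2d.shape)

scaler = StandardScaler()
demo_y_scaled = scaler.fit_transform(demo_y_2d)

# After scaling, the column has mean 0 and unit (population) standard deviation.
print(demo_y_scaled.mean(), demo_y_scaled.std())

# ravel() flattens the column back to 1-D if a later step expects that.
demo_y_flat = demo_y_scaled.ravel()
print(demo_y_flat.shape)
```

This is why the scaled `y_train` above is printed as a column of one-element rows rather than as a flat array.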

Apparently, instead of y = dataset.iloc[:, 4].values, you are asking for y = dataset.iloc[:, 3].values, which gives:

dataset.iloc[:, 3].values
# result:
array(['New York', 'California', 'Florida', 'New York', 'Florida',
       'New York', 'California', 'Florida', 'New York', 'California',
       'Florida', 'California', 'Florida', 'California', 'Florida',
       'New York', 'California', 'New York', 'Florida', 'New York',
       'California', 'New York', 'Florida', 'Florida', 'New York',
       'California', 'Florida', 'New York', 'Florida', 'New York',
       'Florida', 'New York', 'California', 'Florida', 'California',
       'New York', 'Florida', 'California', 'New York', 'California',
       'California', 'Florida', 'California', 'New York', 'California',
       'New York', 'Florida', 'California', 'New York', 'California'],
      dtype=object)

With this change, the code above indeed gives:

y_train
# result:
array(['Florida', 'New York', 'Florida', 'California', 'Florida',
       'Florida', 'Florida', 'New York', 'New York', 'New York',
       'New York', 'Florida', 'California', 'California', 'California',
       'California', 'New York', 'New York', 'California', 'California',
       'New York', 'New York', 'California', 'California', 'California',
       'Florida', 'California', 'New York', 'California', 'Florida',
       'Florida', 'New York', 'New York', 'California', 'California',
       'Florida', 'New York', 'New York', 'California', 'California'],
      dtype=object)

and, finally:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-4a9512e0c95c> in <module>
      5 X_test = sc_X.transform(X_test)
      6 sc_y = StandardScaler()
----> 7 y_train = sc_y.fit_transform(y_train.reshape(-1,1))

[...]
ValueError: could not convert string to float: 'Florida'
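A quick way to catch this kind of slip before scaling is to check that the target column is numeric, and to select it by name rather than by position. The sketch below is only an illustration: the rows are made up, and the column names are an assumption about the layout of 50_Startups.csv (text in column 3, the numeric target in column 4).

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical rows; the column names are assumed, not read from the real file.
demo = pd.DataFrame(
    {
        "R&D Spend": [1000.0, 1100.0, 1200.0],
        "Administration": [200.0, 210.0, 220.0],
        "Marketing Spend": [300.0, 310.0, 320.0],
        "State": ["New York", "California", "Florida"],
        "Profit": [5000.0, 5100.0, 5200.0],
    }
)

# Selecting by position: index 3 is the text column, index 4 the numeric one.
target_by_position_3 = demo.iloc[:, 3]
target_by_position_4 = demo.iloc[:, 4]

print(is_numeric_dtype(target_by_position_3))  # False
print(is_numeric_dtype(target_by_position_4))  # True

# Selecting by name avoids off-by-one column indices, and the assertion
# fails early with a clear message if the target is not numeric.
assert is_numeric_dtype(demo["Profit"]), "target column must be numeric before scaling"
y = demo["Profit"].to_numpy()
```

Checking `dataset.dtypes` on the real data right after `pd.read_csv` gives the same information for every column at once.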
