FitFailedWarning in Simple Linear Regression scoring with cross_val_score

Problem description

I'm using a very simple CSV file that I downloaded from the Internet, with only two columns. The first column is "MonthsExperience", with values like "3, 3, 4, 4, 5, 6...", and the second column, "Income", has values like "424, 387, 555, 59, 533...".

I'm trying to compute the cross_val_score of a RandomForestRegressor model on this simple single-feature regression problem, as practice.

Here's the code:

import numpy as np
import pandas as pd

data = pd.read_csv("Blogging_Income.csv")

X = data["MonthsExperience"]
y = data["Income"]

from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor()

from sklearn.model_selection import cross_val_score

cv_r2 = cross_val_score(rfr, X, y, cv = 5, scoring = None)
print(cv_r2)

I get a long warning from sklearn, indicating that all the scores were set to NaN because the model couldn't be fit. The upper part of the warning/error looks like this:

[nan nan nan nan nan]
C:\Users\----\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py:615: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "C:\Users\----\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\----\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py", line 304, in fit
    X, y = self._validate_data(X, y, multi_output=True,
  File "C:\Users\----\anaconda3\lib\site-packages\sklearn\base.py", line 433, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "C:\Users\----\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\----\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 871, in check_X_y
    X = check_array(X, accept_sparse=accept_sparse,
  File "C:\Users\----\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\----\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 694, in check_array
    raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[ 6.  6.  7.  8.  8.  9.  9. 10. 11. 11. 12. 12. 12. 13. 13. 14. 14. 15.
 15. 16. 16. 17. 18. 18.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

It looks like the array has the wrong shape, but I don't understand why. I also don't understand how I could use array.reshape to make this work.

Tags: python, machine-learning, scikit-learn, regression, cross-validation

Solution


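RandomForestRegressor, like other scikit-learn estimators, expects the feature matrix X to be 2D, with shape (n_samples, n_features). Selecting a single column with data["MonthsExperience"] returns a 1D pandas Series, which is why the fit fails. Even if you have just one feature, your X has to be N x 1 rather than a vector of length N.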

You can reshape your data using NumPy; the -1 tells NumPy to infer the number of rows, and the 1 gives a single feature column:

X = np.array(X).reshape(-1, 1)
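
For reference, here is a minimal end-to-end sketch of the fix. The original Blogging_Income.csv is not available here, so the DataFrame below is a synthetic stand-in with made-up values that only mimics the two columns described in the question. The n_estimators and random_state settings are likewise just choices made for a quick, repeatable run. It also shows an alternative that should work equally well: selecting the column with double brackets, which yields a one-column DataFrame that is already 2D.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for Blogging_Income.csv (values are invented for illustration).
rng = np.random.default_rng(0)
months = np.repeat(np.arange(3, 19), 2)  # 3, 3, 4, 4, 5, 5, ...
income = 30.0 * months + rng.normal(0, 50, months.size)
data = pd.DataFrame({"MonthsExperience": months, "Income": income})

y = data["Income"]  # the target may stay 1D

# Option 1: double brackets select a one-column DataFrame, which is already 2D.
X_frame = data[["MonthsExperience"]]

# Option 2: convert the 1D Series to NumPy and reshape to (n_samples, 1).
X_array = data["MonthsExperience"].to_numpy().reshape(-1, 1)

# Compare the 1D shape that triggers the error with the two 2D alternatives.
print(data["MonthsExperience"].shape, X_frame.shape, X_array.shape)

rfr = RandomForestRegressor(n_estimators=50, random_state=0)

# With scoring left at its default (None), the estimator's own score method is used,
# which is R^2 for regressors.
cv_r2 = cross_val_score(rfr, X_frame, y, cv=5)
print(cv_r2)
```

Either X_frame or X_array can be passed to cross_val_score. Only X needs to be 2D; y can remain a 1D Series.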
