High train/test split score but low CV score in Python scikit-learn

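Problem description

I am new to data science and have been struggling with a Kaggle problem. When I use a random forest regressor to predict app ratings, the score from a train/test split is high, but the cross-validation (CV) score is low.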

https://www.kaggle.com/data13/machine-learning-model-to-predict-app-rating-94

import time
import datetime
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib as mpl
import numpy as np
import seaborn as sns
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
from sklearn.metrics import r2_score
import statsmodels.api as sm
import sklearn.model_selection as ms
from sklearn import neighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.cluster import KMeans
from sklearn.neighbors import KDTree
from sklearn import svm
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV


from xgboost import XGBClassifier
from xgboost import XGBRegressor 
from lightgbm import LGBMClassifier


database = pd.read_csv(r"C:\Users\Anson\Downloads\49864_274957_bundle_archive\googleplaystore.csv")  # load the Google Play Store apps dataset



## Size - Strip the M and k value 
database['Size'] = database['Size'].apply(lambda x : x.strip('M'))
database['Size'] = database['Size'].apply(lambda x : x.strip('k'))
##

## Rating - Fill the Blank Value with median
database['Rating'].fillna(database['Rating'].median(),inplace=True)
database['Rating'].replace(19,database['Rating'].median(),inplace=True) 

###


## Reviews -  replace the blank cell
database['Reviews'].replace('3.0M',3000000,inplace=True) 
database['Reviews'].replace('0',float("NaN"),inplace=True) 
database.dropna(subset=['Reviews'],inplace=True)
##


## Strip the + value
database['Installs'] = database['Installs'].apply(lambda x : x.strip('+'))
database['Installs'] = database['Installs'].apply(lambda x : x.replace(',',''))
database['Price'] = database['Price'].apply(lambda x : x.strip('$'))
###

## Drop Blank 
database['Content Rating'].fillna("NaN",inplace=True)
database.dropna(subset=['Content Rating'],inplace=True)
##

## Drop Wrong Number 
database['Last Updated'].replace('1.0.19',float("NaN"),inplace=True) 
database.dropna(subset=['Last Updated'],inplace=True)
database['Last Updated'] = database['Last Updated'].apply(lambda x : time.mktime(datetime.datetime.strptime(x, '%B %d, %Y').timetuple()))
##




le = preprocessing.LabelEncoder()
database['App'] = le.fit_transform(database['App'])
database['Category'] = le.fit_transform(database['Category'])
database['Content Rating'] = le.fit_transform(database['Content Rating'])
database['Type'] = le.fit_transform(database['Type'])
database['Genres'] = le.fit_transform(database['Genres'])




###############################
##feature engineering

features = ['App', 'Reviews', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated']

X=database[features]
y=database['Rating']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=None)


rfc= RandomForestRegressor()


rfc.fit(X_train,y_train)
rfc.fit(X,y)

rfc_score=rfc.score(X_test,y_test)
rfc_score1=rfc.score(X,y)
score_CV_randomforest = cross_val_score(rfc,X,y,cv=KFold(n_splits=5, shuffle=True),scoring='r2')

score_CV_randomforest = score_CV_randomforest.mean()*100


print("with train test split_randomforest", rfc_score)
print("with no train test split_randomforest", rfc_score1)
print("with CV randomforest", score_CV_randomforest, "%")

Tags: python, scikit-learn, virtual-machine, random-forest, cross-validation

Solution


Train/test split: you use an 80:20 ratio, so the model is trained on 80% of the rows and scored on the remaining 20%.

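Cross-validation: the dataset is randomly split into k groups (folds). One fold serves as the test set and the remaining folds form the training set; the model is trained on the training set and scored on the test set. The process repeats until every fold has been used as the test set once. You are using 5-fold cross-validation, so the dataset is split into 5 folds and the model is trained and tested 5 separate times, giving each fold a turn as the test set.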
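The fold-by-fold procedure described above can be written out by hand. The sketch below is only an illustration: it uses synthetic data from make_regression instead of the Play Store dataset, and the sample size, n_estimators and random_state values are arbitrary assumptions chosen to keep it small and repeatable.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the real data (an assumption for illustration only).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Seeding the splitter makes it yield the same 5 folds every time it is used.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, test_idx in kf.split(X):
    # A fresh model per fold, so no model is scored on rows it was trained on.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))
manual_scores = np.array(manual_scores)

# cross_val_score clones the estimator and runs an equivalent loop.
auto_scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X, y, cv=kf, scoring="r2",
)

print("manual fold scores:", manual_scores.round(3))
print("cross_val_score:   ", auto_scores.round(3))
print("mean CV R^2:       ", auto_scores.mean().round(3))
```

Because both the splitter and the forest are seeded here, the hand-written loop and cross_val_score should report matching fold scores; with the unseeded KFold from the question, the folds change on every run.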

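So part of the difference comes from the model being trained and scored on different random samples: random_state=None in train_test_split and an unseeded shuffling KFold produce different splits on every run. That alone rarely explains a large gap, though. In the code as posted, rfc.fit(X, y) runs right after rfc.fit(X_train, y_train), so the model that is scored on X_test has already been trained on those rows, and rfc_score is effectively a training score. The cross-validation score, where each fold is scored by a model that never saw it, is the more trustworthy estimate. Note also that the CV score is printed as a percentage (multiplied by 100), while the other two scores are printed as fractions.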
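To see how much a refit on the full data can change a hold-out score, the sketch below mirrors the two consecutive fit calls in the posted code (rfc.fit(X_train, y_train) followed by rfc.fit(X, y)). It runs on synthetic data rather than the Play Store dataset, and the noise level, sample size and seeds are assumptions picked for illustration, so the exact numbers will not match the question.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic, noisy data (an assumption; not the Play Store dataset).
X, y = make_regression(n_samples=500, n_features=7, noise=50.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

rf = RandomForestRegressor(n_estimators=100, random_state=0)

# Honest hold-out score: the model has only seen the training rows.
rf.fit(X_train, y_train)
honest_score = rf.score(X_test, y_test)

# Refit on everything, as the second fit call in the question does.
# The test rows are now training rows, so this score is optimistic.
rf.fit(X, y)
leaky_score = rf.score(X_test, y_test)

# Cross-validation clones the estimator and refits it on each training fold,
# so every fold is scored by a model that never saw it.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_score = cross_val_score(rf, X, y, cv=cv, scoring="r2").mean()

print("hold-out R^2, fit on train only:", round(honest_score, 3))
print("hold-out R^2, after fit on all: ", round(leaky_score, 3))
print("5-fold CV mean R^2:             ", round(cv_score, 3))
```

On data like this, the score after the second fit should come out clearly higher than both the honest hold-out score and the CV mean. Removing the rfc.fit(X, y) line from the question's code (or scoring before it) is the corresponding fix.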

