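python - High train/test split score but low CV score in Python scikit-learn
Question
I'm new to data science and have been struggling with a Kaggle problem. When I use random forest regression to predict the rating, the score from a train/test split is high, but the score from cross-validation is low.
- with train test split_randomforest 0.8746277302652172
- with no train test split_randomforest 0.8750717943467078
- with CV randomforest 10.713885026374156 %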
https://www.kaggle.com/data13/machine-learning-model-to-predict-app-rating-94
import time
import datetime
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib as mpl
import numpy as np
import seaborn as sns
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
from sklearn.metrics import r2_score
import statsmodels.api as sm
import sklearn.model_selection as ms
from sklearn import neighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.cluster import KMeans
from sklearn.neighbors import KDTree
from sklearn import svm
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from lightgbm import LGBMClassifier
database = pd.read_csv(r"C:\Users\Anson\Downloads\49864_274957_bundle_archive\googleplaystore.csv")  # load the Google Play Store dataset
## Size - Strip the M and k value
database['Size'] = database['Size'].apply(lambda x : x.strip('M'))
database['Size'] = database['Size'].apply(lambda x : x.strip('k'))
##
## Rating - Fill the Blank Value with median
database['Rating'].fillna(database['Rating'].median(),inplace=True)
database['Rating'].replace(19,database['Rating'].median(),inplace=True)
###
## Reviews - replace the blank cell
database['Reviews'].replace('3.0M',3000000,inplace=True)
database['Reviews'].replace('0',float("NaN"),inplace=True)
database.dropna(subset=['Reviews'],inplace=True)
##
## Strip the + value
database['Installs'] = database['Installs'].apply(lambda x : x.strip('+'))
database['Installs'] = database['Installs'].apply(lambda x : x.replace(',',''))
database['Price'] = database['Price'].apply(lambda x : x.strip('$'))
###
## Drop Blank
database['Content Rating'].fillna("NaN",inplace=True)
database.dropna(subset=['Content Rating'],inplace=True)
##
## Drop Wrong Number
database['Last Updated'].replace('1.0.19',float("NaN"),inplace=True)
database.dropna(subset=['Last Updated'],inplace=True)
database['Last Updated'] = database['Last Updated'].apply(lambda x : time.mktime(datetime.datetime.strptime(x, '%B %d, %Y').timetuple()))
##
le = preprocessing.LabelEncoder()
database['App'] = le.fit_transform(database['App'])
database['Category'] = le.fit_transform(database['Category'])
database['Content Rating'] = le.fit_transform(database['Content Rating'])
database['Type'] = le.fit_transform(database['Type'])
database['Genres'] = le.fit_transform(database['Genres'])
###############################
##feature engineering
features = ['App', 'Reviews', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated']
X=database[features]
y=database['Rating']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=None)
rfc= RandomForestRegressor()
rfc.fit(X_train,y_train)
rfc.fit(X,y)
rfc_score=rfc.score(X_test,y_test)
rfc_score1=rfc.score(X,y)
score_CV_randomforest = cross_val_score(rfc,X,y,cv=KFold(n_splits=5, shuffle=True),scoring='r2')
score_CV_randomforest = score_CV_randomforest.mean()*100
print("with train test split_randomforest", rfc_score)
print("with no train test split_randomforest", rfc_score1)
print("with CV randomforest", score_CV_randomforest, "%")
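Answer
Train/test split: you are using an 80:20 ratio for training and testing.
Cross-validation: the dataset is randomly split into k groups. One group is used as the test set and the remaining groups as the training set. The model is trained on the training set and scored on the test set. The process is then repeated until every group has been used as the test set once. You are using 5-fold cross-validation, so the dataset is split into 5 groups and the model is trained and tested 5 separate times, giving each group a turn as the test set.
So the results differ because the model is trained and scored on different random samples in each case.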
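The 5-fold procedure described above can be written out by hand to see what `cross_val_score` does internally. This is only a minimal sketch: the data comes from `make_regression` as a synthetic stand-in (an assumption, not the Google Play dataset), and `n_estimators=50` and the fixed `random_state` values are arbitrary choices made for speed and repeatability.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption): 200 rows, 5 numeric features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
fold_sizes = []
for train_idx, test_idx in kf.split(X):
    # A fresh model is trained on 4 folds and scored on the held-out fold.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # R^2 on unseen rows
    fold_sizes.append(len(test_idx))

mean_r2 = float(np.mean(fold_scores))
print("held-out fold sizes:", fold_sizes)
print("mean R^2 over folds:", mean_r2)
```

Each row lands in the held-out fold exactly once, so every score in `fold_scores` is measured on data the corresponding model never saw during training.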
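One detail in the question's code is worth checking alongside this: `rfc.fit(X, y)` is called right after `rfc.fit(X_train, y_train)`, and a second `fit` call refits the forest from scratch, so `rfc.score(X_test, y_test)` is computed on rows the model has already seen. That may account for much of the gap, since cross-validation always scores on held-out rows. The sketch below illustrates the effect under stated assumptions: the data is synthetic noise with a target unrelated to the features, and the names `leaky` and `honest` are hypothetical labels chosen for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): the target is pure noise, unrelated to the features,
# so no model can genuinely predict it on unseen rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = rng.normal(size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# "leaky": fitted on ALL rows, including the test rows (as rfc.fit(X, y) does).
leaky = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
# "honest": fitted on the training rows only.
honest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

leaky_r2 = leaky.score(X_test, y_test)    # test rows were seen during fit
honest_r2 = honest.score(X_test, y_test)  # test rows were never seen

print("R^2 when test rows were in the training data:", leaky_r2)
print("R^2 when test rows were held out:", honest_r2)
```

If the same pattern holds on the real data, removing the `rfc.fit(X, y)` line before scoring on `X_test` should bring the train/test split score much closer to the cross-validated one.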