python - Large performance difference between validation and test data
Problem description
I scraped some data from Spotify to see whether I could classify the music genre of different songs. I split my data into a test set and a remaining set, which I then further split into a training set and a validation set.
When I run the model (I am trying to classify between 112 genres), I get about 30% accuracy on the validation set. That is of course not great, but with 112 genres and limited data it is to be expected. What really puzzles me is that when I apply the model to the test data, the accuracy drops to 1%.
I am not sure why this happens: as far as I can tell, the validation and test data should be comparable. I train the model on the training data, which should be entirely independent of both.
I must be making some mistake that either lets the model peek at the validation data (hence the better performance there), or I am somehow messing up my test data.
Or maybe applying the model twice messes things up?
Any idea what could be going on, or how to debug it?
Many thanks! Franka
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle

# re-read data
track_df = pd.read_csv('track_df_corr.csv')
features = ['acousticness', 'speechiness',
            'key', 'liveness', 'instrumentalness', 'energy', 'tempo',
            'loudness', 'danceability', 'valence',
            'duration_mins', 'year', 'genre']
track_df = track_df[features]
#First make a big split of all the data into test and train.
train, test = train_test_split(track_df, test_size=0.2, random_state = 0)
#Then create training and validation sets from the train data.
# Assign train and test data
# "full" is the data before preprocessing
X_full = train
X_test_full = test
# Select the variable to be predicted
y = X_full.genre         # the target for the training data
y = pd.factorize(y)[0]   # keep only the integer codes ([0]); the classifier needs numbers
# Since we later want to evaluate the model on the test data, we also need a y_test.
# Select the variable to be predicted
y_test = X_test_full.genre         # the target for the test data
y_test = pd.factorize(y_test)[0]   # keep only the integer codes ([0]); the classifier needs numbers
# remove to be predicted variable
X_full.drop(['genre'], axis=1, inplace=True) # rest of training free of target, which is now stored in y
X_test_full.drop(['genre'], axis=1, inplace=True) # not sure if necessary but cannot hurt
# Break off validation set from training data (X_full)
# Remember we still have X_test_full as an entirely independent test set.
# Here we just create our training and validation sets from X_full.
X_train_full, X_valid_full, y_train, y_valid = \
train_test_split(X_full, y, train_size=0.8, test_size=0.2, random_state=0)
# General preprocessing steps: take care of categorical data (does not apply here).
categorical_cols = [cname for cname in X_train_full.columns if
X_train_full[cname].nunique() < 10 and
X_train_full[cname].dtype == "object"]
# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if
X_train_full[cname].dtype in ['int64', 'float64']]
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
#Time to run the model.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
#Run our model on the TRAINING data
# FRR set up input values that are passed to the Bundle below
# Preprocessing for NUMERICAL data
numerical_transformer = SimpleImputer(strategy='median')
# Preprocessing for CATEGORICAL data
categorical_transformer = Pipeline(steps=[ # FRR Pipeline of transforms with a "final estimator", here "categorical_transformer".
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# FRR Run the numerical_transformer and categorical_transformer defined above here:
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer( # frr Applies transformers to columns of an array or pandas DataFrame.
transformers=[ #frr List of (name,transformer,cols) tuples specifying the transformer objects to
#be applied to subsets of the data.
('num', numerical_transformer, numerical_cols),
('cat', categorical_transformer, categorical_cols)
])
# Define model
model = RandomForestClassifier(n_estimators=100, random_state=0)
# Bundle preprocessing and modeling code in a pipeline
# clf stands for classifier.
# Pipeline can be used to chain multiple estimators into one
# Preprocessing of training data, fit model
clf = Pipeline(steps=[('preprocessor', preprocessor),
('model', model)
])
# "Calling fit on the pipeline is the same as calling *fit* on each estimator (here: prepoc and model)
clf.fit(X_train, y_train)
# --------------------------------------------------------
#Test our model on the VALIDATION data
# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)
# Return the mean accuracy on the given test data and labels.
clf.score(X_valid, y_valid) # this is correct!
# The code yields a value around 30%.
# --------------------------------------------------------
# Apply our model to the TESTING data
# Preprocessing of test data, get predictions
preds_test = clf.predict(X_test)
clf.score(X_test, y_test)
#The code yields a value around 1%.
Solution
The problem I see is that you are using pd.factorize. Because you apply pd.factorize to y and y_test independently, the resulting encodings will not correspond to each other. You want to use a LabelEncoder instead, so that once you fit the encoder on the training labels, you can transform y_test with the same encoding scheme.
Here is an example to illustrate the point:
from sklearn.preprocessing import LabelEncoder
l = [1,4,6,1,4]
le = LabelEncoder()
le.fit(l)
le.transform(l)
# array([0, 1, 2, 0, 1], dtype=int64)
le.transform([1,6,4])
# array([0, 2, 1], dtype=int64)
Here we get consistent encodings. However, if we apply pd.factorize to each list independently, pandas obviously cannot guess which encodings are supposed to match:
pd.factorize(l)[0]
# array([0, 1, 2, 0, 1], dtype=int64)
pd.factorize([1,6,4])[0]
# array([0, 1, 2], dtype=int64)
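The same applies to the code in the question. A minimal sketch of how the fix could look there, using the same variable names (it assumes every genre in the test split also occurs in the train split, otherwise transform will raise a ValueError):
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Fit the encoder once, on the training-side labels only ...
y = le.fit_transform(X_full.genre)
# ... and reuse the exact same mapping for the test labels, so that
# each integer code refers to the same genre in y, y_valid and y_test.
y_test = le.transform(X_test_full.genre)
If you want the genre names back for your predictions, le.inverse_transform(preds_test) reverses the mapping.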