Validation and test data performance differs greatly

Problem description

I scraped some data from Spotify to see whether I could classify the music genre of different songs. I split my data into a test set and a remaining set, which I then further divided into a training set and a validation set.

When I run the model (I am trying to classify among 112 genres), I get about 30% accuracy on the validation set. That is certainly not great, but it is to be expected with 112 genres and limited data. What really puzzles me is that when I apply the model to the test data, accuracy drops to 1%.

I am not sure why this happens: as far as I can tell, the validation and test data should be comparable. I train the model on the training data, which should be entirely independent of both.

I must be making some mistake that either lets the model peek at the validation data (hence the better performance there) or that messes up my test data.

Or maybe applying the model twice messes things up?

Any idea what might be happening or how to debug it?

Many thanks! Franka


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle

# Re-read the data
track_df = pd.read_csv('track_df_corr.csv')


features = ['acousticness', 'speechiness',
            'key', 'liveness', 'instrumentalness', 'energy', 'tempo',
            'loudness', 'danceability', 'valence',
            'duration_mins', 'year', 'genre']


track_df = track_df[features]

# First make a big split of all the data into test and train.
train, test = train_test_split(track_df, test_size=0.2, random_state=0)

# Then create training and validation data sets from the training data.
# "full" is the data before preprocessing.
X_full = train
X_test_full = test

# Select the variable to be predicted.
y = X_full.genre  # the target for the training data
y = pd.factorize(y)[0]  # keep just the numeric codes ([0] drops the unique values); the classifier needs numbers
  
# Since we later want to evaluate our model on the test data, we also need a y_test.
# Select the variable to be predicted.
y_test = X_test_full.genre  # the target for the test data
y_test = pd.factorize(y_test)[0]  # keep just the numeric codes ([0] drops the unique values);
                                  # the classifier needs numbers


# Remove the variable to be predicted.
X_full.drop(['genre'], axis=1, inplace=True)  # training data is now free of the target, which is stored in y
X_test_full.drop(['genre'], axis=1, inplace=True)  # not sure if necessary, but it cannot hurt


# Break off a validation set from the training data (X_full).
# Remember we still have X_test_full as an entirely independent test set.
# Here we just create our training and validation sets from X_full.
X_train_full, X_valid_full, y_train, y_valid = \
            train_test_split(X_full, y, train_size=0.8, test_size=0.2, random_state=0)
 
# General preprocessing steps: take care of categorical data (does not apply here).

categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and 
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if
                  X_train_full[cname].dtype in ['int64', 'float64']]



# Keep selected columns only
my_cols = categorical_cols + numerical_cols

X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()



#Time to run the model.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


#Run our model on the TRAINING data
# FRR set up input values that are passed to the Bundle below

# Preprocessing for NUMERICAL data
numerical_transformer = SimpleImputer(strategy='median') 


# Preprocessing for CATEGORICAL data
categorical_transformer = Pipeline(steps=[ # FRR A Pipeline chains transforms; the final step here is the OneHotEncoder.
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# FRR Run the numerical_transformer and categorical_transformer defined above here:
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer( # frr Applies transformers to columns of an array or pandas DataFrame.
    transformers=[ #frr List of (name,transformer,cols) tuples specifying the transformer objects to 
                        #be applied to subsets of the data.
        ('num', numerical_transformer, numerical_cols), 
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define model
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline.
# clf stands for classifier.
# A Pipeline can be used to chain multiple estimators into one.

# Preprocessing of training data, fit model 
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])


# "Calling fit on the pipeline is the same as calling *fit* on each estimator (here: prepoc and model) 
clf.fit(X_train, y_train)


# --------------------------------------------------------

#Test our model on the VALIDATION data

# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)

# Return the mean accuracy on the given test data and labels.
clf.score(X_valid, y_valid) # this is correct! 

# The code yields a value around 30%. 

# --------------------------------------------------------

# Apply our model to the TESTING data.
# Preprocessing of test data, get predictions.
preds_test = clf.predict(X_test)
clf.score(X_test, y_test)

#The code yields a value around 1%. 

Tags: python, machine-learning, scikit-learn, random-forest

Solution


The problem I see is that you are using pd.factorize. Since you call pd.factorize on y and y_test independently, the resulting encodings will not correspond to one another. You want to use LabelEncoder instead: when you fit the encoder on your training data, you can then transform y_test with the same encoding scheme.

Here is an example to illustrate the point:

from sklearn.preprocessing import LabelEncoder

l = [1,4,6,1,4]
le = LabelEncoder()
le.fit(l)
le.transform(l)
# array([0, 1, 2, 0, 1], dtype=int64)
le.transform([1,6,4])
# array([0, 2, 1], dtype=int64)

Here we obtain the correct encodings. However, if we apply pd.factorize instead, pandas obviously cannot guess which encodings are the correct ones:

pd.factorize(l)[0]
# array([0, 1, 2, 0, 1], dtype=int64)
pd.factorize([1,6,4])[0]
# array([0, 1, 2], dtype=int64)
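
Applied to the question's pipeline, the fix could look roughly like the sketch below. This is a minimal illustration, not a drop-in patch: it assumes the labels are extracted before the genre column is dropped, and that every genre in the test set also appears in the training set (LabelEncoder.transform raises a ValueError on unseen labels).

from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the TRAINING labels only,
# before dropping the genre column from X_full.
le = LabelEncoder()
y = le.fit_transform(X_full.genre)

# Reuse the SAME mapping for the test labels, so the same
# integer refers to the same genre in y and in y_test.
# Raises ValueError if the test set contains a genre
# never seen in the training set.
y_test = le.transform(X_test_full.genre)

With the two encodings aligned, the test-set score should become comparable to the validation score rather than dropping to chance level.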
