python - Gaussian processes in scikit-learn: good performance on training data, poor performance on test data
Problem description
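I wrote a Python script that uses scikit-learn to fit Gaussian processes to some data.
In short: the Gaussian processes seem to learn the training dataset very well, but their predictions on the testing dataset are wrong, and I suspect a normalization problem is behind this.
In detail: my training dataset is a set of 1500 time series. Each time series has 50 time components. The mapping learned by the Gaussian processes is between a set of three coordinates x, y, z (which represent the parameters of my model) and one time series. In other words, there is a 1:1 mapping between x, y, z and one time series, and the GPs learn this mapping. The idea is that, given new coordinates, the trained GPs should be able to return the predicted time series associated with those coordinates.
Here is my code: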
from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
coordinates_training = np.loadtxt(...) # read coordinates training x, y, z from file
coordinates_testing = np.loadtxt(...) # read coordinates testing x, y, z from file
# z-score of the coordinates for the training and testing data.
# Note I am using the mean and std of the training dataset ALSO to normalize the testing dataset
mean_coords_training = np.zeros(3)
std_coords_training = np.zeros(3)
for i in range(3):
    mean_coords_training[i] = coordinates_training[:, i].mean()
    std_coords_training[i] = coordinates_training[:, i].std()
    coordinates_training[:, i] = (coordinates_training[:, i] - mean_coords_training[i])/std_coords_training[i]
    coordinates_testing[:, i] = (coordinates_testing[:, i] - mean_coords_training[i])/std_coords_training[i]
time_series_training = np.loadtxt(...)# reading time series of training data from file
number_of_time_components = np.shape(time_series_training)[1] # number of time components per series
# z_score of the time series
mean_time_series_training = np.zeros(number_of_time_components)
std_time_series_training = np.zeros(number_of_time_components)
for i in range(number_of_time_components):
    mean_time_series_training[i] = time_series_training[:, i].mean()
    std_time_series_training[i] = time_series_training[:, i].std()
    time_series_training[:, i] = (time_series_training[:, i] - mean_time_series_training[i])/std_time_series_training[i]
time_series_testing = np.loadtxt(...)# reading test data from file
# the number of time components is the same for training and testing dataset
# z-score of testing data, again using mean and std of training data
for i in range(number_of_time_components):
    time_series_testing[:, i] = (time_series_testing[:, i] - mean_time_series_training[i])/std_time_series_training[i]
# GPs
pred_time_series_training = np.zeros((np.shape(time_series_training)))
pred_time_series_testing = np.zeros((np.shape(time_series_testing)))
# Instantiate a Gaussian Process model
kernel = 1.0 * Matern(nu=1.5)
gp = GaussianProcessRegressor(kernel=kernel)
for i in range(number_of_time_components):
    print("time component", i)
    # Fit to data using Maximum Likelihood Estimation of the parameters
    gp.fit(coordinates_training, time_series_training[:,i])
    # Predict on the training and testing coordinates (ask for the standard deviation as well)
    y_pred_train, sigma_train = gp.predict(coordinates_training, return_std=True)
    y_pred_test, sigma_test = gp.predict(coordinates_testing, return_std=True)
    # Undo the z-score so the predictions are back in the original units
    pred_time_series_training[:,i] = y_pred_train*std_time_series_training[i] + mean_time_series_training[i]
    pred_time_series_testing[:,i] = y_pred_test*std_time_series_training[i] + mean_time_series_training[i]
# plot training
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
    ax[i].plot(time_series_training[100*i], color='blue', label='Original training')
    ax[i].plot(pred_time_series_training[100*i], color='black', label='GP predicted - training')
# plot testing
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
    ax[i].plot(time_series_testing[100*i], color='blue', label='Original testing')
    ax[i].plot(pred_time_series_testing[100*i], color='black', label='GP predicted - testing')
Solution
First, you should use the sklearn preprocessing tools to process your data.
from sklearn.preprocessing import StandardScaler
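There are other useful tools for organizing your data, but this particular one normalizes it. Second, you should normalize the training set and the test set with the same parameters. The model fits its parameters to the "geometry" of the data, so training it on one scale and evaluating it on another is like using the wrong system of units.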
scale = StandardScaler()
training_set = scale.fit_transform(data_train)
test_set = scale.transform(data_test)
This applies the same transformation to both sets.
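As a minimal sketch of what "the same transformation" means here, the snippet below uses synthetic stand-ins for the x, y, z coordinates (the arrays and their distributions are assumptions for illustration, not the asker's data). It checks that `StandardScaler` fitted on the training set reproduces a manual z-score of the test set computed with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical coordinates with three columns on very different scales
loc = np.array([1.0, 10.0, 100.0])
spread = np.array([0.5, 2.0, 30.0])
data_train = rng.normal(loc=loc, scale=spread, size=(200, 3))
data_test = rng.normal(loc=loc, scale=spread, size=(50, 3))

scale = StandardScaler()
training_set = scale.fit_transform(data_train)  # learns mean_ and scale_ from the training set
test_set = scale.transform(data_test)           # reuses the training statistics, no refit

# Manual z-score of the test set using the TRAINING mean and std
manual_test = (data_test - data_train.mean(axis=0)) / data_train.std(axis=0)

print("matches manual z-score:", np.allclose(test_set, manual_test))
print("training column means:", training_set.mean(axis=0))
print("training column stds:", training_set.std(axis=0))
```

The test set will generally not have exactly zero mean and unit variance after this step, which is expected: it is expressed in the units defined by the training data.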
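Finally, you need to normalize the features, not the target; that is, normalize the X inputs rather than the Y outputs. Normalization helps the model find the answer faster because it changes the topology of the objective function during optimization, and the outputs do not affect this.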
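One way this advice could look in practice is sketched below, under stated assumptions: the data are synthetic, `make_series` is a hypothetical smooth mapping from coordinates to a short time series, and all time components are fitted in a single call. Note that fitting a 2-D target this way shares one set of kernel hyperparameters across all time components, which differs from the per-component loop in the question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Hypothetical x, y, z coordinates on very different scales
col_scale = np.array([1.0, 10.0, 100.0])
X_train = rng.uniform(-1.0, 1.0, size=(60, 3)) * col_scale
X_test = rng.uniform(-1.0, 1.0, size=(10, 3)) * col_scale

t = np.linspace(0.0, 1.0, 5)  # 5 time components, kept small for speed


def make_series(X):
    """Hypothetical mapping from coordinates to a time series (illustration only)."""
    return np.sin(X[:, [0]] + t) + 0.1 * X[:, [1]] * t + 0.01 * X[:, [2]]


Y_train = make_series(X_train)  # targets stay in their original units
Y_test = make_series(X_test)

# Scale only the inputs, with statistics taken from the training set
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

kernel = 1.0 * Matern(nu=1.5)
gp = GaussianProcessRegressor(kernel=kernel)
gp.fit(X_train_s, Y_train)       # Y has shape (n_samples, n_time_components)
Y_pred = gp.predict(X_test_s)    # predictions are already in original units

rmse = np.sqrt(np.mean((Y_pred - Y_test) ** 2))
print("prediction shape:", Y_pred.shape)
print("test RMSE:", rmse)
```

Because the targets are never rescaled, no inverse transformation is needed before comparing `Y_pred` with `Y_test` or plotting them together.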
I hope this answers your question.