Calculation of training and validation R2 in regression model

Hi
I want to develop a Gaussian Process Regression model in MATLAB for predicting time series data. I divided my data into three parts: the first 70% for training, the next 15% for validation, and the last 15% for testing (my supervisor said time series data must be partitioned in order like this, not randomly). I attached the training and validation data. I generated the Gaussian Process Regression code from the Regression Learner app in MATLAB R2017a (using the holdout validation method) and tried to modify it to calculate the training and validation R2. Based on this picture, I want to know whether my code works properly and whether the data are in the right place. I got these results:
trainingR2 =
0.991316397808775
validationR2 =
0.999099977071359
function [trainedModel, validationRMSE] = trainRegressionModel(training_w21,validation_w21)
% [trainedModel, validationRMSE] = trainRegressionModel(trainingData)
% returns a trained regression model and its RMSE. This code recreates the
% model trained in Regression Learner app. Use the generated code to
% automate training the same model with new data, or to learn how to
% programmatically train models.
%
% Input:
% trainingData: a table containing the same predictor and response
% columns as imported into the app.
%
% Output:
% trainedModel: a struct containing the trained regression model. The
% struct contains various fields with information about the trained
% model.
%
% trainedModel.predictFcn: a function to make predictions on new data.
%
% validationRMSE: a double containing the RMSE. In the app, the
% History list displays the RMSE for each model.
%
% Use the code to train the model with new data. To retrain your model,
% call the function from the command line with your original data or new
% data as the input argument trainingData.
%
% For example, to retrain a regression model trained with the original data
% set T, enter:
% [trainedModel, validationRMSE] = trainRegressionModel(T)
%
% To make predictions with the returned 'trainedModel' on new data T2, use
% yfit = trainedModel.predictFcn(T2)
%
% T2 must be a table containing at least the same predictor columns as used
% during training. For details, enter:
% trainedModel.HowToPredict
% Auto-generated by MATLAB on 10-Oct-2022 09:00:24
% Extract predictors and response
% This code processes the data into the right shape for training the
% model.
clc
format long
inputTable = training_w21;
predictorNames = {'x02', 'x11', 'x13', 'x18', 'x29', 'x36'};
predictors = inputTable(:, predictorNames);
response = inputTable.x21;
isCategoricalPredictor = [false, false, false, false, false, false];
% Train a regression model
% This code specifies all the model options and trains the model.
regressionGP = fitrgp(...
predictors, ...
response, ...
'BasisFunction', 'constant', ...
'KernelFunction', 'matern52', ...
'OptimizeHyperparameters','auto',...
'Standardize', true);
% Create the result struct with predict function
predictorExtractionFcn = @(t) t(:, predictorNames);
gpPredictFcn = @(x) predict(regressionGP, x);
trainedModel.predictFcn = @(x) gpPredictFcn(predictorExtractionFcn(x));
% Add additional fields to the result struct
trainedModel.RequiredVariables = {'x02', 'x11', 'x13', 'x18', 'x29', 'x36'};
trainedModel.RegressionGP = regressionGP;
trainedModel.About = 'This struct is a trained model exported from Regression Learner R2017a.';
trainedModel.HowToPredict = sprintf('To make predictions on a new table, T, use: \n yfit = c.predictFcn(T) \nreplacing ''c'' with the name of the variable that is this struct, e.g. ''trainedModel''. \n \nThe table, T, must contain the variables returned by: \n c.RequiredVariables \nVariable formats (e.g. matrix/vector, datatype) must match the original training data. \nAdditional variables are ignored. \n \nFor more information, see <a href="matlab:helpview(fullfile(docroot, ''stats'', ''stats.map''), ''appregression_exportmodeltoworkspace'')">How to predict using an exported model</a>.');
% Extract predictors and response
% This code processes the data into the right shape for training the
% model.
%training
inputTable_tr = training_w21;
predictorNames_tr = {'x02', 'x11', 'x13', 'x18', 'x29', 'x36'};
predictors_tr = inputTable_tr(:, predictorNames);
response_tr = inputTable_tr.x21;
isCategoricalPredictor_tr = [false, false, false, false, false, false];
%validation
inputTable_val = validation_w21;
predictorNames_val = {'x02', 'x11', 'x13', 'x18', 'x29', 'x36'};
predictors_val = inputTable_val(:, predictorNames);
response_val = inputTable_val.x21;
isCategoricalPredictor_val = [false, false, false, false, false, false];
%training
trainingPredictors = predictors_tr;
trainingResponse = response_tr;
trainingIsCategoricalPredictor = isCategoricalPredictor_tr;
%validation
valPredictors = predictors_val;
valResponse = response_val;
valIsCategoricalPredictor = isCategoricalPredictor_val;
% Train a regression model
% This code specifies all the model options and trains the model.
regressionGP = fitrgp(...
valPredictors, ...
valResponse, ...
'BasisFunction', 'constant', ...
'KernelFunction', 'matern52', ...
'OptimizeHyperparameters','auto',...
'Standardize', true);
% Create the result struct with predict function
gpPredictFcn = @(x) predict(regressionGP, x);
validationPredictFcn = @(x) gpPredictFcn(x);
trainingPredictions=trainedModel.predictFcn(trainingPredictors);
%training R2
trainingr = trainingResponse-trainingPredictions;
trainingnormr = norm(trainingr);
trainingSSE = trainingnormr.^2;
trainingSST = norm(trainingResponse-mean(trainingResponse))^2;
trainingR2 = 1 - trainingSSE/trainingSST
%validation R2
validationr = valResponse-validationPredictFcn(valPredictors);
validationnormr = norm(validationr);
validationSSE = validationnormr.^2;
validationSST = norm(valResponse-mean(valResponse))^2;
validationR2 = 1 - validationSSE/validationSST

Answers (1)

Jayanti on 8 Oct 2024
Edited: Jayanti on 8 Oct 2024
Hi Seyed,
I understand that you are developing a Gaussian Process Regression model in MATLAB for time series prediction and want to verify whether the code implements it correctly.
Here are some modifications you can make to the code:
  • In the code, fitrgp is called twice: once on the training dataset and once on the validation dataset. Training on the validation dataset is not a recommended practice, since the validation data should stay unseen by the model. The purpose of the validation dataset is to tune the hyperparameters used in the model; it should not be used for training.
So, you can remove the code segment that retrains the model on the validation dataset.
  • The validation R² is currently reported from the model that was retrained on the validation dataset, so it measures fit on data that model has already seen. This approach is incorrect. Both R² values should be calculated using the model trained exclusively on the training dataset.
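The two changes above could look roughly like the sketch below. It assumes the tables training_w21 and validation_w21 from the question are already in the workspace, and it reuses the fitrgp options from the original code. The helper name rsquared is only an illustrative choice, not something from the generated code.

```matlab
% Sketch only: assumes training_w21 and validation_w21 (tables from the
% question) already exist in the workspace.

% Hypothetical helper: R^2 = 1 - SSE/SST for a response y and predictions yhat.
rsquared = @(y, yhat) 1 - sum((y - yhat).^2) / sum((y - mean(y)).^2);

predictorNames = {'x02', 'x11', 'x13', 'x18', 'x29', 'x36'};

% Train once, on the training data only (same options as the original call).
regressionGP = fitrgp( ...
    training_w21(:, predictorNames), ...
    training_w21.x21, ...
    'BasisFunction', 'constant', ...
    'KernelFunction', 'matern52', ...
    'OptimizeHyperparameters', 'auto', ...
    'Standardize', true);

% Predict both sets with that same model; there is no second fitrgp call.
trainingPredictions   = predict(regressionGP, training_w21(:, predictorNames));
validationPredictions = predict(regressionGP, validation_w21(:, predictorNames));

% Omit the semicolons to display the values, as in the original code.
trainingR2   = rsquared(training_w21.x21, trainingPredictions)
validationR2 = rsquared(validation_w21.x21, validationPredictions)
```

With this structure, validationR2 reflects how the model performs on data it was not fitted to, so it will usually differ from the value reported in the question.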
Hope it helps!
