How is the predicted value calculated when using kfoldPredict with regression?

When using kfoldPredict on a cross-validated model, what determines the predicted value?
I conducted an experiment, and it seems that the value is selected randomly across the folds. Is this correct? I was assuming the result would perhaps be the average of all the folds.
For example, when K = 5 and the regular predict function is called with each fold's model, the results are 17.25, 16.92, 15.5, 17.25, and 18, while the kfoldPredict result is 15.5.
Then with the next sample, the results are 13.88, 14.58, 14.67, 13.71, and 14.64, and the kfoldPredict result is 13.88.
Code example
clear;
load carsmall
X = [Cylinders Displacement Horsepower Weight];
Y = MPG;
% Remove observations with a NaN response
X2 = X;
Y2 = Y;
X2(isnan(Y),:) = [];
Y2(isnan(Y)) = [];
rng('default') % For reproducibility
k = 5;
CVMdl = fitrtree(X2, Y2, "KFold", k); % train on the cleaned data
% Cross-validated prediction: each observation is predicted by the
% fold whose training set did not include it
yfit = kfoldPredict(CVMdl);
mse = mean((yfit - CVMdl.Y).^2)
% Predict with every fold's model on all observations
yhat_kfold = zeros(numel(Y2), k); % preallocate
for i = 1:k
    yhat_kfold(:,i) = predict(CVMdl.Trained{i}, X2);
end
% Create table for analysis
T = table(yhat_kfold, yfit, Y2);

Answers (1)

Kausthub on 6 Sep 2023
Hi Martti Ilvesmäki,
I understand that you would like to know how the predicted value is calculated when using kfoldPredict with regression ( https://www.mathworks.com/help/stats/classreg.learning.partition.regressionpartitionedmodel.kfoldpredict.html ), and also whether the predicted value is selected randomly and, if so, why the average across all the folds is not used instead.
kfoldPredict does not select a fold randomly. The response for every observation (an input row of X) is computed using the model whose training set did not include that observation.
In your example, the per-fold predictions are 17.25, 16.92, 15.5, 17.25, and 18, and the kfoldPredict result is 15.5. This is because the observation corresponding to 15.5 was excluded from the training set of the fold-3 model, so the prediction from the fold-3 model is the one that kfoldPredict returns. Similarly, for the next sample the per-fold predictions are 13.88, 14.58, 14.67, 13.71, and 14.64, and the kfoldPredict result is 13.88 because that observation was not present in the training set of the fold-1 model.
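You can verify this from the partition stored in the cross-validated model. The following is a minimal sketch that reuses the variables CVMdl, yhat_kfold, yfit, Y2, and k from your code (assuming the model was trained on the cleaned X2 and Y2): for each observation it looks up the fold that held that observation out and checks that the corresponding column of yhat_kfold matches kfoldPredict's output.
% Find, for each observation, the fold in which it was held out
foldOfObs = zeros(numel(Y2), 1);
for i = 1:k
    foldOfObs(test(CVMdl.Partition, i)) = i;
end
% Take each row's prediction from the column of its held-out fold
idx = sub2ind(size(yhat_kfold), (1:numel(Y2))', foldOfObs);
isequal(yhat_kfold(idx), yfit) % should return true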
The main purpose of k-fold cross-validation is to estimate out-of-sample performance and thereby choose the best model, so taking an average of the predicted values does not make much sense. Even if averages were used, they would introduce bias, and any single model predicting extreme values could distort the mean squared error (MSE) entirely. Averaging the per-fold losses (not the predicted values) gives a better overall picture of the model's performance.
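For instance (a sketch assuming the same CVMdl as above), you can compute one MSE per fold and then average those losses:
% One loss value per fold, rather than one pooled loss
foldMSE = kfoldLoss(CVMdl, 'Mode', 'individual')
% Average of the per-fold losses; compare with kfoldLoss(CVMdl),
% which returns the averaged cross-validation loss by default
avgMSE = mean(foldMSE)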
Hope this helps and clarifies your queries regarding kfoldPredict!

Release

R2022a
