Advice on data normalization in k-fold cross validation

Hello,
I'd like to compare the performance of two classifiers, namely logistic regression and SVM, to see if they can accurately classify participants' binary response (0/1) from predictor data. My data is repeated measures, so I am block-partitioning my data by participant to avoid data leakage, using a custom "cvpartition". I pass this cvpartition to the functions fitclinear() and fitcsvm() to perform 10-fold cross-validation.
However, I'd like to scale/normalize my predictor data. Applying feature scaling (normalization) before splitting the data into training and test sets would result in data leakage (Kapoor & Narayanan, 2023; Zhu et al., 2023). Therefore, I would like to scale my training data separately from my test data.
Firstly, fitcsvm() has the option "Standardize", but it is unclear whether standardization occurs separately for the training and test data in each iteration of the 10-fold cross-validation, or whether the standardization occurs before the data are split, which would result in leakage.
Second, fitclinear() does not have a built-in standardize option. So it seems I cannot fairly compare the results from fitclinear() and fitcsvm() at this stage, because the normalization cannot be done in the same way.
Has anyone run into this issue before?
If so, am I better off creating my own for-loop in which I perform the 10-fold cross-validation and standardize the training and test data separately in each iteration? I am able to write this loop myself; I am merely wondering whether I am missing a function in MATLAB that does this already.
Thank you for your time.
Kapoor & Narayanan (2023) - Leakage and the reproducibility crisis in machine-learning-based science
Zhu, Yang, and Ren (2023) - Machine Learning in Environmental Research: Common Pitfalls and Best Practices (p. 17677)

6 Comments

Below is the 10-fold cross-validation loop I coded myself, where the data is normalized separately for the training and test sets.
There are 30 participants in the study, each measured 8 times.
Label = 240 x 1 array. Predictor = 240 x 7 array.
10-fold cross-validation makes 10 groups of 3 participants. FullCombs is the total number of unique combinations of 10 groups of 3 participants, without repetition.
ParInFold is a matrix that contains one such unique combination of 10 groups of 3 participants on each row. This ensures that the same 3 participants are never grouped in a single fold more than once (e.g., participants 1, 2 and 3 are only in a fold together once; however, participants 1 and 2 can be in a fold with participant 4, 5, 6 and so on).
AllParArray is a repeated array of the participant numbers 1:30 that corresponds to the repeated-measures data: 30 participants, 8 conditions.
NumModels = 2;
NumFolds = 10;
idx = 1 : 3 : 30;
parfor nIter = 1 : FullCombs % Iterate over unique fold groupings
    Groups = ParInFold(nIter, :);                      % Grab one line of unique grouping
    cvIndices = NaN(length(AllParArray), 1);           % Fold number for every observation
    tempLabel = NaN(length(AllParArray), NumModels);   % Predicted labels (240 x 2)
    for nFolds = 1 : NumFolds % Iterate to allocate group numbers
        if nFolds <= NumFolds - 1
            cvIndices(ismember(AllParArray, Groups(1, idx(nFolds):idx(nFolds+1)-1)), 1) = nFolds;
        else
            cvIndices(ismember(AllParArray, Groups(1, idx(nFolds):end)), 1) = nFolds;
        end
    end
    for nFolds = 1 : NumFolds % Iterate over folds
        % Estimate the scaling on the training folds only, then apply those
        % same statistics to the test fold (avoids re-estimating on test data)
        [Train, mu, sigma] = normalize(Predictor(cvIndices ~= nFolds, :));
        Test = normalize(Predictor(cvIndices == nFolds, :), 'center', mu, 'scale', sigma);
        for nModels = 1 : NumModels % Iterate over models
            switch nModels
                case 1
                    rng('default')
                    % Fit logistic classifier (Cost is the misclassification cost matrix, defined earlier)
                    Mdl = fitclinear(Train, Label(cvIndices ~= nFolds), 'Learner', 'logistic', ...
                        'Cost', Cost, 'Solver', 'bfgs', 'Regularization', 'ridge');
                case 2
                    rng('default')
                    % Fit SVM classifier (Standardize off: scaling already done above)
                    Mdl = fitcsvm(Train, Label(cvIndices ~= nFolds), 'KernelFunction', 'RBF', ...
                        'KernelScale', 'auto', 'Standardize', false, 'Cost', Cost);
            end
            % Store the predicted labels in a temp array for later storage into
            % the full label array; this keeps the parfor slicing valid
            tempLabel(cvIndices == nFolds, nModels) = predict(Mdl, Test);
        end
    end
    label(nIter, :, :) = tempLabel;
end
I failed to find "Applying feature scaling (normalization) before splitting data into training and test sets would result in data leakage" in those papers. Note that, in the second paper, the authors mean data transformation on all data, not just the pre-training data. So you can still first split your data into training and test sets, then apply the normalization (or any other transformation) and feed the training set to the ML model in a k-fold CV manner, and finally test the model on the test set. But be aware that the final (unseen) test set cannot be used for model selection, i.e. to decide whether the SVM or the logistic model performs better. Such a decision should be made on the validation set (within the CV). For this purpose (model selection), one robust approach is nested CV. Hope this helps.
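As a minimal sketch of that split-then-transform order (variable names are hypothetical; X is an n-by-p predictor matrix, y the labels):

```matlab
c = cvpartition(y, 'Holdout', 0.2);   % hold out 20% as the final test set
Xtrain = X(training(c), :);
Xtest  = X(test(c), :);
% Standardize using training statistics only...
[Xtrain, mu, sigma] = normalize(Xtrain);
% ...and apply those same statistics to the test set (no re-estimation)
Xtest = normalize(Xtest, 'center', mu, 'scale', sigma);
% All model building / CV happens on Xtrain; Xtest stays untouched until the end
```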
Hi Ive J,
Thank you for your time and valuable feedback. Apologies in advance for the long reply. I understand from your response that it is okay to perform normalization on training data that is then used for CV. I am still a little confused and am wondering if you would be able to clarify the following for me.
The literature switches a lot between the terms test and validation, which does not help my confusion. I haven't found a super clear example paper that details the classification steps and model selection very well, and I am extremely cautious not to accidentally have data leakage.
Currently, I'm still struggling with the difference between 1) normalizing training and test data separately (i.e., with a physical hold-out set), and 2) why separate normalization does not occur within CV. Because in CV you take a part of the training data and use it as "test/validation" data. If the validation data comes from the same distribution as the training data and is pre-processed in the same way, wouldn't that cause leakage in your cross-validation and give you over-confident estimates?
My thinking comes from Kuhn & Johnson (2013) Chapter 4 page 67. Here, the authors note that:
"The "training" data set is the general term for the samples used to create the model, while the "test" or "validation" data set is used to qualify performance."
Following that, Pargent et al. (2023) state that:
" In ML, it is always the predictive performance on new observations that is of practical and theoretical interest. Thus, we want to know how well a model trained on a specific data set will predict new, unseen data (out-of-sample performance). The ideal approach would be to collect a new sample from the same population. However, this approach is often not feasible in practice. A naive alternative would be to estimate predictive performance on the basis of the same data used to train the model (in-sample performance). Unfortunately, this procedure can lead to an extreme overestimation of predictive performance, which we demonstrate later. A better approach for model evaluation is to use resampling methods, which are a smart way of recycling the available data to estimate out-of-sample performance. The general principle is to use the available sample to simulate what happens when the trained model will be applied on new observations in a practical application. To produce a realistic estimate of expected performance, resampling methods must ensure a strict separation of model fitting and model evaluation. This rule implies that different data must be used for training and testing the model. "
So resampling (e.g., CV) is used to estimate model performance on seemingly "unseen" data when you have limited data available, by resampling the data into training and test sets.
In Kapoor & Narayanan (2023), points [L1.2] and [L3.2] mention that training and test data should be pre-processed separately and that there should be independence between them. This would be violated if I normalize first and then split, as you rightfully point out in your text. However, the process of transforming first and then splitting is exemplified in Figure 3 of Sheetal et al. (2023), which would be wrong.
So it seems it is not entirely clear in the literature when normalization should occur. This leaves me wondering whether a separate normalization should occur in each fold during CV.
Furthermore, you mention: "But be aware that, the final test (unseen) cannot be used for model selection, i.e. to decide whether SVM or logistic model performs better. Such decision should be made on the validation (in the CV) set."
But Kuhn & Johnson (2013) seem to suggest that test data is used to assess performance.
Nonetheless, in an ideal case, I would split my dataset into training and test data, which I would treat completely separately. Then, if I run CV on my training data, do I transform all of the training data and then perform CV?
Also, can I perform these steps iteratively? As in, repeatedly split my dataset into new, unique training and test sets, then CV my training set and predict on my test set in each iteration? Because the performance of my model would heavily depend on my training data, which would be randomly selected from my complete dataset. So in order to find the true performance and utilise all the data in the dataset, would that be the process?
Thank you again for your time and apologies again for the long response.
Looking forward to hearing your thoughts.
Cheers
Kuhn, M., Johnson, K. (2013). Over-Fitting and Model Tuning. In: Applied Predictive Modeling. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6849-3_4
Sheetal, A., Jiang, Z., & Di Milia, L. (2023). Using machine learning to analyze longitudinal data: A tutorial guide and best‐practice recommendations for social science researchers. Applied Psychology.
Pargent F, Schoedel R, Stachl C. Best Practices in Supervised Machine Learning: A Tutorial for Psychologists. Advances in Methods and Practices in Psychological Science. 2023;6(3). doi:10.1177/25152459231162559
Just a bit of clarification on training, validation and test sets, as in the literature the validation and test sets are sometimes used interchangeably. What I mean below is:
test: unseen data. E.g., you split your whole dataset into training and test sets, and you don't touch the test set at all until you're satisfied with your final tuned model. Any transformation you do here should be done independently (as I initially mentioned above).
training/validation: this is part of the model-building process. Say we go with CV; in each fold we divide our initial training set into a new training set (k-1 folds) and a new validation set (1 fold).
Having said that, I tend to agree with this point you made:
"2) why separate normalization does not occur within CV. Because in CV you take a part of the training data and use it as "test/validation" data. If the validation data comes from the same distribution as the training data and is pre-processed in the same way, wouldn't that cause leakage in your cross-validation and give you over-confident estimates?"
Yes, in theory that's correct if by normalization we mean z-score standardization; min-max rescaling does not suffer from this as much (intuitively, every fold ends up between 0 and 1, or whatever the upper value is). Also, make sure that you don't transform dummy variables (binary/nominal/categorical). I'm not sure that MATLAB's ML models do this during hyperparameter tuning (note that in cases where your features are independent, there is no collinearity, or they're fairly normal, it's OK to apply normalization on the whole training set). So the best practice would be to apply the standardization separately in each fold of the CV: estimate it on the k-1 training folds and apply it to the validation fold. Then you calculate the performance over the validation sets and report the mean (as MATLAB and other tools do). Since I'm not sure that MATLAB's ML models behave in this manner, you can implement your own CV loop and apply the data transformation separately to the training and validation sets in each fold, which avoids such data leakage. When you're happy with your model, then it's time to introduce the held-out test dataset, and that is the final performance of the model you should report (not the validation performance from the CV).
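A rough sketch of that per-fold standardization (X, y and a plain 10-fold cvpartition are assumed here; for a participant-blocked design you would swap in your own fold indices):

```matlab
cvp = cvpartition(y, 'KFold', 10);
valLoss = nan(cvp.NumTestSets, 1);
for k = 1:cvp.NumTestSets
    % estimate the scaling on the k-1 training folds only...
    [Xtr, mu, sigma] = normalize(X(training(cvp, k), :));
    % ...and apply it to the validation fold (do not re-estimate)
    Xval = normalize(X(test(cvp, k), :), 'center', mu, 'scale', sigma);
    Mdl = fitcsvm(Xtr, y(training(cvp, k)), 'Standardize', false); % scaling already done
    valLoss(k) = loss(Mdl, Xval, y(test(cvp, k)));
end
mean(valLoss) % mean validation loss is what you report from the CV
```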
Since I guess you want to perform model selection (SVM vs RF, for instance), this gets a bit tricky, and that's why I suggested going with a nested CV approach (read more here). In order to have a fair comparison between different models, make sure to set a fixed seed for your pipeline:
rng(123) % or whatever
The inner CV takes care of hyperparameter tuning, and the outer CV (based on performance on the validation set) is used for model selection (does SVM perform better than RF?).
Below you can find an example of such an implementation (except that you should apply your desired transformation to each fold's training and validation sets separately):
% for regression learners; change accordingly for classifiers.
% tab: training dataset (table); response: name of the response variable;
% KFolds and inopts.CategoricalPredictors are assumed to be defined earlier.
learners = array2table( ...
    ["ensemble" "CompactRegressionEnsemble"      "fitrensemble"
     "gp"       "CompactRegressionGP"            "fitrgp"
     "kernel"   "RegressionKernel"               "fitrkernel"
     "linear"   "RegressionLinear"               "fitrlinear"
     "net"      "CompactRegressionNeuralNetwork" "fitrnet"
     "svm"      "CompactRegressionSVM"           "fitrsvm"
     "tree"     "CompactRegressionTree"          "fitrtree"], ...
    "VariableNames", ["name", "class", "func"]);
cvp = cvpartition(size(tab, 1), 'KFold', KFolds);
bopts.AcquisitionFunctionName = "expected-improvement-plus";
bopts.Optimizer = "bayesopt";
learners.RMSE = nan(height(learners), KFolds);

% Nested CV
for i = 1:KFolds
    % apply your transformation on tabtrain and tabtest here
    tabtrain = tab(cvp.training(i), :);
    tabtest  = tab(cvp.test(i), :);
    % inner CV for hyperparameter tuning, on the outer training fold only
    bopts.CVPartition = cvpartition(size(tabtrain.(response), 1), 'KFold', KFolds);
    % loop over learners separately
    for j = 1:height(learners)
        tmpMdl = feval(learners.func(j), tabtrain, response, ...
            "OptimizeHyperparameters", "all", ...
            "HyperparameterOptimizationOptions", bopts, ...
            "CategoricalPredictors", inopts.CategoricalPredictors);
        learners.RMSE(j, i) = sqrt(loss(tmpMdl, tabtest, response, "LossFun", "mse"));
    end
end

% which learner performed best?
learners.RMSE = mean(learners.RMSE, 2);
learners = sortrows(learners, "RMSE", "ascend");
bestLearner = learners(1, :);
fprintf("Best learner: %s\n", bestLearner.name)

% retrain the best model from the nested CV on the whole training dataset
% (again, apply your own per-fold transformation here, similar to the above
% but with a single loop)
bopts.CVPartition = cvpartition(size(tab.(response), 1), 'KFold', KFolds);
bopts.AcquisitionFunctionName = "expected-improvement-plus";
bopts.Optimizer = "bayesopt";
tmpMdl = feval(bestLearner.func, tab, response, ...
    "OptimizeHyperparameters", "all", ...
    "HyperparameterOptimizationOptions", bopts, ...
    "CategoricalPredictors", inopts.CategoricalPredictors);
Hope this helps!
Thank you for your in-depth response Ive J! This is very clear and very helpful!
Your comment "note that in cases where your features are independent, and there is no collinearity or they're fairly normal, it's ok to apply normalization on the whole training set" is also very interesting, as I have been thinking about the influence of multicollinearity in ML.
In any case, your feedback is very insightful and I will play around with your suggestions in MATLAB to see if I can create an ML classifier on some dummy data (e.g., spirals) or some existing data sets.
Thank you again for your time. Happy holidays!
Cheers
Glad it was helpful.
Happy holidays!


Answers (1)

For data normalization, you can use MATLAB's built-in function normalize(), e.g.:
A = 1:5;
A_nor = normalize(A, 'range')
A_nor = 1×5
0 0.2500 0.5000 0.7500 1.0000
Once you have created a model with the "Standardize" option, that standardization (estimated from the training data) is applied to both your training and your testing/validation data.
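For what it's worth, with 'Standardize' set to true, fitcsvm estimates the centering and scaling from the training data only and stores them in the model (the Mu and Sigma properties); predict() then applies those stored training statistics to any new data, so no test-set information leaks in. A small sketch (XTrain, yTrain and XTest are hypothetical):

```matlab
Mdl = fitcsvm(XTrain, yTrain, 'Standardize', true);
[Mdl.Mu; Mdl.Sigma]         % training-set mean and std stored in the model
yhat = predict(Mdl, XTest); % XTest is standardized internally using Mdl.Mu and Mdl.Sigma
```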

Release: R2023b

Asked: 19 Dec 2023

Commented: 22 Dec 2023
