Advice on data normalization in k-fold cross validation
Hello,
I'd like to compare the performance of two classifiers, logistic regression and SVM, to see whether they can accurately classify participants' binary responses (0/1) from predictor data. My data are repeated measures, so I block-partition them by participant using a custom "cvpartition" to avoid data leakage. I pass this cvpartition to the functions fitclinear() and fitcsvm() to perform 10-fold cross-validation.
However, I'd also like to scale/normalize my predictor data. Applying feature scaling (normalization) before splitting the data into training and test sets would result in data leakage (Kapoor & Narayanan, 2023; Zhu et al., 2023). Therefore, I would like to scale my training data separately from my test data.
Firstly, fitcsvm() has the option "Standardize", but it is unclear whether standardization occurs separately for the training and test data in each iteration of the 10-fold cross-validation, or whether it occurs before the data are split, which would result in leakage.
Second, fitclinear() has no built-in standardization option. So it seems I cannot fairly compare the results from fitclinear() and fitcsvm() at this stage, because the normalization cannot be done in the same way.
Has anyone run into this issue before?
If so, am I better off writing my own for-loop that performs the 10-fold cross-validation and standardizes the training and test data separately in each iteration? I am able to write this loop myself; I am merely wondering whether I am missing a MATLAB function that already does this.
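For context, the loop I have in mind would look something like this (a minimal sketch with hypothetical names: `cvp` is my custom participant-blocked cvpartition, and `X`/`y` are the predictor matrix and binary response):

```matlab
% cvp: custom participant-blocked cvpartition (already built)
% X: n-by-p predictor matrix, y: n-by-1 binary response
acc = nan(cvp.NumTestSets, 1);
for k = 1:cvp.NumTestSets
    trIdx = training(cvp, k);
    teIdx = test(cvp, k);
    % Standardize with training-fold statistics only, then apply the
    % same mean/std to the test fold (no leakage).
    mu = mean(X(trIdx, :));
    sg = std(X(trIdx, :));
    Xtr = (X(trIdx, :) - mu) ./ sg;
    Xte = (X(teIdx, :) - mu) ./ sg;
    mdl = fitclinear(Xtr, y(trIdx));   % or fitcsvm(Xtr, y(trIdx))
    acc(k) = mean(predict(mdl, Xte) == y(teIdx));
end
meanAccuracy = mean(acc);
```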
Thank you for your time.
Kapoor & Narayanan (2023) - Leakage and the reproducibility crisis in machine-learning-based science
Zhu, Yang, and Ren (2023) - Machine Learning in Environmental Research: Common Pitfalls and Best Practices (p. 17677)
6 Comments
Lars K
on 19 Dec 2023
Ive J
on 20 Dec 2023
I failed to find "Applying feature scaling (normalization) before splitting data into training and test sets would result in data leakage (Kapoor & Narayanan, 2023; Zhu et al., 2023)" in those papers. Note that in the second paper the authors mean data transformation applied to all the data, not to the pre-training data. So you can still first split your data into training and test sets, then apply the normalization (or any other transformation), and feed the training set to the ML model in a k-fold CV manner. Finally, test the model on the test set. But be aware that the final (unseen) test set cannot be used for model selection, i.e. to decide whether the SVM or the logistic model performs better. Such a decision should be made on the validation set (within the CV). For this purpose (model selection), one robust approach is nested CV. Hope this helps.
Just a bit of clarification on training, validation and test sets, as in the literature the terms "validation" and "test" are sometimes used interchangeably. What I mean below is:
test: unseen data. E.g., you split your whole dataset into training and test sets, and you don't touch the test set at all until you're satisfied with your final tuned model. Any transformation you do here should be done independently (as I initially mentioned above).
training/validation: this is part of the model-building process. Say we go with CV; in each fold we divide our initial training set into a new training set (k-1 folds) and a new validation set (1 fold).
Having said that, I tend to agree with this point you made:
2) why separate normalization does not occur in CV. Because in CV, you take a part of the training data and use it as "test/validation" data. If the validation data comes from the same distribution as the training and is pre-processed in the same way, wouldn't that cause leakage in your cross-validation and give you over-confident estimations?
Yes, in theory that's correct if by normalization we mean z-score standardization; min-max rescaling does not suffer from this as much (intuitively, all folds will end up between 0 and 1, or whatever the upper value is). Also, make sure that you don't transform dummy variables (binary/nominal/categorical). I'm not sure that MATLAB's ML models do this during hyperparameter tuning (note that in cases where your features are independent, with no collinearity, and fairly normal, it's OK to apply normalization to the whole training set). So, the best practice would be to apply the standardization separately in each fold of the CV: fit it on the k-1 training folds and apply it to the validation fold. Then you would calculate the performance over all validation sets and report the mean (as MATLAB and other tools do). Since I'm not sure that MATLAB's ML models behave in this manner, you can implement your own CV loop and apply the data transformation separately to the training and validation sets in each fold, which avoids such data leakage. When you're happy with your model, it's time to introduce the held-out test dataset, and that is the final performance of the model you should report (not the validation performance from the CV).
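As a concrete sketch of that per-fold transformation (hypothetical names: `tabtrain`/`tabval` are one fold's training and validation tables, and `contVars` lists the continuous predictor columns; dummy variables are left untouched):

```matlab
contVars = ["x1", "x2"];   % hypothetical continuous predictor columns
% Fit the z-score on the training fold, keeping the center (C) and scale (S).
[Ztr, C, S] = normalize(tabtrain{:, contVars});
tabtrain{:, contVars} = Ztr;
% Apply the identical transformation to the validation fold.
tabval{:, contVars} = (tabval{:, contVars} - C) ./ S;
```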
Since, I guess, you want to perform model selection (SVM vs. RF, for instance), this gets a bit tricky, and that's why I suggested going with a nested CV approach. In order to have a fair comparison between different models, make sure to set a fixed seed for your pipeline:
rng(123) % or whatever
The inner CV takes care of hyperparameter tuning, and the outer CV (based on performance on the validation set) is used for model selection (does the SVM perform better than RF?).
Below you can find an example of such an implementation (except that you should apply your desired transformation to each fold's training and validation sets separately):
% For regression learners; change accordingly for classifiers.
% tab: training data table; response: name of the response variable;
% KFolds: number of folds; inopts.CategoricalPredictors: your categorical predictors.
learners = array2table( ...
["ensemble" "CompactRegressionEnsemble" "fitrensemble"
"gp" "CompactRegressionGP" "fitrgp"
"kernel" "RegressionKernel" "fitrkernel"
"linear" "RegressionLinear" "fitrlinear"
"net" "CompactRegressionNeuralNetwork" "fitrnet"
"svm" "CompactRegressionSVM" "fitrsvm"
"tree" "CompactRegressionTree" "fitrtree"], ...
"VariableNames", ["name", "class", "func"]);
cvp = cvpartition(size(tab, 1), 'KFold', KFolds);
bopts.AcquisitionFunctionName = "expected-improvement-plus";
bopts.Optimizer = "bayesopt";
learners.RMSE = nan(height(learners), KFolds);
% Nested CV
for i = 1:KFolds
tabtrain = tab(cvp.training(i), :);
tabtest = tab(cvp.test(i), :);
% apply your transformation here: fit it on tabtrain, then apply it to tabtest
bopts.CVPartition = cvpartition(size(tabtrain.(response), 1), 'KFold', KFolds);
% loop over learners separately
for j = 1:height(learners)
tmpMdl = feval(learners.func(j), tabtrain, response, ...
"OptimizeHyperparameters","all", ...
"HyperparameterOptimizationOptions", bopts, ...
"CategoricalPredictors", inopts.CategoricalPredictors);
learners.RMSE(j, i) = sqrt(loss(tmpMdl, tabtest, response, "LossFun", "mse"));
end
end
% which learner performed best?
learners.RMSE = mean(learners.RMSE, 2);
learners = sortrows(learners, "RMSE", "ascend");
bestLearner = learners(1, :);
% retrain the best model from the nested CV on the whole training dataset
fprintf("Best learner: %s\n", bestLearner.name)
% again here, you should apply your own CV (similar to above but a
% single loop)...
bopts.CVPartition = cvpartition(size(tab.(response), 1), 'KFold', KFolds);
bopts.AcquisitionFunctionName = "expected-improvement-plus";
bopts.Optimizer = "bayesopt";
tmpMdl = feval(bestLearner.func, tab, response, ...
"OptimizeHyperparameters","all", ...
"HyperparameterOptimizationOptions", bopts, ...
"CategoricalPredictors", inopts.CategoricalPredictors);
Hope this helps!
Lars K
on 22 Dec 2023
Ive J
on 22 Dec 2023
Glad it was helpful.
Happy holidays!
Answers (1)
For data normalization, you can use MATLAB's built-in function normalize(), e.g.:
A = 1:5;
A_nor = normalize(A, 'range')
Once you have created a model with the "Standardize" option, that standardization would be applicable to both your training and testing/validation data.
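Note that normalize() can also return the centering and scaling values it used, so the same transformation fitted on training data can be reused on test data (a sketch; `Xtrain`/`Xtest` are hypothetical):

```matlab
[Ztrain, C, S] = normalize(Xtrain);   % z-score fitted on the training data
Ztest = (Xtest - C) ./ S;             % reuse the training center/scale on test data
```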