# Cross-validation ROC curve in discriminant analysis

Marta on 13 May 2015
Commented: Greg Heath on 25 Jun 2015
Hello!
I have a series of markers (e.g. weight), and I want to measure how well each one predicts a given disease state (healthy vs. unhealthy) using 140 cases.
I can build a discriminant classifier using fitcdiscr and then estimate the cross-validation accuracy with kfoldLoss(crossval(classifier)). However, because the prevalence in my sample is not close to 0.5, the operating point at which the accuracy is computed is heavily skewed towards type I errors, so the accuracy figure is not very useful.
I would instead like to compute a cross-validated receiver operating characteristic (ROC) curve. I can:
1. Build a classifier with fitcdiscr, compute the out-of-fold prediction scores using kfoldPredict(crossval(clas)), and build the ROC from those scores. This, however, gives me an ROC based on the posterior probabilities derived from my weight rather than on weight itself, which is not what I want.
2. Build my own code in which I vary the threshold of the classification and do a leave-one-out analysis and estimate the ROC from that. This would be quite time consuming though.
3. Find a higher-order MATLAB command that would enable me to do 2. Can you help me implement this?
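As a sketch, option 1 would look roughly like the following (variable names are hypothetical: y holds the labels, X the marker values; perfcurve sweeps the classification threshold automatically):

```matlab
% Cross-validated ROC from out-of-fold posterior scores (option 1).
% Hypothetical setup: X is the 140-by-p marker matrix, y the class labels.
mdl   = fitcdiscr(X, y);
cvmdl = crossval(mdl);                 % 10-fold cross-validation by default
[~, score] = kfoldPredict(cvmdl);      % out-of-fold posterior probabilities
% Column order of score follows mdl.ClassNames; assume 'unhealthy' is column 2.
[fpr, tpr, thr, auc] = perfcurve(y, score(:,2), 'unhealthy');
plot(fpr, tpr), xlabel('False positive rate'), ylabel('True positive rate')

% Note: for a single marker, perfcurve can also be applied to the raw
% marker values directly, with no classifier in between:
[fpr2, tpr2] = perfcurve(y, X(:,1), 'unhealthy');
```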
Many thanks for your help! Marta

Ilya on 13 May 2015
In principle, I don't see anything wrong with your proposal 1. There is a caveat, however: you shouldn't use the same data to obtain the optimal threshold and to estimate the model accuracy. If you did that, your estimate of classification accuracy would be optimistically biased. Ideally, you would hold out a subset of your data, say 40 cases, for measuring the accuracy. Then you could train a discriminant on 100 cases, cross-validate it to get the optimal threshold, and then apply this discriminant and this threshold to the held-out 40 cases. Unfortunately, your dataset is small to begin with, and holding out 40 cases could noticeably reduce the accuracy of the discriminant.
One workaround would be to drive prevalence closer to 0.5 by giving more probability to the rare class. You can do it by setting 'Prior' to 'uniform' for fitcdiscr. You would then always threshold the posterior probabilities at 0.5.
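A minimal sketch of that workaround (hypothetical variable names):

```matlab
% Equalize the class priors so that thresholding the posterior at 0.5
% is a reasonable operating point despite the unbalanced sample.
mdl   = fitcdiscr(X, y, 'Prior', 'uniform');
cvmdl = crossval(mdl);
err   = kfoldLoss(cvmdl);   % cross-validated misclassification rate
```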
Another workaround would be to write your own cross-validation utility. Split your data in 10 parts. Train discriminant on 9 parts out of 10. Cross-validate this discriminant to find the optimal threshold. Apply this discriminant with this threshold to the remaining 1/10 of the data and record correct and incorrect classifications. Repeat. This double cross-validation procedure would give you an unbiased estimate of the accuracy for your discriminant.
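The double cross-validation could be sketched like this (hypothetical variable names; picking the threshold that maximizes Youden's J, i.e. TPR minus FPR, is just one possible criterion):

```matlab
% Double (nested) cross-validation: the inner CV picks a threshold,
% the outer CV measures accuracy on data never used for that choice.
outer = cvpartition(y, 'KFold', 10);
nCorrect = 0;
for k = 1:outer.NumTestSets
    trIdx = training(outer, k);
    teIdx = test(outer, k);

    % Inner cross-validation on the 9/10 training part: choose the
    % posterior threshold that maximizes Youden's J statistic.
    mdl = fitcdiscr(X(trIdx,:), y(trIdx));
    [~, cvScore] = kfoldPredict(crossval(mdl));
    [fpr, tpr, thr] = perfcurve(y(trIdx), cvScore(:,2), 'unhealthy');
    [~, best] = max(tpr - fpr);
    cutoff = thr(best);

    % Apply the trained discriminant with this threshold to the
    % held-out 1/10 and record correct classifications.
    [~, score] = predict(mdl, X(teIdx,:));
    pred  = score(:,2) >= cutoff;
    truth = strcmp(y(teIdx), 'unhealthy');
    nCorrect = nCorrect + sum(pred == truth);
end
accuracy = nCorrect / numel(y);
```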
Greg Heath on 25 Jun 2015
If priors are not equal, you can simulate data to equalize them. I just assume a Gaussian distribution for the simulated data; the mean and covariance are estimated from the actual data.
Then, after training, you can multiply the probability estimates by the correct prior ratio and the classification costs to estimate the Bayes risk.
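That reweighting step might be sketched as follows (the prior and cost values below are made-up placeholders, not from the question):

```matlab
% After training with equalized priors, rescale the posteriors by the
% true class prevalences and weight by misclassification costs, then
% pick the class with the lowest expected cost (Bayes risk).
truePrior = [0.8 0.2];            % assumed true prevalences (placeholder)
cost      = [0 1; 5 0];           % cost(i,j): true class i decided as j
[~, post] = predict(mdl, Xnew);   % posteriors under uniform priors
adj  = post .* truePrior;         % rescale by the true priors
adj  = adj ./ sum(adj, 2);        % renormalize to probabilities
risk = adj * cost;                % expected cost of each decision
[~, decision] = min(risk, [], 2); % minimum-risk class per case
```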
For an example, Google
unbalanced priors BioID
Hope this helps.
Greg