Cross-validation ROC curve in discriminant analysis

Question

Marta on 13 May 2015

Direct link to this question

https://ch.mathworks.com/matlabcentral/answers/216557-cross-validation-roc-curve-in-discriminant-analysis

Commented: Greg Heath on 25 Jun 2015

Hello!

I have a series of markers (e.g. weight), and I want to measure how well each one predicts a given disease state (healthy, unhealthy) using 140 cases.

I can build a discriminant classifier using fitcdiscr and then estimate the cross-validation error with kfoldLoss(crossval(classifier)) (the accuracy is one minus this value). However, because the prevalence in my sample is not close to 0.5, the operating point at which the accuracy is computed is heavily skewed towards type I errors, so the accuracy is not very useful.
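To illustrate the problem described above, here is a minimal sketch that looks at the cross-validated confusion matrix next to the overall error. The data are synthetic stand-ins (one marker, 140 cases, an assumed 20% prevalence), and the variable names are only for illustration, not the real data from the question.

```matlab
% Synthetic stand-in for the real data (assumption): 140 cases, one marker,
% 28 unhealthy (1) and 112 healthy (0) cases.
rng(1);
rec  = [ones(28,1); zeros(112,1)];
data = randn(140,1) + 1.0*rec;          % marker shifted upwards in class 1

clas = fitcdiscr(data,rec);
cv   = crossval(clas);                  % 10-fold cross-validation by default
err  = kfoldLoss(cv);                   % overall cross-validated error
pred = kfoldPredict(cv);                % cross-validated predicted labels

% Rows are true classes, columns are predicted classes, both ordered [0 1]
C = confusionmat(rec,pred,'Order',[0 1]);
sensitivity = C(2,2) / sum(C(2,:));     % fraction of class 1 that is detected
specificity = C(1,1) / sum(C(1,:));     % fraction of class 0 that is detected
fprintf('error %.3f, sensitivity %.3f, specificity %.3f\n', ...
        err, sensitivity, specificity);
```

With unbalanced classes, the single error number can look acceptable while one of the two per-class rates is poor, which is why an ROC curve is more informative here.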

I would instead like to compute a cross-validated receiver operating characteristic (ROC) curve. I can:

1. Build a classifier with fitcdiscr, compute the prediction scores using kfoldPredict(crossval(clas)) and use these scores as a new classifier on which to compute the ROC. This, however, gives me a ROC using the posterior probabilities of my weight as a classifier instead of weight itself, which is not what I want.

2. Build my own code in which I vary the threshold of the classification and do a leave-one-out analysis and estimate the ROC from that. This would be quite time consuming though.

3. Find a higher-level MATLAB command that would enable me to do 2. Can you help me implement this?

Many thanks for your help! Marta

Answer 1

Ilya on 13 May 2015

Direct link to this answer

https://ch.mathworks.com/matlabcentral/answers/216557-cross-validation-roc-curve-in-discriminant-analysis#answer_179016

In principle, I don't see anything wrong with your proposal 1. There is a caveat, however. You shouldn't use the same data to obtain the optimal threshold and to estimate the model accuracy. If you did that, your estimate of classification accuracy would be optimistically biased. Ideally, you would hold out a subset of your data, say 40 cases, for measuring the accuracy. Then you could train the discriminant on 100 cases, cross-validate it to get the optimal threshold, and then apply this discriminant and this threshold to the held-out 40 cases. Unfortunately, your dataset is small to begin with, and holding out 40 cases could noticeably reduce the accuracy of the discriminant.

One workaround would be to drive prevalence closer to 0.5 by giving more probability to the rare class. You can do it by setting 'Prior' to 'uniform' for fitcdiscr. You would then always threshold the posterior probabilities at 0.5.
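A minimal sketch of this workaround, on synthetic stand-in data (one marker, an assumed 20% prevalence; none of this is the real dataset). It fits the discriminant with the default empirical prior and with 'Prior','uniform', and compares how many rare-class cases each one detects on the same folds.

```matlab
% Synthetic stand-in data (assumption): 28 unhealthy (1), 112 healthy (0)
rng(1);
rec  = [ones(28,1); zeros(112,1)];
data = randn(140,1) + 1.0*rec;

clasEmp = fitcdiscr(data,rec);                     % default 'Prior' is 'empirical'
clasUni = fitcdiscr(data,rec,'Prior','uniform');   % equal prior for both classes

% Use one shared partition so both models are evaluated on the same folds
cvp = cvpartition(rec,'KFold',10);
predEmp = kfoldPredict(crossval(clasEmp,'CVPartition',cvp));
predUni = kfoldPredict(crossval(clasUni,'CVPartition',cvp));

% Fraction of true class-1 cases that each classifier detects
sensEmp = mean(predEmp(rec==1) == 1);
sensUni = mean(predUni(rec==1) == 1);
fprintf('sensitivity: empirical prior %.3f, uniform prior %.3f\n', ...
        sensEmp, sensUni);
```

With a uniform prior, the predicted label is simply the class with posterior probability above 0.5, so no separate threshold search is needed.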

Another workaround would be to write your own cross-validation utility. Split your data into 10 parts. Train the discriminant on 9 parts out of 10. Cross-validate this discriminant to find the optimal threshold. Apply this discriminant with this threshold to the remaining 1/10 of the data and record correct and incorrect classifications. Repeat for each of the 10 parts. This double (nested) cross-validation procedure would give you an unbiased estimate of the accuracy for your discriminant.
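The double cross-validation steps could be sketched roughly as follows. The data are synthetic stand-ins, and the choice of "optimal" threshold (the point maximising TPR minus FPR, i.e. the Youden index) is an assumption, since the answer does not say which criterion to use.

```matlab
% Synthetic stand-in data (assumption): 28 unhealthy (1), 112 healthy (0)
rng(1);
rec  = [ones(28,1); zeros(112,1)];
data = randn(140,1) + 1.0*rec;

outer   = cvpartition(rec,'KFold',10);   % stratified outer folds
correct = false(numel(rec),1);           % one entry per case, filled when held out

for k = 1:outer.NumTestSets
    trIdx = training(outer,k);
    teIdx = test(outer,k);

    % Train on 9 of the 10 parts
    clas = fitcdiscr(data(trIdx,:),rec(trIdx));

    % Inner cross-validation on the training part only, to pick a threshold
    [~,innerScore] = kfoldPredict(crossval(clas));
    [fpr,tpr,thr]  = perfcurve(rec(trIdx),innerScore(:,2),1);
    [~,best]  = max(tpr - fpr);          % Youden index (assumed criterion)
    threshold = thr(best);

    % Apply this discriminant and this threshold to the held-out part
    [~,outerScore] = predict(clas,data(teIdx,:));
    pred = double(outerScore(:,2) >= threshold);
    correct(teIdx) = (pred == rec(teIdx));
end

nestedAccuracy = mean(correct);
fprintf('nested cross-validation accuracy: %.3f\n', nestedAccuracy);
```

The held-out part never influences the threshold, which is the point of the procedure. The same loop can record sensitivity and specificity instead of accuracy if those are more relevant.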

4 Comments

Marta on 19 May 2015

Edited: Marta on 19 May 2015

Hi Ilya.

Many thanks for your answer.

I want to compute ROC curves for my classifier with a leave-one-out analysis. I have just found out about "cvpartition" which solves my problem. This leads me to the following question:

Why do the ROCs computed using

 % kfoldPredict version
clas = fitcdiscr(data,rec,'DiscrimType','quadratic');
[~,sco2] = kfoldPredict(crossval(clas,'leaveout','on'));

and

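 % cvpartition version
c = cvpartition(numel(rec),'LeaveOut');   % 'LeaveOut' expects the number of observations
sco = zeros(c.NumTestSets,1);             % held-out score for the second class
lab = zeros(c.NumTestSets,1);             % held-out true label
for it=1:c.NumTestSets
    trainIdx = training(c,it);
    testIdx = test(c,it);
    % index whole rows so that multi-column data keeps all predictors
    clas = fitcdiscr(data(trainIdx,:),rec(trainIdx),'DiscrimType','quadratic');
    [~,scotemp] = predict(clas,data(testIdx,:));
    sco(it) = scotemp(2);                 % second column = second class in clas.ClassNames
    lab(it) = rec(testIdx);
end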

look so different?

 % compare ROC curves
[X2,Y2,T2,AUC2,OPTROC2] = perfcurve(rec,sco2(:,2),1);
[X,Y,T,AUC,OPTROC] = perfcurve(lab,sco,1);

Why does kfoldPredict always give a higher AUC? Is the classifier training on the entire dataset rather than just the training subset?

Thank you for your help,

Marta

Ilya on 27 May 2015

Yes, you found the correct answers to both of your questions. Sorry, I missed your posts; I guess I just don't look here that often.

Greg Heath on 25 Jun 2015

If the priors are not equal, you can simulate data to equalize them. I just assume a Gaussian distribution for the simulated data, with the mean and covariance estimated from the actual data.

Then, after training you can multiply the probability estimates by the correct prior ratio and classification costs to estimate the Bayesian Risk.
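One possible reading of this suggestion, as a rough sketch on synthetic stand-in data: pad the rare class with Gaussian draws until the classes are balanced, train on the balanced set, then re-weight the posteriors by the original priors. Classification costs are left out here, and the use of mvnrnd and the variable names are my assumptions rather than Greg's actual code.

```matlab
% Synthetic stand-in data (assumption): 28 unhealthy (1), 112 healthy (0)
rng(1);
rec  = [ones(28,1); zeros(112,1)];
data = randn(140,1) + 1.0*rec;

truePrior = [mean(rec==0), mean(rec==1)];   % priors in the original sample

% Simulate extra rare-class cases from a Gaussian fitted to the real ones
rare      = data(rec==1,:);
nExtra    = sum(rec==0) - sum(rec==1);
simulated = mvnrnd(mean(rare,1), cov(rare), nExtra);

dataBal = [data; simulated];
recBal  = [rec; ones(nExtra,1)];

clas = fitcdiscr(dataBal,recBal);           % empirical priors are now 0.5 / 0.5

% Posteriors under equal priors (scored on the original cases, for illustration only)
[~,postBal] = predict(clas,data);

% Re-weight by the original priors and renormalise each row
postTrue = bsxfun(@times, postBal, truePrior);
postTrue = bsxfun(@rdivide, postTrue, sum(postTrue,2));
```

Because the balanced priors are equal, multiplying each posterior column by the original prior and renormalising gives the posterior under the original priors for the same fitted class densities. A cost matrix could be applied to postTrue afterwards to get the expected risk of each decision.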

For an example, Google

unbalanced priors BioID

Hope this helps.

Greg
