Classifier cross-validation on grouped observations with different class ratios
For my master thesis I am trying to develop a classification scheme for classifying tumorous vs. healthy tissue using hyperspectral imaging. For this, I am building a database of labeled observations, which I would like to use to determine the optimal classifier and classifier parameters for this particular problem.
I will first try to clarify what my data looks like and what 'problems' it contains; then I will explain my question.
One hyperspectral image is a 3-dimensional matrix. Essentially, each hyperspectral 'cube' consists of a stack of 2D grayscale images, where each individual 2D grayscale image corresponds to the reflection intensity of a specimen at a specific wavelength. Let this be an MxNxW matrix, where MxN are the spatial dimensions and W is the number of different wavelengths under consideration. Each hyperspectral cube is normalized between 0 and 1 using certain reference materials.
For each patient under consideration I have taken such a hyperspectral image of some tissue that was excised during surgery. Using a reference, I have labeled 'tumor' and 'healthy' tissue in these images. E.g. if position (100,100,:) contains tumor, I save this observation as x1, and x1 will have dimensions 1xW, where W is the number of wavelengths measured (in my case W=256).
So far I have 19 measurements (hyperspectral cubes) from 19 different patients, and for each patient a different number of 'tumour' and 'healthy' pixels is manually labeled due to differences in specimen size. E.g. for patient_1 I might have 800 'tumour' and 2000 'healthy' pixels, whereas for patient_n I might have 2000 'tumour' and 200 'healthy' pixels (these numbers are representative of the actual ranges between patients).
For this classification problem there are 2 types of variation that I should take into account. First of all, I expect inter-patient variability, meaning that tumor pixels in patient 1 might differ from those in patient 2. Second of all, I expect variability between measurements taken within each patient, meaning that tumor pixel 1 in patient 1 might differ from tumor pixel 2 in patient 1. Furthermore, I expect pixels that are close to each other to be fairly similar, i.e. these measurement vectors are highly correlated.
Since I expect that most of the variation lies between patients, I was thinking about performing cross-validation on 'grouped' observations, where the patient ID would correspond to the different groups.
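To make the grouped idea concrete, here is a small sketch (in Python, with invented per-patient pixel counts, since no code was given): whole patients are assigned to folds, greedily balancing the total number of observations per fold, so no patient's pixels are split between training and validation.

```python
# Sketch of a grouped K-fold split: every pixel from one patient stays in the
# same fold, so the validation fold never contains pixels from a patient that
# was seen in training. The patient sizes below are illustrative, not real data.
def group_kfold(obs_per_patient, k=5):
    """Assign each patient (group) to one of k folds, greedily balancing
    the total number of observations per fold."""
    folds = [[] for _ in range(k)]
    sizes = [0] * k
    # Place the largest patients first so the greedy balancing works well
    for patient, n in sorted(obs_per_patient.items(), key=lambda kv: -kv[1]):
        i = sizes.index(min(sizes))      # emptiest fold so far
        folds[i].append(patient)
        sizes[i] += n
    return folds

obs_per_patient = {f"patient_{p}": n
                   for p, n in enumerate([2800, 2200, 1500, 900, 2600,
                                          700, 1900, 1300, 2400, 1100], 1)}
folds = group_kfold(obs_per_patient, k=5)
for i, fold in enumerate(folds):
    print(i, fold, sum(obs_per_patient[p] for p in fold))
```

Balancing on total pixel count keeps fold sizes comparable, but note it does nothing about the class ratio inside each fold, which is exactly the concern raised below.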
Apparently stratified 10-fold cross-validation is used most often, as it has a good bias-variance tradeoff. However, if I were to split the folds on patient IDs, how would it be possible to obtain similar ratios of the tumour and healthy classes in both the training and testing groups, considering that for 2 patients I have only 'healthy' class data available?
I can imagine that if I split my patients into 10 groups, some of these groups would contain a significantly larger number of 'healthy' than 'tumor' observations. If I understand the theory correctly, this will give me a biased estimate of my classification accuracy, right?
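To see how large that imbalance can get, here is a quick check with invented per-patient counts in roughly the ranges mentioned above (including two tumour-free patients); the fold assignment is also made up for illustration:

```python
# Per-patient (tumour, healthy) pixel counts -- invented, but mimicking the
# ranges described in the question, including two patients with no tumour pixels.
counts = {
    "p1": (800, 2000), "p2": (2000, 200), "p3": (0, 1500),
    "p4": (1200, 900), "p5": (0, 2200),  "p6": (500, 500),
}
folds = [["p1", "p3"], ["p2", "p5"], ["p4", "p6"]]  # e.g. 3 patient-level folds

fracs = []
for fold in folds:
    t = sum(counts[p][0] for p in fold)   # tumour pixels in this fold
    h = sum(counts[p][1] for p in fold)   # healthy pixels in this fold
    fracs.append(t / (t + h))
    print(fold, f"tumour fraction = {fracs[-1]:.2f}")
```

With these toy numbers the tumour fraction ranges from about 0.19 to 0.55 across folds, which illustrates the concern: accuracy measured on such folds is not directly comparable, and class-imbalance-robust metrics (e.g. sensitivity/specificity per fold) may be worth reporting alongside it.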
Furthermore, since in some patients the area of the specimen is large, the total number of observations for that patient is large. Running cross-validation with splitting based on patient ID would therefore cause the size of (especially) the validation group to vary significantly between folds, which would give me a biased estimate as well, right?
Would random sampling perhaps be a better option? How about dividing the data into training, validation and testing groups? How else could I perform model selection and hyper-parameter optimization without introducing significant bias?
The main limitation of my current dataset, I think, is the number of patients that I am able to obtain during the course of my thesis (max. 30 total). With the data available, I would nevertheless like to give accurate estimates of classifier performance.
I would love to hear your insights about this particular problem, as I cannot find any meaningful literature regarding cross-validation of 'grouped' variables.
Ilya on 14 Feb 2017
I haven't understood what you mean by "performing cross validation on 'grouped' observations. Where the patient ID would correspond to different groups". If you wanted to stratify by patient ID, you would obtain splittings (folds) with the same fraction of data from a specific patient in each fold. That is, if you have say patient A with Na observations and patient B with Nb observations and if you want 10-fold cross-validation stratified by patient ID, each fold gets 0.1*Na observations for patient A and 0.1*Nb observations for patient B. If that's what you want, I do not understand why you believe that "this would cause the size of (especially) the validation group to change significantly in size between folds". The size of each fold would be 10% of the dataset.
If, on the other hand, you wanted to put all observations for patient A in one fold, all observations for patient B in another fold, and so on, your folds would be of unequal size. But this would not be stratified cross-validation.
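A tiny numeric illustration of that distinction, with invented patient sizes:

```python
# Invented observation counts per patient, just to contrast the two schemes.
patients = {"A": 3000, "B": 500, "C": 1200}
k = 10

# Stratified by patient ID: each fold receives ~1/k of every patient's
# observations, so every fold holds roughly 10% of the dataset.
stratified_fold_size = sum(n // k for n in patients.values())
print("stratified fold size:", stratified_fold_size)

# Grouped (one patient per fold): fold sizes are simply the patient sizes,
# hence unequal folds.
print("grouped fold sizes:", sorted(patients.values()))
```

Note that stratifying by patient ID equalizes fold sizes but puts every patient in both training and validation, so it no longer tests generalization to unseen patients; that is the trade-off between the two schemes.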