Feature selection using neighborhood component analysis for classification

mdl = fscnca(X,Y,Name,Value) performs feature selection for classification with additional options specified by one or more name-value pair arguments.
Generate toy data where the response variable depends on the 3rd, 9th, and 15th predictors.
rng(0,'twister'); % For reproducibility
N = 100;
X = rand(N,20);
y = -ones(N,1);
y(X(:,3).*X(:,9)./X(:,15) < 0.4) = 1;
Fit the neighborhood component analysis model for classification.
mdl = fscnca(X,y,'Solver','sgd','Verbose',1);
o Tuning initial learning rate: NumTuningIterations = 20, TuningSubsetSize = 100

|===============================================|
|    TUNING    | TUNING SUBSET |   LEARNING    |
|     ITER     |   FUN VALUE   |     RATE      |
|===============================================|
|            1 | -3.755936e-01 |  2.000000e-01 |
|            2 | -3.950971e-01 |  4.000000e-01 |
|            3 | -4.311848e-01 |  8.000000e-01 |
|            4 | -4.903195e-01 |  1.600000e+00 |
|            5 | -5.630190e-01 |  3.200000e+00 |
|            6 | -6.166993e-01 |  6.400000e+00 |
|            7 | -6.255669e-01 |  1.280000e+01 |
|            8 | -6.255669e-01 |  1.280000e+01 |
|            9 | -6.255669e-01 |  1.280000e+01 |
|           10 | -6.255669e-01 |  1.280000e+01 |
|           11 | -6.255669e-01 |  1.280000e+01 |
|           12 | -6.255669e-01 |  1.280000e+01 |
|           13 | -6.255669e-01 |  1.280000e+01 |
|           14 | -6.279210e-01 |  2.560000e+01 |
|           15 | -6.279210e-01 |  2.560000e+01 |
|           16 | -6.279210e-01 |  2.560000e+01 |
|           17 | -6.279210e-01 |  2.560000e+01 |
|           18 | -6.279210e-01 |  2.560000e+01 |
|           19 | -6.279210e-01 |  2.560000e+01 |
|           20 | -6.279210e-01 |  2.560000e+01 |

o Solver = SGD, MiniBatchSize = 10, PassLimit = 5

|==========================================================================================|
|   PASS   |   ITER   | AVG MINIBATCH | AVG MINIBATCH |   NORM STEP   |    LEARNING    |
|          |          |   FUN VALUE   |   NORM GRAD   |               |      RATE      |
|==========================================================================================|
|        0 |        9 | -5.658450e-01 |  4.492407e-02 |  9.290605e-01 |   2.560000e+01 |
|        1 |       19 | -6.131382e-01 |  4.923625e-02 |  7.421541e-01 |   1.280000e+01 |
|        2 |       29 | -6.225056e-01 |  3.738784e-02 |  3.277588e-01 |   8.533333e+00 |
|        3 |       39 | -6.233366e-01 |  4.947901e-02 |  5.431133e-01 |   6.400000e+00 |
|        4 |       49 | -6.238576e-01 |  3.445763e-02 |  2.946188e-01 |   5.120000e+00 |

Two norm of the final step = 2.946e-01
Relative two norm of the final step = 6.588e-02, TolX = 1.000e-06
EXIT: Iteration or pass limit reached.
Plot the selected features. The weights of the irrelevant features should be close to zero.
figure()
plot(mdl.FeatureWeights,'ro')
grid on
xlabel('Feature index')
ylabel('Feature weight')
fscnca correctly detects the relevant features.
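For instance, you can recover the selected feature indices programmatically. The following is a minimal sketch; the 0.1 cutoff is an arbitrary illustrative threshold, not a value prescribed by fscnca. For this toy data set, the result should correspond to the relevant predictors 3, 9, and 15.
relidx = find(mdl.FeatureWeights > 0.1) % Indices of features with nonnegligible weight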
Load sample data
load ovariancancer;
whos
  Name        Size                Bytes  Class     Attributes

  grp       216x1                 25056  cell
  obs       216x4000            3456000  single
This example uses the high-resolution ovarian cancer data set that was generated using the WCX2 protein array. After some preprocessing steps, the data set has two variables: obs and grp. The obs variable consists of 216 observations with 4000 features. Each element in grp defines the group to which the corresponding row of obs belongs.
Divide data into training and test sets
Use cvpartition to divide the data into a training set of size 160 and a test set of size 56. Both the training set and the test set have roughly the same group proportions as in grp.
rng(1); % For reproducibility
cvp = cvpartition(grp,'holdout',56)
cvp =

Hold-out cross validation partition
   NumObservations: 216
       NumTestSets: 1
         TrainSize: 160
          TestSize: 56
Xtrain = obs(cvp.training,:);
ytrain = grp(cvp.training,:);
Xtest  = obs(cvp.test,:);
ytest  = grp(cvp.test,:);
Determine if feature selection is necessary
Compute generalization error without fitting.
nca = fscnca(Xtrain,ytrain,'FitMethod','none');
L = loss(nca,Xtest,ytest)
L = 0.0893
This option computes the generalization error of the neighborhood component analysis (NCA) feature selection model using the initial feature weights (in this case the default feature weights) provided in fscnca.
Fit NCA without regularization parameter (Lambda = 0)
nca = fscnca(Xtrain,ytrain,'FitMethod','exact','Lambda',0,...
    'Solver','sgd','Standardize',true);
L = loss(nca,Xtest,ytest)
L = 0.0714
The improvement in the loss value suggests that feature selection is a good idea. Tuning the regularization parameter value usually improves the results.
Tune the regularization parameter for NCA using five-fold cross-validation
Tuning the regularization parameter means finding the lambda value that produces the minimum classification loss. To tune lambda using cross-validation:
1. Partition the training data into five folds and extract the number of validation (test) sets. For each fold, cvpartition assigns four-fifths of the data as a training set and one-fifth of the data as a test set.
cvp = cvpartition(ytrain,'kfold',5);
numvalidsets = cvp.NumTestSets;
Assign the lambda values and create an array to store the loss function values.
n = length(ytrain);
lambdavals = linspace(0,20,20)/n;
lossvals = zeros(length(lambdavals),numvalidsets);
2. Train the NCA model for each lambda value, using the training set in each fold.
3. Compute the classification loss for the corresponding test set in the fold using the NCA model. Record the loss value.
4. Repeat this process for all folds and all lambda values.
for i = 1:length(lambdavals)
    for k = 1:numvalidsets
        X = Xtrain(cvp.training(k),:);
        y = ytrain(cvp.training(k),:);
        Xvalid = Xtrain(cvp.test(k),:);
        yvalid = ytrain(cvp.test(k),:);
        nca = fscnca(X,y,'FitMethod','exact', ...
            'Solver','sgd','Lambda',lambdavals(i), ...
            'IterationLimit',30,'GradientTolerance',1e-4, ...
            'Standardize',true);
        lossvals(i,k) = loss(nca,Xvalid,yvalid,'LossFunction','classiferror');
    end
end
Compute the average loss obtained from the folds for each lambda value.
meanloss = mean(lossvals,2);
Plot the average loss values versus the lambda values.
figure()
plot(lambdavals,meanloss,'ro-')
xlabel('Lambda')
ylabel('Loss (MSE)')
grid on
Find the best lambda value that corresponds to the minimum average loss.
[~,idx] = min(meanloss) % Find the index
idx = 2
bestlambda = lambdavals(idx) % Find the best lambda value
bestlambda = 0.0066
bestloss = meanloss(idx)
bestloss = 0.0250
Fit the NCA model on all data using the best lambda and plot the feature weights
Use the solver sgd and standardize the predictor values.
nca = fscnca(Xtrain,ytrain,'FitMethod','exact','Solver','sgd',...
    'Lambda',bestlambda,'Standardize',true,'Verbose',1);
o Tuning initial learning rate: NumTuningIterations = 20, TuningSubsetSize = 100

|===============================================|
|    TUNING    | TUNING SUBSET |   LEARNING    |
|     ITER     |   FUN VALUE   |     RATE      |
|===============================================|
|            1 |  2.403497e+01 |  2.000000e-01 |
|            2 |  2.275050e+01 |  4.000000e-01 |
|            3 |  2.036845e+01 |  8.000000e-01 |
|            4 |  1.627647e+01 |  1.600000e+00 |
|            5 |  1.023512e+01 |  3.200000e+00 |
|            6 |  3.864283e+00 |  6.400000e+00 |
|            7 |  4.743816e-01 |  1.280000e+01 |
|            8 | -7.260138e-01 |  2.560000e+01 |
|            9 | -7.260138e-01 |  2.560000e+01 |
|           10 | -7.260138e-01 |  2.560000e+01 |
|           11 | -7.260138e-01 |  2.560000e+01 |
|           12 | -7.260138e-01 |  2.560000e+01 |
|           13 | -7.260138e-01 |  2.560000e+01 |
|           14 | -7.260138e-01 |  2.560000e+01 |
|           15 | -7.260138e-01 |  2.560000e+01 |
|           16 | -7.260138e-01 |  2.560000e+01 |
|           17 | -7.260138e-01 |  2.560000e+01 |
|           18 | -7.260138e-01 |  2.560000e+01 |
|           19 | -7.260138e-01 |  2.560000e+01 |
|           20 | -7.260138e-01 |  2.560000e+01 |

o Solver = SGD, MiniBatchSize = 10, PassLimit = 5

|==========================================================================================|
|   PASS   |   ITER   | AVG MINIBATCH | AVG MINIBATCH |   NORM STEP   |    LEARNING    |
|          |          |   FUN VALUE   |   NORM GRAD   |               |      RATE      |
|==========================================================================================|
|        0 |        9 |  4.016078e+00 |  2.835465e-02 |  5.395984e+00 |   2.560000e+01 |
|        1 |       19 | -6.726156e-01 |  6.111354e-02 |  5.021138e-01 |   1.280000e+01 |
|        1 |       29 | -8.316555e-01 |  4.024185e-02 |  1.196030e+00 |   1.280000e+01 |
|        2 |       39 | -8.838656e-01 |  2.333418e-02 |  1.225839e-01 |   8.533333e+00 |
|        3 |       49 | -8.669035e-01 |  3.413150e-02 |  3.421881e-01 |   6.400000e+00 |
|        3 |       59 | -8.906935e-01 |  1.946293e-02 |  2.232510e-01 |   6.400000e+00 |
|        4 |       69 | -8.778630e-01 |  3.561283e-02 |  3.290643e-01 |   5.120000e+00 |
|        4 |       79 | -8.857136e-01 |  2.516633e-02 |  3.902977e-01 |   5.120000e+00 |

Two norm of the final step = 3.903e-01
Relative two norm of the final step = 6.171e-03, TolX = 1.000e-06
EXIT: Iteration or pass limit reached.
Plot the feature weights.
figure()
plot(nca.FeatureWeights,'ro')
xlabel('Feature index')
ylabel('Feature weight')
grid on
Select features using the feature weights and a relative threshold.
tol = 0.02;
selidx = find(nca.FeatureWeights > tol*max(1,max(nca.FeatureWeights)))
selidx = 72×1
565
611
654
681
737
743
744
750
754
839
⋮
Compute the classification loss using the test set.
L = loss(nca,Xtest,ytest)
L = 0.0179
Classify observations using the selected features
Extract the features selected in the previous step from the training data.
features = Xtrain(:,selidx);
Apply a support vector machine classifier using the selected features to the reduced training set.
svmMdl = fitcsvm(features,ytrain);
Evaluate the accuracy of the trained classifier on the test data, which was not used to select features.
L = loss(svmMdl,Xtest(:,selidx),ytest)
L = single
0
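To examine individual predictions rather than only the aggregate loss, you can classify the test observations with the trained SVM and compare them against the true labels. This is a minimal sketch; confusionchart is assumed to be available (R2018b or later).
predlabels = predict(svmMdl,Xtest(:,selidx)); % Predict groups for the test observations
confusionchart(ytest,predlabels)              % Tabulate predicted vs. true groups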
X — Predictor variable values
Predictor variable values, specified as an n-by-p matrix, where n is the number of observations and p is the number of predictor variables.
Data Types: single | double
Y — Class labels
Class labels, specified as a categorical vector, logical vector, numeric vector, string array, cell array of character vectors of length n, or character matrix with n rows, where n is the number of observations. Element i or row i of Y is the class label corresponding to row i of X (observation i).
Data Types: single | double | logical | char | string | cell | categorical
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: 'Solver','sgd','Weights',W,'Lambda',0.0003 specifies the solver as stochastic gradient descent, the observation weights as the values in the vector W, and the regularization parameter as 0.0003.
'FitMethod' — Method for fitting the model
'exact' (default) | 'none' | 'average'
Method for fitting the model, specified as the comma-separated pair consisting of 'FitMethod' and one of the following:
'exact' — Performs fitting using all of the data.
'none' — No fitting. Use this option to evaluate the generalization error of the NCA model using the initial feature weights supplied in the call to fscnca.
'average' — Divides the data into partitions (subsets), fits each partition using the exact method, and returns the average of the feature weights. You can specify the number of partitions using the NumPartitions name-value pair argument.
Example: 'FitMethod','none'
'NumPartitions' — Number of partitions
max(2,min(10,n)) (default) | integer between 2 and n
Number of partitions to split the data into for use with the 'FitMethod','average' option, specified as the comma-separated pair consisting of 'NumPartitions' and an integer value between 2 and n, where n is the number of observations.
Example: 'NumPartitions',15
Data Types: double | single
'Lambda' — Regularization parameter
nonnegative scalar
Regularization parameter to prevent overfitting, specified as the comma-separated pair consisting of 'Lambda' and a nonnegative scalar.
As the number of observations n increases, the chance of overfitting decreases and the required amount of regularization also decreases. See Identify Relevant Features for Classification and Tune Regularization Parameter to Detect Features Using NCA for Classification to learn how to tune the regularization parameter.
Example: 'Lambda',0.002
Data Types: double | single
'LengthScale' — Width of the kernel
1 (default) | positive real scalar
Width of the kernel, specified as the comma-separated pair consisting of 'LengthScale' and a positive real scalar.
A length scale value of 1 is sensible when all predictors are on the same scale. If the predictors in X are of very different magnitudes, then consider standardizing the predictor values using 'Standardize',true and setting 'LengthScale',1.
Example: 'LengthScale',1.5
Data Types: double | single
'InitialFeatureWeights' — Initial feature weights
ones(p,1) (default) | p-by-1 vector of real positive scalars
Initial feature weights, specified as the comma-separated pair consisting of 'InitialFeatureWeights' and a p-by-1 vector of real positive scalars, where p is the number of predictors in the training data.
The regularized objective function for optimizing feature weights is nonconvex. As a result, using different initial feature weights can give different results. Setting all initial feature weights to 1 generally works well, but in some cases, random initialization using rand(p,1) can give better quality solutions.
Data Types: double | single
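For illustration only, a minimal sketch of random initialization, where X and y stand for the predictor matrix and class labels:
p = size(X,2);                                % Number of predictors
w0 = rand(p,1);                               % Random positive initial weights
mdl = fscnca(X,y,'InitialFeatureWeights',w0); % May find a better local solution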
'Weights' — Observation weights
n-by-1 vector of real positive scalars
Observation weights, specified as the comma-separated pair consisting of 'Weights' and an n-by-1 vector of real positive scalars. Use observation weights to specify higher importance of some observations compared to others. The default weights assign equal importance to all observations.
Data Types: double | single
'Prior' — Prior probabilities for each class
'empirical' (default) | 'uniform' | structure
Prior probabilities for each class, specified as the comma-separated pair consisting of 'Prior' and one of the following:
'empirical' — fscnca obtains the prior class probabilities from class frequencies.
'uniform' — fscnca sets all class probabilities equal.
Structure with two fields:
ClassProbs — Vector of class probabilities. If these are numeric values with a total greater than 1, fscnca normalizes them to add up to 1.
ClassNames — Class names corresponding to the class probabilities in ClassProbs.
Example: 'Prior','uniform'
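For illustration, a minimal sketch of the structure form; the class names 'b' and 'g' are hypothetical and must match the labels present in Y:
prior.ClassProbs = [0.6 0.4];    % Normalized to sum to 1 if the total exceeds 1
prior.ClassNames = {'b','g'};    % Hypothetical class names
mdl = fscnca(X,y,'Prior',prior);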
'Standardize' — Indicator for standardizing predictor data
false (default) | true
Indicator for standardizing the predictor data, specified as the comma-separated pair consisting of 'Standardize' and either false or true. For more information, see Impact of Standardization.
Example: 'Standardize',true
Data Types: logical
'Verbose' — Verbosity level indicator
Verbosity level indicator for the convergence summary display, specified as the comma-separated pair consisting of 'Verbose' and one of the following:
0 — No convergence summary
1 — Convergence summary, including the norm of the gradient and objective function values
> 1 — More convergence information, depending on the fitting algorithm
When using the 'minibatch-lbfgs' solver and verbosity level > 1, the convergence information includes the iteration log from intermediate mini-batch LBFGS fits.
Example: 'Verbose',1
Data Types: double | single
'Solver' — Solver type
'lbfgs' | 'sgd' | 'minibatch-lbfgs'
Solver type for estimating feature weights, specified as the comma-separated pair consisting of 'Solver' and one of the following:
'lbfgs' — Limited memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm
'sgd' — Stochastic gradient descent (SGD) algorithm
'minibatch-lbfgs' — Stochastic gradient descent with LBFGS algorithm applied to mini-batches
The default is 'lbfgs' for n ≤ 1000, and 'sgd' for n > 1000.
Example: 'Solver','minibatch-lbfgs'
'LossFunction' — Loss function
'classiferror' (default) | function handle
Loss function, specified as the comma-separated pair consisting of 'LossFunction' and one of the following.
'classiferror' — Misclassification error
@lossfun — Custom loss function handle. A loss function has this form:

function L = lossfun(Yu,Yv)
% calculation of loss
...

Yu is a u-by-1 vector and Yv is a v-by-1 vector. L is a u-by-v matrix of loss values such that L(i,j) is the loss value for Yu(i) and Yv(j).
The objective function for minimization includes the loss function l(y_i, y_j) as follows:

$$f(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}\sum_{\substack{j=1 \\ j\neq i}}^{n} p_{ij}\, l(y_i,y_j) + \lambda\sum_{r=1}^{p} w_r^{2}$$

where w is the feature weight vector, n is the number of observations, and p is the number of predictor variables. p_ij is the probability that x_j is the reference point for x_i. For details, see NCA Feature Selection for Classification.
Example: 'LossFunction',@lossfun
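As an illustration, here is a minimal sketch of a custom loss that reproduces the 0-1 misclassification loss. It assumes numeric class labels, and the name lossfun is only an example:

function L = lossfun(Yu,Yv)
% 0-1 loss: L(i,j) is 1 when Yu(i) and Yv(j) differ, and 0 otherwise
L = double(Yu(:) ~= reshape(Yv,1,[]));
end

Pass the handle in the call, for example mdl = fscnca(X,y,'LossFunction',@lossfun).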
'CacheSize' — Memory size
1000 (default) | positive integer
Memory size, in MB, to use for objective function and gradient computation, specified as the comma-separated pair consisting of 'CacheSize' and an integer.
Example: 'CacheSize',1500
Data Types: double | single
'HessianHistorySize' — Size of history buffer for Hessian approximation
15 (default) | positive integer
Size of the history buffer for Hessian approximation for the 'lbfgs' solver, specified as the comma-separated pair consisting of 'HessianHistorySize' and a positive integer. At each iteration the function uses the most recent HessianHistorySize iterations to build an approximation to the inverse Hessian.
Example: 'HessianHistorySize',20
Data Types: double | single
'InitialStepSize' — Initial step size
'auto' (default) | positive real scalar
Initial step size for the 'lbfgs' solver, specified as the comma-separated pair consisting of 'InitialStepSize' and a positive real scalar. By default, the function determines the initial step size automatically.
Data Types: double | single
'LineSearchMethod' — Line search method
'weakwolfe' (default) | 'strongwolfe' | 'backtracking'
Line search method, specified as the comma-separated pair consisting of 'LineSearchMethod' and one of the following:
'weakwolfe' — Weak Wolfe line search
'strongwolfe' — Strong Wolfe line search
'backtracking' — Backtracking line search
Example: 'LineSearchMethod','backtracking'
'MaxLineSearchIterations' — Maximum number of line search iterations
20 (default) | positive integer
Maximum number of line search iterations, specified as the comma-separated pair consisting of 'MaxLineSearchIterations' and a positive integer.
Example: 'MaxLineSearchIterations',25
Data Types: double | single
'GradientTolerance' — Relative convergence tolerance
1e-6 (default) | positive real scalar
Relative convergence tolerance on the gradient norm for the 'lbfgs' solver, specified as the comma-separated pair consisting of 'GradientTolerance' and a positive real scalar.
Example: 'GradientTolerance',0.000002
Data Types: double | single
'InitialLearningRate' — Initial learning rate for 'sgd' solver
'auto' (default) | positive real scalar
Initial learning rate for the 'sgd' solver, specified as the comma-separated pair consisting of 'InitialLearningRate' and a positive real scalar.
When using solver type 'sgd', the learning rate decays over iterations starting with the value specified for 'InitialLearningRate'.
The default 'auto' means that the initial learning rate is determined using experiments on small subsets of data. Use the NumTuningIterations name-value pair argument to specify the number of iterations for automatically tuning the initial learning rate. Use the TuningSubsetSize name-value pair argument to specify the number of observations to use for automatically tuning the initial learning rate.
For solver type 'minibatch-lbfgs', you can set 'InitialLearningRate' to a very high value. In this case, the function applies LBFGS to each mini-batch separately with initial feature weights from the previous mini-batch.
To make sure the chosen initial learning rate decreases the objective value with each iteration, plot the Iteration versus the Objective values saved in the mdl.FitInfo property.
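A minimal sketch of this diagnostic plot, assuming mdl is the model returned by fscnca:
figure
plot(mdl.FitInfo.Iteration,mdl.FitInfo.Objective,'b-') % Objective should decrease
xlabel('Iteration')
ylabel('Objective')
grid on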
You can use the refit method with 'InitialFeatureWeights' equal to mdl.FeatureWeights to start from the current solution and run additional iterations.
Example: 'InitialLearningRate',0.9
Data Types: double | single
'MiniBatchSize' — Number of observations to use in each batch for the 'sgd' solver
10 (default) | positive integer from 1 to n
Number of observations to use in each batch for the 'sgd' solver, specified as the comma-separated pair consisting of 'MiniBatchSize' and a positive integer from 1 to n.
Example: 'MiniBatchSize',25
Data Types: double | single
'PassLimit' — Maximum number of passes for solver 'sgd'
5 (default) | positive integer
Maximum number of passes through all n observations for solver 'sgd', specified as the comma-separated pair consisting of 'PassLimit' and a positive integer. Each pass through all of the data is called an epoch.
Example: 'PassLimit',10
Data Types: double | single
'NumPrint' — Frequency of batches for displaying convergence summary
Frequency of batches for displaying the convergence summary for the 'sgd' solver, specified as the comma-separated pair consisting of 'NumPrint' and a positive integer. This argument applies when the 'Verbose' value is greater than 0. NumPrint mini-batches are processed for each line of the convergence summary displayed on the command line.
Example: 'NumPrint',5
Data Types: double | single
'NumTuningIterations' — Number of tuning iterations
20 (default) | positive integer
Number of tuning iterations for the 'sgd' solver, specified as the comma-separated pair consisting of 'NumTuningIterations' and a positive integer. This option is valid only for 'InitialLearningRate','auto'.
Example: 'NumTuningIterations',15
Data Types: double | single
'TuningSubsetSize' — Number of observations to use for tuning initial learning rate
100 (default) | positive integer from 1 to n
Number of observations to use for tuning the initial learning rate, specified as the comma-separated pair consisting of 'TuningSubsetSize' and a positive integer value from 1 to n. This option is valid only for 'InitialLearningRate','auto'.
Example: 'TuningSubsetSize',25
Data Types: double | single
'IterationLimit' — Maximum number of iterations
positive integer
Maximum number of iterations, specified as the comma-separated pair consisting of 'IterationLimit' and a positive integer. The default is 10000 for SGD and 1000 for LBFGS and mini-batch LBFGS.
Each pass through a batch is an iteration. Each pass through all of the data is an epoch. If the data is divided into k mini-batches, then every epoch is equivalent to k iterations.
Example: 'IterationLimit',250
Data Types: double | single
'StepTolerance' — Convergence tolerance on the step size
positive real scalar
Convergence tolerance on the step size, specified as the comma-separated pair consisting of 'StepTolerance' and a positive real scalar. The 'lbfgs' solver uses an absolute step tolerance, and the 'sgd' solver uses a relative step tolerance.
Example: 'StepTolerance',0.000005
Data Types: double | single
'MiniBatchLBFGSIterations' — Maximum number of iterations per mini-batch LBFGS step
Maximum number of iterations per mini-batch LBFGS step, specified as the comma-separated pair consisting of 'MiniBatchLBFGSIterations' and a positive integer.
Example: 'MiniBatchLBFGSIterations',15
The mini-batch LBFGS algorithm is a combination of the SGD and LBFGS methods. Therefore, all of the name-value pair arguments that apply to the SGD and LBFGS solvers also apply to the mini-batch LBFGS algorithm.
Data Types: double | single
mdl — Neighborhood component analysis model for classification
FeatureSelectionNCAClassification object
Neighborhood component analysis model for classification, returned as a FeatureSelectionNCAClassification object.