Rank features for classification using minimum redundancy maximum relevance (MRMR) algorithm

ranks features (predictors) using the MRMR algorithm. The
table `idx`

= fscmrmr(`Tbl`

,`ResponseVarName`

)`Tbl`

contains a response variable and predictor variables, and
`ResponseVarName`

is the name of the response variable in
`Tbl`

. The function returns `idx`

, which contains
the indices of predictors ordered by predictor importance. You can use
`idx`

to select important predictors for classification
problems.

specifies additional options using one or more name-value pair arguments in addition to
any of the input argument combinations in the previous syntaxes. For example, you can
specify prior probabilities and observation weights.`idx`

= fscmrmr(___,`Name,Value`

)

Load the sample data.

`load ionosphere`

Rank the predictors based on importance.

[idx,scores] = fscmrmr(X,Y);

Create a bar plot of the predictor importance scores.

bar(scores(idx)) xlabel('Predictor rank') ylabel('Predictor importance score')

The drop in score between the first and second most important predictors is large, while the drops after the sixth predictor are relatively small. A drop in the importance score represents the confidence of feature selection. Therefore, the large drop implies that the software is confident of selecting the most important predictor. The small drops indicate that the difference in predictor importance are not significant.

Select the top five most important predictors. Find the columns of these predictors in `X`

.

idx(1:5)

`ans = `*1×5*
5 4 18 1 7

The fifth column of `X`

is the most important predictor of `Y`

.

Find important predictors by using `fscmrmr`

. Then compare the accuracies of the full classification model (which uses all the predictors) and a reduced model that uses the five most important predictors by using `testckfold`

.

Load the census1994 data set.

`load census1994`

The table `adultdata`

in `census1994`

contains demographic data from the US Census Bureau to predict whether an individual makes over $50,000 per year. Display the first three rows of the table.

head(adultdata,3)

`ans=`*3×15 table*
age workClass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country salary
___ ________________ __________ _________ _____________ __________________ _________________ _____________ _____ ____ ____________ ____________ ______________ ______________ ______
39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
38 Private 2.1565e+05 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K

The output arguments of `fscmrmr`

include only the variables ranked by the function. Before passing a table to the function, move the variables that you do not want to rank, including the response variable and weight, to the end of the table so that the order of the output arguments is consistent with the order of the table.

In the table `adultdata`

, the third column `fnlwgt`

is the weight of the samples, and the last column `salary`

is the response variable. Move `fnlwgt`

to the left of `salary`

by using the `movevars`

function.

adultdata = movevars(adultdata,'fnlwgt','before','salary'); head(adultdata,3)

`ans=`*3×15 table*
age workClass education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country fnlwgt salary
___ ________________ _________ _____________ __________________ _________________ _____________ _____ ____ ____________ ____________ ______________ ______________ __________ ______
39 State-gov Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States 77516 <=50K
50 Self-emp-not-inc Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States 83311 <=50K
38 Private HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States 2.1565e+05 <=50K

Rank the predictors in `adultdata`

. Specify the column `salary`

as the response variable.

[idx,scores] = fscmrmr(adultdata,'salary','Weight','fnlwgt');

Create a bar plot of predictor importance scores. Use the predictor names for the *x*-axis tick labels.

bar(scores(idx)) xlabel('Predictor rank') ylabel('Predictor importance score') xticklabels(strrep(adultdata.Properties.VariableNames(idx),'_','\_')) xtickangle(45)

The five most important predictors are `relationship`

, `capital_loss`

, `capital_gain`

, `education`

, and `hours_per_week`

.

Compare the accuracy of a classification tree trained with all predictors to the accuracy of one trained with the five most important predictors.

Create a classification tree template using the default options.

C = templateTree;

Define the table `tbl1`

to contain all predictors and the table `tbl2`

to contain the five most important predictors.

tbl1 = adultdata(:,adultdata.Properties.VariableNames(idx(1:13))); tbl2 = adultdata(:,adultdata.Properties.VariableNames(idx(1:5)));

Pass the classification tree template and the two tables to the `testckfold`

function. The function compares the accuracies of the two models by repeated cross-validation. Specify `'Alternative','greater'`

to test the null hypothesis that the model with all predictors is, at most, as accurate as the model with the five predictors. The `'greater'`

option is available when `'Test'`

is `'5x2t'`

(5-by-2 paired *t* test) or `'10x10t'`

(10-by-10 repeated cross-validation *t* test).

[h,p] = testckfold(C,C,tbl1,tbl2,adultdata.salary,'Weights',adultdata.fnlwgt,'Alternative','greater','Test','5x2t')

`h = `*logical*
0

p = 0.9981

`h`

equals 0 and the *p*-value is almost 1, indicating failure to reject the null hypothesis. Using the model with the five predictors does not result in loss of accuracy compared to the model with all the predictors.

Now train a classification tree using the selected predictors.

mdl = fitctree(adultdata,'salary ~ relationship + capital_loss + capital_gain + education + hours_per_week', ... 'Weights',adultdata.fnlwgt)

mdl = ClassificationTree PredictorNames: {1x5 cell} ResponseName: 'salary' CategoricalPredictors: [1 2] ClassNames: [<=50K >50K] ScoreTransform: 'none' NumObservations: 32561 Properties, Methods

`Tbl`

— Sample datatable

Sample data, specified as a table. Multicolumn variables and cell arrays other than cell arrays of character vectors are not allowed.

Each row of `Tbl`

corresponds to one observation, and each column
corresponds to one predictor variable. Optionally, `Tbl`

can contain
one additional column for the response variable. The response variable must be a
categorical, character, or string array, logical or numeric vector, or cell array of
character vectors. If the response variable is a character array, then each element of
the response variable must correspond to one row of the array.

If

`Tbl`

does not contain the response variable, then specify a response variable by using`Y`

. The length of the response variable and the number of rows in`Tbl`

must be equal.If

`Tbl`

contains the response variable, and you want to use all remaining variables in`Tbl`

as predictors, then specify the response variable by using`ResponseVarName`

.If

`Tbl`

contains the response variable, and you want to use only a subset of the remaining variables in`Tbl`

as predictors, then specify the subset of variables by using`formula`

.

If `fscmrmr`

uses a subset of variables in
`Tbl`

as predictors, then the function indexes the predictors using
only the subset. The values in the `'CategoricalPredictors'`

name-value pair argument and the output argument `idx`

do not count
the predictors that the function does not rank.

`fscmrmr`

considers `NaN`

, `''`

(empty character vector), `""`

(empty string),
`<missing>`

, and `<undefined>`

values in
`Tbl`

for a response variable to be missing values.
`fscmrmr`

do not use observations with missing values for a
response variable.

**Data Types: **`table`

`ResponseVarName`

— Response variable namecharacter vector or string scalar containing the name of variable in

`Tbl`

Response variable name, specified as a character vector or string scalar containing
the name of a variable in `Tbl`

.

You must specify `ResponseVarName`

as a character vector or string
scalar. For example, if the response variable `Y`

is stored as
`Tbl.Y`

, then specify it as `'Y'`

. Otherwise, the
software treats all columns of `Tbl`

, including `Y`

,
as predictors.

A good practice is to specify the order of the classes by using the
`ClassNames`

name-value pair argument.

**Data Types: **`char`

| `string`

`formula`

— Explanatory model of response variable and subset of predictor variablescharacter vector | string scalar

Explanatory model of the response variable and a subset of the predictor variables,
specified as a character vector or string scalar in the form
`'Y~X1+X2+X3'`

. In this form, `Y`

represents the
response variable, and `X1`

, `X2`

, and
`X3`

represent the predictor variables.

To specify a subset of variables in `Tbl`

as predictors, use a
formula. If you specify a formula, then the software does not use any variables in
`Tbl`

that do not appear in `formula`

.

The variable names in the formula must be both variable names in `Tbl`

(`Tbl.Properties.VariableNames`

) and valid MATLAB^{®} identifiers.

You can verify the variable names in `Tbl`

by using the `isvarname`

function. The following code returns logical `1`

(`true`

) for each variable that has a valid variable name.

cellfun(@isvarname,Tbl.Properties.VariableNames)

`Tbl`

are not valid, then convert them by using the
`matlab.lang.makeValidName`

function.Tbl.Properties.VariableNames = matlab.lang.makeValidName(Tbl.Properties.VariableNames);

**Data Types: **`char`

| `string`

`Y`

— Response variablenumeric vector | categorical vector | logical vector | character array | string array | cell array of character vectors

Response variable, specified as a numeric vector, categorical vector, logical
vector, character array, string array, or cell array of character vectors. Each row of
`Y`

represents the labels of the corresponding row of
`X`

.

`fscmrmr`

considers `NaN`

, `''`

(empty character vector), `""`

(empty string),
`<missing>`

, and `<undefined>`

values in
`Y`

to be missing values. `fscmrmr`

does not use
observations with missing values for `Y`

.

**Data Types: **`single`

| `double`

| `categorical`

| `logical`

| `char`

| `string`

| `cell`

`X`

— Predictor datanumeric matrix

Predictor data, specified as a numeric matrix. Each row of `X`

corresponds to one observation, and each column corresponds to one predictor
variable.

**Data Types: **`single`

| `double`

Specify optional
comma-separated pairs of `Name,Value`

arguments. `Name`

is
the argument name and `Value`

is the corresponding value.
`Name`

must appear inside quotes. You can specify several name and value
pair arguments in any order as
`Name1,Value1,...,NameN,ValueN`

.

`'CategoricalPredictors',[1 2],'Verbose',2`

specifies the first
two predictor variables as categorical variables and specifies the verbosity level as
2.`'CategoricalPredictors'`

— Categorical predictors listvector of positive integers | logical vector | character matrix | string array | cell array of character vectors |

`'all'`

Categorical predictors list, specified as the comma-separated pair consisting of
`'CategoricalPredictors'`

and one of the values in this
table.

Value | Description |
---|---|

Vector of positive integers | Each entry in the vector is an index value corresponding to the column
of the predictor data (`X` or `Tbl` )
that contains a categorical variable. |

Logical vector | A `true` entry means that the corresponding column of
predictor data (`X` or `Tbl` ) is a
categorical variable. |

Character matrix | Each row of the matrix is the name of a predictor variable. The names
must match the names in `Tbl` . Pad the names with extra
blanks so each row of the character matrix has the same length. |

String array or cell array of character vectors | Each element in the array is the name of a predictor variable. The
names must match the names in `Tbl` . |

`'all'` | All predictors are categorical. |

By default, if the predictor data is in a table
(`Tbl`

), `fscmrmr`

assumes that a variable is
categorical if it is a logical vector, unordered categorical vector, character array, string
array, or cell array of character vectors. If the predictor data is a matrix
(`X`

), `fscmrmr`

assumes that all predictors are
continuous. To identify any other predictors as categorical predictors, specify them by using
the `'CategoricalPredictors'`

name-value pair argument.

**Example: **`'CategoricalPredictors','all'`

**Data Types: **`single`

| `double`

| `logical`

| `char`

| `string`

| `cell`

`'ClassNames'`

— Names of classes to use for rankingcategorical array | character array | string array | logical vector | numeric vector | cell array of character vectors

Names of the classes to use for ranking, specified as the comma-separated pair
consisting of `'ClassNames'`

and a categorical, character, or string
array, a logical or numeric vector, or a cell array of character vectors.
`ClassNames`

must have the same data type as `Y`

or the response variable in `Tbl`

.

If `ClassNames`

is a character array, then each element must
correspond to one *row* of the array.

Use `'ClassNames'`

to:

Specify the order of the

`Prior`

dimensions that corresponds to the class order.Select a subset of classes for ranking. For example, suppose that the set of all distinct class names in

`Y`

is`{'a','b','c'}`

. To rank predictors using observations from classes`'a'`

and`'c'`

only, specify`'ClassNames',{'a','c'}`

.

The default value for `'ClassNames'`

is the set of all distinct
class names in `Y`

or the response variable in
`Tbl`

. The default `'ClassNames'`

value has
mathematical ordering if the response variable is ordinal. Otherwise, the default
value has alphabetical ordering.

**Example: **`'ClassNames',{'b','g'}`

**Data Types: **`categorical`

| `char`

| `string`

| `logical`

| `single`

| `double`

| `cell`

`'Prior'`

— Prior probabilities`'empirical'`

(default) | `'uniform'`

| vector of scalar values | structurePrior probabilities for each class, specified as the comma-separated pair consisting of
`'Prior'`

and one of the following:

Character vector or string scalar.

Vector (one scalar value for each class). To specify the class order for the corresponding elements of

`'Prior'`

, also specify the`ClassNames`

name-value pair argument.Structure

`S`

with two fields.`S.ClassNames`

contains the class names as a variable of the same type as the response variable in`Y`

or`Tbl`

.`S.ClassProbs`

contains a vector of corresponding probabilities.

If you set values for both `'Weights'`

and `'Prior'`

,
`fscmrmr`

normalizes the weights in each class to add up to
the value of the prior probability of the respective class.

**Example: **`'Prior','uniform'`

**Data Types: **`char`

| `string`

| `single`

| `double`

| `struct`

`'Weights'`

— Observation weights`ones(size(X,1),1)`

(default) | vector of scalar values | name of variable in `Tbl`

Observation weights, specified as the comma-separated pair consisting of `'Weights'`

and a vector of scalar values or the name of a variable in `Tbl`

. The software weights the observations in each row of `X`

or `Tbl`

with the corresponding value in `Weights`

. The size of `Weights`

must equal the number of rows in `X`

or `Tbl`

.

If you specify the input data as a table `Tbl`

, then
`Weights`

can be the name of a variable in `Tbl`

that contains a numeric vector. In this case, you must specify
`Weights`

as a character vector or string scalar. For example, if
the weights vector `W`

is stored as `Tbl.W`

, then
specify it as `'W'`

. Otherwise, the software treats all columns of
`Tbl`

, including `W`

, as predictors when training
the model.

`fscmrmr`

normalizes the weights in each class to add up to the value
of the prior probability of the respective class.

**Data Types: **`single`

| `double`

| `char`

| `string`

`'Verbose'`

— Verbosity level`0`

(default) | nonnegative integerVerbosity level, specified as the comma-separated pair consisting of
`'Verbose'`

and a nonnegative integer. The value of
`Verbose`

controls the amount of diagnostic information that the
software displays in the Command Window.

0 —

`fscmrmr`

does not display any diagnostic information.1 —

`fscmrmr`

displays the elapsed times for computing Mutual Information and ranking predictors.≥ 2 —

`fscmrmr`

displays the elapsed times and more messages related to computing mutual information. The amount of information increases as you increase the`'Verbose'`

value.

**Example: **`'Verbose',1`

**Data Types: **`single`

| `double`

`idx`

— Indices of predictors ordered by predictor importancenumeric vector

Indices of predictors in `X`

or `Tbl`

ordered
by predictor importance, returned as a 1-by-*p* numeric vector, where
*p* is the number of ranked predictors.

If `fscmrmr`

uses a subset of variables in
`Tbl`

as predictors, then the function indexes the predictors using
only the subset. For example, suppose `Tbl`

includes 10 columns and
you specify the last five columns of `Tbl`

as the predictor variables
using `formula`

. If `idx(3)`

is
`5`

, then the third most important predictor is the 10th column in
`Tbl`

, which is the fifth predictor in the subset. If you use
`X`

to specify predictor variables, then
`fscmrmr`

ranks all predictors in `X`

.
Therefore, if `idx(3)`

is `5`

, then the third most
important predictor is the fifth column in `X`

.

`scores`

— Predictor scoresnumeric vector

Predictor scores, returned as a 1-by-*p* numeric vector, where
*p* is the number of ranked predictors.

A large score value indicates that the corresponding predictor is important. Also, a
drop in the feature importance score represents the confidence of feature selection. For
example, if the software is confident of selecting a feature *x*, then
the score value of the next most important feature is much smaller than the score value
of *x*.

For example, suppose `Tbl`

includes 10 columns and you specify
the last five columns of `Tbl`

as the predictor variables using
`formula`

. Then, `score(3)`

contains the score
value of the 8th column in `Tbl`

, which is the third predictor in the
subset. If you use `X`

to specify predictor variables, then
`score(3)`

contains the score value of the third column in
`X`

.

The mutual information between two variables measures how much uncertainty of one variable can be reduced by knowing the other variable.

The mutual information *I* of the discrete random variables
*X* and *Z* is defined as

$$I\left(X,Z\right)={\displaystyle {\sum}_{i,j}P\left(X={x}_{i},Z={z}_{j}\right)\mathrm{log}\frac{P\left(X={x}_{i},Z={z}_{j}\right)}{P\left(X={x}_{i}\right)P\left(Z={z}_{j}\right)}}.$$

If *X* and *Z* are independent, then
*I* equals 0. If *X* and *Z* are the
same random variable, then *I* equals the entropy of
*X*.

The `fscmrmr`

function uses this definition to compute the mutual
information values for both categorical (discrete) and continuous variables.
`fscmrmr`

discretizes a continuous variable into 256 bins or the number
of unique values in the variable if it is less than 256. The function finds optimal
bivariate bins for each pair of variables using the adaptive algorithm [2].
`fscmrmr`

treats missing values in a categorical variable as an extra
category and places `NaN`

s in continuous variables in a separate
bin.

The MRMR algorithm[1] finds an optimal set of features that is mutually and maximally dissimilar and can represent the response variable effectively. The algorithm minimizes the redundancy of a feature set and maximizing the relevance of a feature set to the response variable. The algorithm quantifies the redundancy and relevance using the mutual information of variables—pairwise mutual information of features and mutual information of a feature and the response. You can use this algorithm for classification problems.

The goal of the MRMR algorithm is to find an optimal set *S* of
features that maximizes *V _{S}*, the relevance of

$${V}_{S}=\frac{1}{\left|S\right|}{\displaystyle {\sum}_{x\in S}I\left(x,y\right)},$$

$${W}_{S}=\frac{1}{{\left|S\right|}^{2}}{\displaystyle {\sum}_{x,z\in S}I\left(x,z\right)}.$$

*|S|* is the number of features in
*S*.

Finding an optimal set *S* requires considering all 2^{|Ω|} combinations, where *Ω* is the entire feature set.
Instead, the MRMR algorithm ranks features through the forward addition scheme, which
requires *O*(|*Ω*|·|*S*|) computations, by using the mutual information quotient (MIQ) value.

$${\text{MIQ}}_{x}=\frac{{V}_{x}}{{W}_{x}},$$

where *V _{x}* and

$${V}_{x}=I(x,y),$$

$${W}_{x}=\frac{1}{\left|S\right|}{\displaystyle {\sum}_{z\in S}I\left(x,z\right)}.$$

The `fscmrmr`

function ranks all features in *Ω* and
returns `idx`

(the indices of features ordered by feature importance)
using the MRMR algorithm. Therefore, the computation cost becomes *O*(|*Ω*|^{2}). The function quantifies the importance of a feature using a heuristic
algorithm and returns `score`

. A large score value indicates that the
corresponding predictor is important. Also, a drop in the feature importance score
represents the confidence of feature selection. For example, if the software is confident of
selecting a feature *x*, then the score value of the next most important
feature is much smaller than the score value of *x*. You can use the
outputs to find an optimal set *S* for a given number of features.

`fscmrmr`

ranks features as follows:

Select the feature with the largest relevance, $$\underset{x\in \Omega}{\mathrm{max}}{V}_{x}$$. Add the selected feature to an empty set

*S*.Find the features with nonzero relevance and zero redundancy in the complement of

*S*,*S*.^{c}If

*S*does not include a feature with nonzero relevance and zero redundancy, go to step 4.^{c}Otherwise, select the feature with the largest relevance, $$\underset{x\in {S}^{c},\text{\hspace{0.17em}}{W}_{x}=0}{\mathrm{max}}{V}_{x}$$. Add the selected feature to the set

*S*.

Repeat Step 2 until the redundancy is not zero for all features in

*S*.^{c}Select the feature that has the largest MIQ value with nonzero relevance and nonzero redundancy in

*S*, and add the selected feature to the set^{c}*S*.$$\underset{x\in {S}^{c}}{\mathrm{max}}{\text{MIQ}}_{x}=\underset{x\in {S}^{c}}{\mathrm{max}}\frac{I(x,y)}{\frac{1}{\left|S\right|}{\displaystyle {\sum}_{z\in S}I\left(x,z\right)}}.$$

Repeat Step 4 until the relevance is zero for all features in

*S*.^{c}Add the features with zero relevance to

*S*in random order.

The software can skip any step if it cannot find a feature that satisfies the conditions described in the step.

[1] Ding, C., and H. Peng. "Minimum
redundancy feature selection from microarray gene expression data." *Journal of
Bioinformatics and Computational Biology.* Vol. 3, Number 2, 2005, pp.
185–205.

[2] Darbellay, G. A., and I. Vajda.
"Estimation of the information by an adaptive partitioning of the observation space."
*IEEE Transactions on Information Theory.* Vol. 45, Number 4, 1999, pp.
1315–1321.

A modified version of this example exists on your system. Do you want to open this version instead?

You clicked a link that corresponds to this MATLAB command:

Run the command by entering it in the MATLAB Command Window. Web browsers do not support MATLAB commands.

Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .

Select web siteYou can also select a web site from the following list:

Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.

- América Latina (Español)
- Canada (English)
- United States (English)

- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)

- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom (English)