# Documentation

## Ensemble Methods

### Framework for Ensemble Learning

You have several methods for melding results from many weak learners into one high-quality ensemble predictor. These methods closely follow the same syntax, so you can try different methods with minor changes in your commands.

Create an ensemble with the `fitensemble` function. Its syntax is

`ens = fitensemble(X,Y,model,numberens,learners)`
• `X` is the matrix of data. Each row contains one observation, and each column contains one predictor variable.

• `Y` is the vector of responses, with the same number of observations as the rows in `X`.

• `model` is a string naming the type of ensemble.

• `numberens` is the number of weak learners in `ens` from each element of `learners`. So the number of elements in `ens` is `numberens` times the number of elements in `learners`.

• `learners` is either a string naming a weak learner, a weak learner template, or a cell array of such templates.

Pictorially, here is the information you need to create an ensemble:

For all classification or nonlinear regression problems, follow these steps to create an ensemble:

#### Put Predictor Data in a Matrix

All supervised learning methods start with a data matrix, usually called `X` in this documentation. Each row of `X` represents one observation. Each column of `X` represents one variable, or predictor.

#### Prepare the Response Data

You can use a wide variety of data types for response data.

• For regression ensembles, `Y` must be a numeric vector with the same number of elements as the number of rows of `X`.

• For classification ensembles, `Y` can be any of the following data types. This table also contains the method of including missing entries.

Data TypeMissing Entry
Numeric vector`NaN`
Categorical vector`<undefined>`
Character arrayRow of spaces
Cell array of strings`''`
Logical vector(not possible to represent)

`fitensemble` ignores missing values in `Y` when creating an ensemble.

For example, suppose your response data consists of three observations in the following order: `true`, `false`, `true`. You could express `Y` as:

• `[1;0;1]` (numeric vector)

• `nominal({'true','false','true'})` (categorical vector)

• `[true;false;true]` (logical vector)

• `['true ';'false';'true ']` (character array, padded with spaces so each row has the same length)

• `{'true','false','true'}` (cell array of strings)

Use whichever data type is most convenient. Because you cannot represent missing values with logical entries, do not use logical entries when you have missing values in `Y`.

#### Choose an Applicable Ensemble Method

`fitensemble` uses one of these algorithms to create an ensemble.

• For classification with two classes:

• `'AdaBoostM1'`

• `'LogitBoost'`

• `'GentleBoost'`

• `'RobustBoost'` (requires an Optimization Toolbox™ license)

• `'LPBoost'` (requires an Optimization Toolbox license)

• `'TotalBoost'` (requires an Optimization Toolbox license)

• `'RUSBoost'`

• `'Subspace'`

• `'Bag'`

• For classification with three or more classes:

• `'AdaBoostM2'`

• `'LPBoost'` (requires an Optimization Toolbox license)

• `'TotalBoost'` (requires an Optimization Toolbox license)

• `'RUSBoost'`

• `'Subspace'`

• `'Bag'`

• For regression:

• `'LSBoost'`

• `'Bag'`

`'Bag'` applies to all methods. When using `'Bag'`, indicate whether you want a classifier or regressor with the `type` name-value pair set to `'classification'` or `'regression'`.

For descriptions of the various algorithms, see Ensemble Algorithms.

This table lists characteristics of the various algorithms. In the table titles:

• `Regress.` — Regression

• `Classif.` — Classification

• `Preds.` — Predictors

• `Imbalance` — Good for imbalanced data (one class has many more observations than the other)

• `Stop` — Algorithm self-terminates

• `Sparse` — Requires fewer weak learners than other ensemble algorithms

AlgorithmRegress.Binary Classif.Binary Classif. Multi- Level Preds.Classif. 3+ ClassesClass ImbalanceStopSparse
`Bag`×× ×
`AdaBoostM1` ×
`AdaBoostM2`   ×
`LogitBoost` ××
`GentleBoost` ××
`RobustBoost` ×
`LPBoost` × × ××
`TotalBoost` × × ××
`RUSBoost` × ××
`LSBoost`×
`Subspace` × ×

`RobustBoost`, `LPBoost`, and `TotalBoost` require an Optimization Toolbox license. Try `TotalBoost` before `LPBoost`, as `TotalBoost` can be more robust.

Suggestions for Choosing an Appropriate Ensemble Algorithm.

• Regression — Your choices are `LSBoost` or `Bag`. See General Characteristics of Ensemble Algorithms for the main differences between boosting and bagging.

• Binary Classification — Try `AdaBoostM1` first, with these modifications:

Data CharacteristicRecommended Algorithm
Many predictors`Subspace`
Skewed data (many more observations of one class)`RUSBoost`
Categorical predictors with over 31 levels`LogitBoost` or `GentleBoost`
Label noise (some training data has the wrong class)`RobustBoost`
Many observationsAvoid `LPBoost`, `TotalBoost`, and `Bag`

• Multiclass Classification — Try `AdaBoostM2` first, with these modifications:

Data CharacteristicRecommended Algorithm
Many predictors`Subspace`
Skewed data (many more observations of one class)`RUSBoost`
Many observationsAvoid `LPBoost`, `TotalBoost`, and `Bag`

For details of the algorithms, see Ensemble Algorithms.

General Characteristics of Ensemble Algorithms.

• `Bag` generally constructs deep trees. This construction is both time consuming and memory-intensive. This also leads to relatively slow predictions.

• `Boost` algorithms generally use very shallow trees. This construction uses relatively little time or memory. However, for effective predictions, boosted trees might need more ensemble members than bagged trees. Therefore it is not always clear which class of algorithms is superior.

• `Bag` can estimate the generalization error without additional cross validation. See `oobLoss`.

• Except for `Subspace`, all boosting and bagging algorithms are based on tree learners. `Subspace` can use either discriminant analysis or k-nearest neighbor learners.

For details of the characteristics of individual ensemble members, see Characteristics of Classification Algorithms.

#### Set the Number of Ensemble Members

Choosing the size of an ensemble involves balancing speed and accuracy.

• Larger ensembles take longer to train and to generate predictions.

• Some ensemble algorithms can become overtrained (inaccurate) when too large.

To set an appropriate size, consider starting with several dozen to several hundred members in an ensemble, training the ensemble, and then checking the ensemble quality, as in Test Ensemble Quality. If it appears that you need more members, add them using the `resume` method (classification) or the `resume` method (regression). Repeat until adding more members does not improve ensemble quality.

 Tip   For classification, the `LPBoost` and `TotalBoost` algorithms are self-terminating, meaning you do not have to investigate the appropriate ensemble size. Try setting `numberens` to `500`. The algorithms usually terminate with fewer members.

#### Prepare the Weak Learners

Currently the weak learner types are:

• `'Discriminant'` (recommended for `Subspace` ensemble)

• `'KNN'` (only for `Subspace` ensemble)

• `'Tree'` (for any ensemble except `Subspace`)

There are two ways to set the weak learner type in the ensemble.

• To create an ensemble with default weak learner options, pass in the string as the weak learner. For example:

```ens = fitensemble(X,Y,'AdaBoostM2',50,'Tree'); % or ens = fitensemble(X,Y,'Subspace',50,'KNN');```
• To create an ensemble with nondefault weak learner options, create a nondefault weak learner using the appropriate `template` method. For example, if you have missing data, and want to use trees with surrogate splits for better accuracy:

```templ = templateTree('Surrogate','all'); ens = fitensemble(X,Y,'AdaBoostM2',50,templ);```

To grow trees with leaves containing a number of observations that is at least 10% of the sample size:

```templ = templateTree('MinLeafSize',size(X,1)/10); ens = fitensemble(X,Y,'AdaBoostM2',50,templ);```

Alternatively, choose the maximal number of splits per tree:

```templ = templateTree('MaxNumSplits',4); ens = fitensemble(X,Y,'AdaBoostM2',50,templ);```

While you can give `fitensemble` a cell array of learner templates, the most common usage is to give just one weak learner template.

For examples using a template, see Example: Unequal Classification Costs and Surrogate Splits.

Decision trees can handle `NaN` values in `X`. Such values are called "missing". If you have some missing values in a row of `X`, a decision tree finds optimal splits using nonmissing values only. If an entire row consists of `NaN`, `fitensemble` ignores that row. If you have data with a large fraction of missing values in `X`, use surrogate decision splits. For examples of surrogate splits, see Example: Unequal Classification Costs and Surrogate Splits.

Common Settings for Tree Weak Learners.

• The depth of a weak learner tree makes a difference for training time, memory usage, and predictive accuracy. You control the depth these parameters:

• `MaxNumSplits` — The maximal number of branch node splits is `MaxNumSplits` per tree. Set large values of `MaxNumSplits` to get deep trees. The default for bagging is `size(X,1) - 1`. The default for boosting is `1`.

• `MinLeafSize` — Each leaf has at least `MinLeafSize` observations. Set small values of `MinLeafSize` to get deep trees. The default for classification is `1` and `5` for regression.

• `MinParentSize` — Each branch node in the tree has at least `MinParentSize` observations. Set small values of `MinParentSize` to get deep trees. The default for classification is `2` and `10` for regression.

If you supply both `MinParentSize` and `MinLeafSize`, the learner uses the setting that gives larger leaves (shallower trees):

`MinParent = max(MinParent,2*MinLeaf)`

If you additionally supply `MaxNumSplits`, then the software splits a tree until one of the three splitting criteria is satisfied.

• `Surrogate` — Grow decision trees with surrogate splits when `Surrogate` is `'on'`. Use surrogate splits when your data has missing values.

 Note:   Surrogate splits cause slower training and use more memory.

#### Call fitensemble

The syntax of `fitensemble` is:

`ens = fitensemble(X,Y,model,numberens,learners)`
• `X` is the matrix of data. Each row contains one observation, and each column contains one predictor variable.

• `Y` is the responses, with the same number of observations as rows in `X`.

• `model` is a string naming the type of ensemble.

• `numberens` is the number of weak learners in `ens` from each element of `learners`. The number of elements in `ens` is `numberens` times the number of elements in `learners`.

• `learners` is a string naming a weak learner, a weak learner template, or a cell array of such strings and templates.

The result of `fitensemble` is an ensemble object, suitable for making predictions on new data. For a basic example of creating a classification ensemble, see Train a Classification Ensemble. For a basic example of creating a regression ensemble, see Train a Regression Ensemble.

Where to Set Name-Value Pairs.  There are several name-value pairs you can pass to `fitensemble`, and several that apply to the weak learners (`templateDiscriminant`, `templateKNN`, and `templateTree`). To determine which name-value pair argument is appropriate, the ensemble or the weak learner:

• Use template name-value pairs to control the characteristics of the weak learners.

• Use `fitensemble` name-value pair arguments to control the ensemble as a whole, either for algorithms or for structure.

For example, for an ensemble of boosted classification trees with each tree deeper than the default, set the `templateTree` name-value pair arguments `MinLeafSize` and `MinParentSize` to smaller values than the defaults. Or, `MaxNumSplits` to a larger value than the defaults. The trees are then leafier (deeper).

To name the predictors in the ensemble (part of the structure of the ensemble), use the `PredictorNames` name-value pair in `fitensemble`.

### Basic Ensemble Examples

#### Train a Classification Ensemble

This example shows how to create a classification tree ensemble for the Fisher iris data, and use it to predict the classification of a flower with average measurements.

```load fisheriris ```

The predictor data is the `meas` matrix and the response data is in the `species` cell array of strings.

For classification trees with three or more classes, Suggestions for Choosing an Appropriate Ensemble Algorithm suggests using the `AdaBoostM2` algorithm.

For this example, arbitrarily choose an ensemble of 100 trees, and use the default tree options.

Train an ensemble of classification trees.

```Mdl = fitensemble(meas,species,'AdaBoostM2',100,'Tree') ```
```Mdl = classreg.learning.classif.ClassificationEnsemble PredictorNames: {'x1' 'x2' 'x3' 'x4'} ResponseName: 'Y' ClassNames: {'setosa' 'versicolor' 'virginica'} ScoreTransform: 'none' NumObservations: 150 NumTrained: 100 Method: 'AdaBoostM2' LearnerNames: {'Tree'} ReasonForTermination: 'Terminated normally after completing the reques...' FitInfo: [100x1 double] FitInfoDescription: {2x1 cell} ```

`Mdl` is a `ClassificationEnsemble` model.

Predict the classification of a flower with average measurements.

```flower = predict(Mdl,mean(meas)) ```
```flower = 'versicolor' ```

#### Train a Regression Ensemble

This example shows how to create a regression ensemble to predict mileage of cars based on their horsepower and weight, trained on the `carsmall` data.

Load the `carsmall` data set.

```load carsmall ```

Prepare the predictor data.

```X = [Horsepower Weight]; ```

The response data is `MPG`. The only available boosted regression ensemble type is `LSBoost`. For this example, arbitrarily choose an ensemble of 100 trees, and use the default tree options.

Train an ensemble of regression trees.

```Mdl = fitensemble(X,MPG,'LSBoost',100,'Tree') ```
```Mdl = classreg.learning.regr.RegressionEnsemble PredictorNames: {'x1' 'x2'} ResponseName: 'Y' ResponseTransform: 'none' NumObservations: 94 NumTrained: 100 Method: 'LSBoost' LearnerNames: {'Tree'} ReasonForTermination: 'Terminated normally after completing the reques...' FitInfo: [100x1 double] FitInfoDescription: {2x1 cell} Regularization: [] ```

Predict the mileage of a car with 150 horsepower weighing 2750 lbs.

```mileage = predict(Mdl,[150 2750]) ```
```mileage = 22.4236 ```

### Test Ensemble Quality

Usually you cannot evaluate the predictive quality of an ensemble based on its performance on training data. Ensembles tend to "overtrain," meaning they produce overly optimistic estimates of their predictive power. This means the result of `resubLoss` for classification (`resubLoss` for regression) usually indicates lower error than you get on new data.

To obtain a better idea of the quality of an ensemble, use one of these methods:

• Evaluate the ensemble on an independent test set (useful when you have a lot of training data).

• Evaluate the ensemble by cross validation (useful when you don't have a lot of training data).

• Evaluate the ensemble on out-of-bag data (useful when you create a bagged ensemble with `fitensemble`).

#### Test Ensemble Quality

This example uses a bagged ensemble so it can use all three methods of evaluating ensemble quality.

Generate an artificial dataset with 20 predictors. Each entry is a random number from 0 to 1. The initial classification is if and otherwise.

```rng(1,'twister') % for reproducibility X = rand(2000,20); Y = sum(X(:,1:5),2) > 2.5; ```

In addition, to add noise to the results, randomly switch 10% of the classifications:

```idx = randsample(2000,200); Y(idx) = ~Y(idx); ```

Independent Test Set

Create independent training and test sets of data. Use 70% of the data for a training set by calling `cvpartition` using the `holdout` option:

```cvpart = cvpartition(Y,'holdout',0.3); Xtrain = X(training(cvpart),:); Ytrain = Y(training(cvpart),:); Xtest = X(test(cvpart),:); Ytest = Y(test(cvpart),:); ```

Create a bagged classification ensemble of 200 trees from the training data:

```bag = fitensemble(Xtrain,Ytrain,'Bag',200,'Tree',... 'Type','Classification') ```
```bag = classreg.learning.classif.ClassificationBaggedEnsemble PredictorNames: {1x20 cell} ResponseName: 'Y' ClassNames: [0 1] ScoreTransform: 'none' NumObservations: 1400 NumTrained: 200 Method: 'Bag' LearnerNames: {'Tree'} ReasonForTermination: 'Terminated normally after completing the reques...' FitInfo: [] FitInfoDescription: 'None' FResample: 1 Replace: 1 UseObsForLearner: [1400x200 logical] ```

Plot the loss (misclassification) of the test data as a function of the number of trained trees in the ensemble:

```figure; plot(loss(bag,Xtest,Ytest,'mode','cumulative')); xlabel('Number of trees'); ylabel('Test classification error'); ```

Cross Validation

Generate a five-fold cross-validated bagged ensemble:

```cv = fitensemble(X,Y,'Bag',200,'Tree',... 'type','classification','kfold',5) ```
```cv = classreg.learning.partition.ClassificationPartitionedEnsemble CrossValidatedModel: 'Bag' PredictorNames: {1x20 cell} ResponseName: 'Y' NumObservations: 2000 KFold: 5 Partition: [1x1 cvpartition] NumTrainedPerFold: [200 200 200 200 200] ClassNames: [0 1] ScoreTransform: 'none' ```

Examine the cross-validation loss as a function of the number of trees in the ensemble:

```figure; plot(loss(bag,Xtest,Ytest,'mode','cumulative')); hold on; plot(kfoldLoss(cv,'mode','cumulative'),'r.'); hold off; xlabel('Number of trees'); ylabel('Classification error'); legend('Test','Cross-validation','Location','NE'); ```

Cross validating gives comparable estimates to those of the independent set.

Out-of-Bag Estimates

Generate the loss curve for out-of-bag estimates, and plot it along with the other curves:

```figure; plot(loss(bag,Xtest,Ytest,'mode','cumulative')); hold on; plot(kfoldLoss(cv,'mode','cumulative'),'r.'); plot(oobLoss(bag,'mode','cumulative'),'k--'); hold off; xlabel('Number of trees'); ylabel('Classification error'); legend('Test','Cross-validation','Out of bag','Location','NE'); ```

The out-of-bag estimates are again comparable to those of the other methods.

### Classification with Imbalanced Data

This example shows how to classify when one class has many more observations than another. Try the `RUSBoost` algorithm first, because it is designed to handle this case.

This example uses the "Cover type" data from the UCI machine learning archive, described in `http://archive.ics.uci.edu/ml/datasets/Covertype`. The data classifies types of forest (ground cover), based on predictors such as elevation, soil type, and distance to water. The data has over 500,000 observations and over 50 predictors, so training and using a classifier is time consuming.

Blackard and Dean [3] describe a neural net classification of this data. They quote a 70.6% classification accuracy. `RUSBoost` obtains over 76% classification accuracy; see steps 6 and 7.

Step 1. Obtain the data.

`urlwrite('http://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz','forestcover.gz');`

Then, extract the data from the `forestcover.gz` file. The data is in the `covtype.data` file.

Step 2. Import the data and prepare it for classification.

Import the data into your workspace. Extract the last data column into a variable named `Y`.

```load covtype.data Y = covtype(:,end); covtype(:,end) = [];```

Step 3. Examine the response data.

`tabulate(Y)`
``` Value Count Percent 1 211840 36.46% 2 283301 48.76% 3 35754 6.15% 4 2747 0.47% 5 9493 1.63% 6 17367 2.99% 7 20510 3.53%```

There are hundreds of thousands of data points. Those of class `4` are less than 0.5% of the total. This imbalance indicates that `RUSBoost` is an appropriate algorithm.

Step 4. Partition the data for quality assessment.

Use half the data to fit a classifier, and half to examine the quality of the resulting classifier.

```part = cvpartition(Y,'holdout',0.5); istrain = training(part); % data for fitting istest = test(part); % data for quality assessment tabulate(Y(istrain))```
``` Value Count Percent 1 105920 36.46% 2 141651 48.76% 3 17877 6.15% 4 1374 0.47% 5 4746 1.63% 6 8683 2.99% 7 10255 3.53%```

Step 5. Create the ensemble.

Use deep trees for higher ensemble accuracy. To do so, set the trees to have minimal leaf size of `5`. Set `LearnRate` to `0.1` in order to achieve higher accuracy as well. The data is large, and, with deep trees, creating the ensemble is time consuming.

```t = templateTree('MinLeafSize',5); tic rusTree = fitensemble(covtype(istrain,:),Y(istrain),'RUSBoost',1000,t,... 'LearnRate',0.1,'nprint',100); toc```
```Training RUSBoost... Grown weak learners: 100 Grown weak learners: 200 Grown weak learners: 300 Grown weak learners: 400 Grown weak learners: 500 Grown weak learners: 600 Grown weak learners: 700 Grown weak learners: 800 Grown weak learners: 900 Grown weak learners: 1000 Elapsed time is 918.258401 seconds.```

Step 6. Inspect the classification error.

Plot the classification error against the number of members in the ensemble.

```figure; tic plot(loss(rusTree,covtype(istest,:),Y(istest),'mode','cumulative')); toc grid on; xlabel('Number of trees'); ylabel('Test classification error');```
`Elapsed time is 775.646935 seconds.`

The ensemble achieves a classification error of under 24% using 150 or more trees. It achieves the lowest error for 400 or more trees.

Examine the confusion matrix for each class as a percentage of the true class.

```tic Yfit = predict(rusTree,covtype(istest,:)); toc tab = tabulate(Y(istest)); bsxfun(@rdivide,confusionmat(Y(istest),Yfit),tab(:,2))*100```
```Elapsed time is 427.293168 seconds. ans = Columns 1 through 6 83.3771 7.4056 0.0736 0 1.7051 0.2681 18.3156 66.4652 2.1193 0.0162 9.3435 2.8239 0 0.0839 90.8038 2.3885 0.6545 6.0693 0 0 2.4763 95.8485 0 1.6752 0 0.2739 0.6530 0 98.6518 0.4213 0 0.1036 3.8346 1.1400 0.4030 94.5187 0.2340 0 0 0 0.0195 0 Column 7 7.1705 0.9163 0 0 0 0 99.7465```

All classes except class 2 have over 80% classification accuracy, and classes 3 through 7 have over 90% accuracy. But class 2 makes up close to half the data, so the overall accuracy is not that high.

Step 7. Compact the ensemble.

The ensemble is large. Remove the data using the `compact` method.

```cmpctRus = compact(rusTree); sz(1) = whos('rusTree'); sz(2) = whos('cmpctRus'); [sz(1).bytes sz(2).bytes]```
```ans = 1.0e+09 * 1.6947 0.9790```

The compacted ensemble is about half the size of the original.

Remove half the trees from `cmpctRus`. This action is likely to have minimal effect on the predictive performance, based on the observation that 400 out of 1000 trees give nearly optimal accuracy.

```cmpctRus = removeLearners(cmpctRus,[500:1000]); sz(3) = whos('cmpctRus'); sz(3).bytes```
```ans = 475495669```

The reduced compact ensemble takes about a quarter the memory of the full ensemble. Its overall loss rate is under 24%:

`L = loss(cmpctRus,covtype(istest,:),Y(istest))`
```L = 0.2326```

The predictive accuracy on new data might differ, because the ensemble accuracy might be biased. The bias arises because the same data used for assessing the ensemble was used for reducing the ensemble size. To obtain an unbiased estimate of requisite ensemble size, you should use cross validation. However, that procedure is time consuming.

### Classification: Imbalanced Data or Unequal Misclassification Costs

In many real-world applications, you might prefer to treat classes in your data asymmetrically. For example, you might have data with many more observations of one class than of any other. Or you might work on a problem in which misclassifying observations of one class has more severe consequences than misclassifying observations of another class. In such situations, you can use two optional parameters for `fitensemble`: `prior` and `cost`.

By using `prior`, you set prior class probabilities (that is, class probabilities used for training). Use this option if some classes are under- or overrepresented in your training set. For example, you might obtain your training data by simulation. Because simulating class `A` is more expensive than class `B`, you opt to generate fewer observations of class `A` and more observations of class `B`. You expect, however, that class `A` and class `B` are mixed in a different proportion in the real world. In this case, set prior probabilities for class `A` and `B` approximately to the values you expect to observe in the real world. `fitensemble` normalizes prior probabilities to make them add up to `1`; multiplying all prior probabilities by the same positive factor does not affect the result of classification.

If classes are adequately represented in the training data but you want to treat them asymmetrically, use the `cost` parameter. Suppose you want to classify benign and malignant tumors in cancer patients. Failure to identify a malignant tumor (false negative) has far more severe consequences than misidentifying benign as malignant (false positive). You should assign high cost to misidentifying malignant as benign and low cost to misidentifying benign as malignant.

You must pass misclassification costs as a square matrix with nonnegative elements. Element `C(i,j)` of this matrix is the cost of classifying an observation into class `j` if the true class is `i`. The diagonal elements `C(i,i)` of the cost matrix must be `0`. For the previous example, you can choose malignant tumor to be class 1 and benign tumor to be class 2. Then you can set the cost matrix to

$\left[\begin{array}{cc}0& c\\ 1& 0\end{array}\right]$

where c > 1 is the cost of misidentifying a malignant tumor as benign. Costs are relative—multiplying all costs by the same positive factor does not affect the result of classification.

If you have only two classes, `fitensemble` adjusts their prior probabilities using ${\stackrel{˜}{P}}_{i}={C}_{ij}{P}_{i}$for class i = 1,2 and j ≠ i. Pi are prior probabilities either passed into `fitensemble` or computed from class frequencies in the training data, and ${\stackrel{˜}{P}}_{i}$ are adjusted prior probabilities. Then `fitensemble` uses the default cost matrix

$\left[\begin{array}{cc}0& 1\\ 1& 0\end{array}\right]$

and these adjusted probabilities for training its weak learners. Manipulating the cost matrix is thus equivalent to manipulating the prior probabilities.

If you have three or more classes, `fitensemble` also converts input costs into adjusted prior probabilities. This conversion is more complex. First, `fitensemble` attempts to solve a matrix equation described in Zhou and Liu [31]. If it fails to find a solution, `fitensemble` applies the "average cost" adjustment described in Breiman et al. [10]. For more information, see Zadrozny, Langford, and Abe [30].

#### Example: Unequal Classification Costs

This example uses data on patients with hepatitis to see if they live or die as a result of the disease. The data set is described at `http://archive.ics.uci.edu/ml/datasets/Hepatitis`.

1. Read the hepatitis data set from the UCI repository as a character array. Then convert the result to a cell array of strings using `textscan`. Specify a cell array of strings containing the variable names.

```hepatitis = textscan(urlread(['http://archive.ics.uci.edu/ml/' ... 'machine-learning-databases/hepatitis/hepatitis.data']),... '%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f','TreatAsEmpty','?',... 'Delimiter',','); size(hepatitis) VarNames = {'dieOrLive' 'age' 'sex' 'steroid' 'antivirals' 'fatigue' ... 'malaise' 'anorexia' 'liverBig' 'liverFirm' 'spleen' ... 'spiders' 'ascites' 'varices' 'bilirubin' 'alkPhosphate' 'sgot' ... 'albumin' 'protime' 'histology'};```
```ans = 1 20```

`hepatitis` is a `1`-by-`20` cell array of strings. The cells correspond to the response (`liveOrDie`) and 19 heterogeneous predictors.

2. Specify a numeric matrix containing the predictors and a cell vector containing the strings `'Die'` and `'Live'`, which are response categories. The response contains two values: `1` indicates that a patient died, and `2` indicates that a patient lived. Specify a cell vector of strings for the response using the response categories. The first variable in `hepatitis` contains the response.

```X = cell2mat(hepatitis(2:end)); ClassNames = {'Die' 'Live'}; Y = ClassNames(hepatitis{:,1});```

`X` is a numeric matrix containing the 19 predictors. `Y` is a cell array of strings containing the response.

3. Inspect the data for missing values.

```figure; barh(sum(isnan(X),1)/size(X,1)); h = gca; h.YTick = 1:numel(VarNames) - 1; h.YTickLabel = VarNames(2:end); ylabel 'Predictor'; xlabel 'Fraction of missing values';```

Most predictors have missing values, and one has nearly 45% of the missing values. Therefore, use decision trees with surrogate splits for better accuracy. Because the data set is small, training time with surrogate splits should be tolerable.

4. Create a classification tree template that uses surrogate splits.

```rng(0,'twister') % for reproducibility t = templateTree('surrogate','all');```
5. Examine the data or the description of the data to see which predictors are categorical.

`X(1:5,:)`
```ans = Columns 1 through 6 30.0000 2.0000 1.0000 2.0000 2.0000 2.0000 50.0000 1.0000 1.0000 2.0000 1.0000 2.0000 78.0000 1.0000 2.0000 2.0000 1.0000 2.0000 31.0000 1.0000 NaN 1.0000 2.0000 2.0000 34.0000 1.0000 2.0000 2.0000 2.0000 2.0000 Columns 7 through 12 2.0000 1.0000 2.0000 2.0000 2.0000 2.0000 2.0000 1.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 Columns 13 through 18 2.0000 1.0000 85.0000 18.0000 4.0000 NaN 2.0000 0.9000 135.0000 42.0000 3.5000 NaN 2.0000 0.7000 96.0000 32.0000 4.0000 NaN 2.0000 0.7000 46.0000 52.0000 4.0000 80.0000 2.0000 1.0000 NaN 200.0000 4.0000 NaN Column 19 1.0000 1.0000 1.0000 1.0000 1.0000```

It appears that predictors `2` through `13` are categorical, as well as predictor `19`. You can confirm this inference using the data set description at `http://archive.ics.uci.edu/ml/datasets/Hepatitis`.

6. List the categorical variables.

`catIdx = [2:13,19];`
7. Create a cross-validated ensemble using 150 learners and the `GentleBoost` algorithm.

```Ensemble = fitensemble(X,Y,'GentleBoost',150,t,... 'PredictorNames',VarNames(2:end),'LearnRate',0.1,... 'CategoricalPredictors',catIdx,'KFold',5); figure; plot(kfoldLoss(Ensemble,'Mode','cumulative','LossFun','exponential')); xlabel('Number of trees'); ylabel('Cross-validated exponential loss');```

8. Inspect the confusion matrix to see which patients the ensemble predicts correctly.

```[yFit,sFit] = kfoldPredict(Ensemble); confusionmat(Y,yFit,'Order',ClassNames)```
```ans = 18 14 11 112```

Of the 123 patient who live, the ensemble predicts correctly that 112 will live. But for the 32 patients who die of hepatitis, the ensemble only predicts correctly that about half will die of hepatitis.

9. There are two types of error in the predictions of the ensemble:

• Predicting that the patient lives, but the patient dies

• Predicting that the patient dies, but the patient lives

Suppose you believe that the first error is five times worse than the second. Create a new classification cost matrix that reflects this belief.

```cost.ClassNames = ClassNames; cost.ClassificationCosts = [0 5; 1 0];```
10. Create a new cross-validated ensemble using `cost` as the misclassification cost, and inspect the resulting confusion matrix.

```EnsembleCost = fitensemble(X,Y,'GentleBoost',150,t,... 'PredictorNames',VarNames(2:end),'LearnRate',0.1,... 'CategoricalPredictors',catIdx,'KFold',5,... 'Cost',cost); [yFitCost,sFitCost] = kfoldPredict(EnsembleCost); confusionmat(Y,yFitCost,'Order',ClassNames)```
```ans = 19 13 8 115```

As expected, the new ensemble does a better job classifying the who die. Somewhat surprisingly, the new ensemble also does a better job classifying the who live, though the result is not statistically significantly better. The results of the cross validation are random, so this result is simply a statistical fluctuation. The result seems to indicate that the classification of who live is not very sensitive to the cost.

### Classification with Many Categorical Levels

Generally, you cannot use classification with more than 31 levels in any categorical predictor. However, two boosting algorithms can classify data with many categorical predictor levels and binary responses: `LogitBoost` and `GentleBoost`. For details, see LogitBoost and GentleBoost.

This example uses demographic data from the U.S. Census Bureau, available at `http://archive.ics.uci.edu/ml/machine-learning-databases/adult/`. The objective of the researchers who posted the data is predicting whether an individual makes over \$50,000 a year, based on a set of characteristics. You can see details of the data, including predictor names, in the `adult.names` file at the site.

1. Load the `'adult.data'` file from the UCI Machine Learning Repository. Specify a cell array of strings containing the variable names.

```adult = urlread(['http://archive.ics.uci.edu/ml/'... 'machine-learning-databases/adult/adult.data']); VarNames = {'age' 'workclass' 'fnlwgt' 'education' 'educationNum'... 'maritalStatus' 'occupation' 'relationship' 'race'... 'sex' 'capitalGain' 'capitalLoss'... 'hoursPerWeek' 'nativeCountry' 'income'}; ```
2. `adult.data` represents missing data as `'?'`. Replace instances of missing data with an empty string. Use `textscan` to put the data into a cell array of strings.

```adult = strrep(adult,'?',''); adult = textscan(adult,'%f%s%f%s%f%s%s%s%s%s%f%f%f%s%s',... 'Delimiter',',','TreatAsEmpty','');```

The name-value pair argument `TreatAsEmpty` converts all observations corresponding to numeric variables to `NaN` if the observation is an empty string.

3. Since the variables are heterogeneous, put the set into a MATLAB® table.

```adult = table(adult{:},'VariableNames',VarNames); ```
4. Some categorical variables have many levels. Plot the number of levels of each categorical predictor.

```cat = varfun(@iscellstr,adult(:,1:end - 1),... 'OutputFormat','uniform'); % Logical flag for categorical variables catVars = find(cat); % Indices of categorical variables countCats = @(var)numel(categories(nominal(var))); numCat = varfun(@(var)countCats(var),adult(:,catVars),... 'OutputFormat','uniform'); figure; barh(numCat); h = gca; h.YTickLabel = VarNames(catVars); ylabel 'Predictor'; xlabel 'Number of categories'; ```

The anonymous function `countCats` converts a predictor to a nominal array, then counts the unique, nonempty categories of the predictor. Predictor 14 (`'nativeCountry'`) has more than 40 categorical levels. For binary classification, `fitctree` uses a computational shortcut to find an optimal split for categorical predictors with many categories. For classification with more than two classes, you can choose a heuristic algorithm to find a good split. For details, see Splitting Categorical Predictors.

5. Specify the predictor matrix using `classreg.regr.modelutils.predictormatrix` and the response vector.

```X = classreg.regr.modelutils.predictormatrix(adult,'ResponseVar',... size(adult,2)); Y = nominal(adult.income); ```

`X` is a numeric matrix; `predictormatrix` converts all categorical variables into group indices. The name-value pair argument `ResponseVar` indicates that the last column is the response variable, and excludes it from the predictor matrix. `Y` is a nominal, categorical array.

6. Train classification ensembles using both `LogitBoost` and `GentleBoost`.

```rng(1); % For reproducibility LBEnsemble = fitensemble(X,Y,'LogitBoost',300,'Tree',... 'CategoricalPredictors',cat,'PredictorNames',VarNames(1:end-1),... 'ResponseName','income'); GBEnsemble = fitensemble(X,Y,'GentleBoost',300,'Tree',... 'CategoricalPredictors',cat,'PredictorNames',VarNames(1:end-1),... 'ResponseName','income');```
7. Examine the resubstitution error for both ensembles.

```figure; plot(resubLoss(LBEnsemble,'Mode','cumulative')); hold on plot(resubLoss(GBEnsemble,'Mode','cumulative'),'r--'); hold off xlabel('Number of trees'); ylabel('Resubstitution error'); legend('LogitBoost','GentleBoost','Location','NE');```

The `GentleBoost` algorithm has a slightly smaller resubstitution error.

8. Estimate the generalization error for both algorithms by cross validation.

```CVLBEnsemble = crossval(LBEnsemble,'KFold',5); CVGBEnsemble = crossval(GBEnsemble,'KFold',5); figure; plot(kfoldLoss(CVLBEnsemble,'Mode','cumulative')); hold on plot(kfoldLoss(CVGBEnsemble,'Mode','cumulative'),'r--'); hold off xlabel('Number of trees'); ylabel('Cross-validated error'); legend('LogitBoost','GentleBoost','Location','NE');```

The cross-validated loss is nearly the same as the resubstitution error.

### Surrogate Splits

When you have missing data, trees and ensembles of trees give better predictions when they include surrogate splits. Furthermore, estimates of predictor importance are often different with surrogate splits. Eliminating unimportant predictors can save time and memory for predictions, and can make predictions easier to understand.

This example shows the effects of surrogate splits for predictions for data containing missing entries in the test set.

Load sample data. Partition it into a training and test set.

```load ionosphere; rng(10) % for reproducibility cv = cvpartition(Y,'Holdout',0.3); Xtrain = X(training(cv),:); Ytrain = Y(training(cv)); Xtest = X(test(cv),:); Ytest = Y(test(cv)); ```

Bag decision trees with and without surrogate splits.

```b = fitensemble(Xtrain,Ytrain,'Bag',50,'Tree',... 'Type','Class'); templS = templateTree('Surrogate','On'); bs = fitensemble(Xtrain,Ytrain,'Bag',50,templS,... 'Type','Class'); ```

Suppose half of the values in the test set are missing.

```Xtest(rand(size(Xtest))>0.5) = NaN; ```

Test accuracy with and without surrogate splits.

```figure; plot(loss(b,Xtest,Ytest,'Mode','Cumulative')); hold on; plot(loss(bs,Xtest,Ytest,'Mode','Cumulative'),'r--'); legend('Regular trees','Trees with surrogate splits'); xlabel('Number of trees'); ylabel('Test classification error'); ```

Check the statistical significance of the difference in results with the McNemar test. Convert the labels to a `nominal` data type to make it easier to check for equality.

```Yfit = nominal(predict(b,Xtest)); YfitS = nominal(predict(bs,Xtest)); N10 = sum(Yfit==nominal(Ytest) & YfitS~=nominal(Ytest)); N01 = sum(Yfit~=nominal(Ytest) & YfitS==nominal(Ytest)); mcnemar = (abs(N10-N01) - 1)^2/(N10+N01); pval = 1 - chi2cdf(mcnemar,1) ```
```pval = 1.7683e-04 ```

The extremely low p-value indicates that the ensemble with surrogate splits is better in a statistically significant manner.

### LPBoost and TotalBoost for Small Ensembles

This example shows how to obtain the benefits of the `LPBoost` and `TotalBoost` algorithms. These algorithms share two beneficial characteristics:

They are self-terminating, so you don't have to guess how many members to include.

They produce ensembles with some very small weights, so you can safely remove ensemble members.

Note that the algorithms in this example require an Optimization Toolbox™ license.

Load the `ionosphere` data set.

```load ionosphere ```

Create the classification ensembles

Create ensembles for classifying the `ionosphere` data using the `LPBoost`, `TotalBoost`, and, for comparison, `AdaBoostM1` algorithms. It is hard to know how many members to include in an ensemble. For `LPBoost` and `TotalBoost`, try using `500`. For comparison, also use `500` for `AdaBoostM1`.

```rng default % For reproducibility T = 500; adaStump = fitensemble(X,Y,'AdaBoostM1',T,'Tree'); totalStump = fitensemble(X,Y,'TotalBoost',T,'Tree'); lpStump = fitensemble(X,Y,'LPBoost',T,'Tree'); figure; plot(resubLoss(adaStump,'Mode','Cumulative')); hold on plot(resubLoss(totalStump,'Mode','Cumulative'),'r'); plot(resubLoss(lpStump,'Mode','Cumulative'),'g'); hold off xlabel('Number of stumps'); ylabel('Training error'); legend('AdaBoost','TotalBoost','LPBoost','Location','NE'); ```

All three algorithms achieve perfect prediction on the training data after a while.

Examine the number of members in all three ensembles.

```[adaStump.NTrained totalStump.NTrained lpStump.NTrained] ```
```ans = 500 52 69 ```

`AdaBoostM1` trained all `500` members. The other two algorithms stopped training early.

Cross validate the ensembles

Cross validate the ensembles to better determine ensemble accuracy.

```cvlp = crossval(lpStump,'KFold',5); cvtotal = crossval(totalStump,'KFold',5); cvada = crossval(adaStump,'KFold',5); figure; plot(kfoldLoss(cvada,'Mode','Cumulative')); hold on plot(kfoldLoss(cvtotal,'Mode','Cumulative'),'r'); plot(kfoldLoss(cvlp,'Mode','Cumulative'),'g'); hold off xlabel('Ensemble size'); ylabel('Cross-validated error'); legend('AdaBoost','TotalBoost','LPBoost','Location','NE'); ```

It appears that each boosting algorithms achieves 10% or lower loss with 50 ensemble members, and `AdaBoostM1` achieves near 6% error with 150 or more ensemble members.

Compact and remove ensemble members

To reduce the ensemble sizes, compact them, and then use `removeLearners`. The question is, how many learners should you remove? The cross-validated loss curves give you one measure. For another, examine the learner weights for `LPBoost` and `TotalBoost` after compacting.

```cada = compact(adaStump); clp = compact(lpStump); ctotal = compact(totalStump); figure subplot(2,1,1) plot(clp.TrainedWeights) title('LPBoost weights') subplot(2,1,2) plot(ctotal.TrainedWeights) title('TotalBoost weights') ```

Both `LPBoost` and `TotalBoost` show clear points where the ensemble member weights become negligible.

Remove the unimportant ensemble members.

```cada = removeLearners(cada,150:cada.NTrained); clp = removeLearners(clp,60:clp.NTrained); ctotal = removeLearners(ctotal,40:ctotal.NTrained); ```

Check that removing these learners does not affect ensemble accuracy on the training data.

```[loss(cada,X,Y) loss(clp,X,Y) loss(ctotal,X,Y)] ```
```ans = 0 0 0 ```

Check the resulting compact ensemble sizes.

```s(1) = whos('cada'); s(2) = whos('clp'); s(3) = whos('ctotal'); s.bytes ```
```ans = 543067 ans = 216603 ans = 144063 ```

The sizes of the compact ensembles are approximately proportional to the number of members in each.

### Ensemble Regularization

Regularization is a process of choosing fewer weak learners for an ensemble in a way that does not diminish predictive performance. Currently you can regularize regression ensembles. (You can also regularize a discriminant analysis classifier in a non-ensemble context; see Regularize a Discriminant Analysis Classifier.)

The `regularize` method finds an optimal set of learner weights αt that minimize

$\sum _{n=1}^{N}{w}_{n}g\left(\left(\sum _{t=1}^{T}{\alpha }_{t}{h}_{t}\left({x}_{n}\right)\right),{y}_{n}\right)+\lambda \sum _{t=1}^{T}|{\alpha }_{t}|.$

Here

• λ ≥ 0 is a parameter you provide, called the lasso parameter.

• ht is a weak learner in the ensemble trained on N observations with predictors xn, responses yn, and weights wn.

• g(f,y) = (fy)2 is the squared error.

The ensemble is regularized on the same (xn,yn,wn) data used for training, so

$\sum _{n=1}^{N}{w}_{n}g\left(\left(\sum _{t=1}^{T}{\alpha }_{t}{h}_{t}\left({x}_{n}\right)\right),{y}_{n}\right)$

is the ensemble resubstitution error. The error is measured by mean squared error (MSE).

If you use λ = 0, `regularize` finds the weak learner weights by minimizing the resubstitution MSE. Ensembles tend to overtrain. In other words, the resubstitution error is typically smaller than the true generalization error. By making the resubstitution error even smaller, you are likely to make the ensemble accuracy worse instead of improving it. On the other hand, positive values of λ push the magnitude of the αt coefficients to 0. This often improves the generalization error. Of course, if you choose λ too large, all the optimal coefficients are 0, and the ensemble does not have any accuracy. Usually you can find an optimal range for λ in which the accuracy of the regularized ensemble is better or comparable to that of the full ensemble without regularization.

A nice feature of lasso regularization is its ability to drive the optimized coefficients precisely to 0. If a learner's weight αt is 0, this learner can be excluded from the regularized ensemble. In the end, you get an ensemble with improved accuracy and fewer learners.

#### Regularize a Regression Ensemble

This example uses data for predicting the insurance risk of a car based on its many attributes.

Load the `imports-85` data into the MATLAB workspace:

```load imports-85; ```

Look at a description of the data to find the categorical variables and predictor names:

```Description ```
```Description = 1985 Auto Imports Database from the UCI repository http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names Variables have been reordered to place variables with numeric values (referred to as "continuous" on the UCI site) to the left and categorical values to the right. Specifically, variables 1:16 are: symboling, normalized-losses, wheel-base, length, width, height, curb-weight, engine-size, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, highway-mpg, and price. Variables 17:26 are: make, fuel-type, aspiration, num-of-doors, body-style, drive-wheels, engine-location, engine-type, num-of-cylinders, and fuel-system. ```

The objective of this process is to predict the "symboling," the first variable in the data, from the other predictors. "symboling" is an integer from `-3` (good insurance risk) to `3` (poor insurance risk). You could use a classification ensemble to predict this risk instead of a regression ensemble. When you have a choice between regression and classification, you should try regression first.

Prepare the data for ensemble fitting:

```Y = X(:,1); X(:,1) = []; VarNames = {'normalized-losses' 'wheel-base' 'length' 'width' 'height' ... 'curb-weight' 'engine-size' 'bore' 'stroke' 'compression-ratio' ... 'horsepower' 'peak-rpm' 'city-mpg' 'highway-mpg' 'price' 'make' ... 'fuel-type' 'aspiration' 'num-of-doors' 'body-style' 'drive-wheels' ... 'engine-location' 'engine-type' 'num-of-cylinders' 'fuel-system'}; catidx = 16:25; % indices of categorical predictors ```

Create a regression ensemble from the data using 300 default trees:

```ls = fitensemble(X,Y,'LSBoost',300,'Tree','LearnRate',0.1,... 'PredictorNames',VarNames,'ResponseName','Symboling',... 'CategoricalPredictors',catidx) ```
```ls = classreg.learning.regr.RegressionEnsemble PredictorNames: {1x25 cell} ResponseName: 'Symboling' ResponseTransform: 'none' NumObservations: 205 NumTrained: 300 Method: 'LSBoost' LearnerNames: {'Tree'} ReasonForTermination: 'Terminated normally after completing the reques...' FitInfo: [300x1 double] FitInfoDescription: {2x1 cell} Regularization: [] ```

The final line, `Regularization`, is empty ([]). To regularize the ensemble, you have to use the `regularize` method.

```cv = crossval(ls,'KFold',5); figure; plot(kfoldLoss(cv,'Mode','Cumulative')); xlabel('Number of trees'); ylabel('Cross-validated MSE'); ylim([0.2,2]) ```

It appears you might obtain satisfactory performance from a smaller ensemble, perhaps one containing from 50 to 100 trees.

6.Call the `regularize` method to try to find trees that you can remove from the ensemble. By default, `regularize` examines 10 values of the lasso (`Lambda`) parameter spaced exponentially.

```ls = regularize(ls) ```
```ls = classreg.learning.regr.RegressionEnsemble PredictorNames: {1x25 cell} ResponseName: 'Symboling' ResponseTransform: 'none' NumObservations: 205 NumTrained: 300 Method: 'LSBoost' LearnerNames: {'Tree'} ReasonForTermination: 'Terminated normally after completing the reques...' FitInfo: [300x1 double] FitInfoDescription: {2x1 cell} Regularization: [1x1 struct] ```

The `Regularization` property is no longer empty.

Plot the resubstitution mean-squared error (MSE) and number of learners with nonzero weights against the lasso parameter. Separately plot the value at `Lambda = 0`. Use a logarithmic scale because the values of `Lambda` are exponentially spaced.

```figure; semilogx(ls.Regularization.Lambda,ls.Regularization.ResubstitutionMSE); line([1e-3 1e-3],[ls.Regularization.ResubstitutionMSE(1) ... ls.Regularization.ResubstitutionMSE(1)],... 'Marker','x','Markersize',12,'Color','b'); r0 = resubLoss(ls); line([ls.Regularization.Lambda(2) ls.Regularization.Lambda(end)],... [r0 r0],'Color','r','LineStyle','--'); xlabel('Lambda'); ylabel('Resubstitution MSE'); annotation('textbox',[0.5 0.22 0.5 0.05],'String','unregularized ensemble',... 'Color','r','FontSize',14,'LineStyle','none'); figure; loglog(ls.Regularization.Lambda,sum(ls.Regularization.TrainedWeights>0,1)); line([1e-3 1e-3],... [sum(ls.Regularization.TrainedWeights(:,1)>0) ... sum(ls.Regularization.TrainedWeights(:,1)>0)],... 'marker','x','markersize',12,'color','b'); line([ls.Regularization.Lambda(2) ls.Regularization.Lambda(end)],... [ls.NTrained ls.NTrained],... 'color','r','LineStyle','--'); xlabel('Lambda'); ylabel('Number of learners'); annotation('textbox',[0.3 0.8 0.5 0.05],'String','unregularized ensemble',... 'color','r','FontSize',14,'LineStyle','none'); ```

The resubstitution MSE values are likely to be overly optimistic. To obtain more reliable estimates of the error associated with various values of `Lambda`, cross validate the ensemble using `cvshrink`. Plot the resulting cross-validation loss (MSE) and number of learners against `Lambda`.

```rng(0,'Twister') % for reproducibility [mse,nlearn] = cvshrink(ls,'Lambda',ls.Regularization.Lambda,'KFold',5); figure; semilogx(ls.Regularization.Lambda,ls.Regularization.ResubstitutionMSE); hold; semilogx(ls.Regularization.Lambda,mse,'r--'); hold off; xlabel('Lambda'); ylabel('Mean squared error'); legend('resubstitution','cross-validation','Location','NW'); line([1e-3 1e-3],[ls.Regularization.ResubstitutionMSE(1) ... ls.Regularization.ResubstitutionMSE(1)],... 'Marker','x','Markersize',12,'Color','b'); line([1e-3 1e-3],[mse(1) mse(1)],'Marker','o',... 'Markersize',12,'Color','r','LineStyle','--'); figure; loglog(ls.Regularization.Lambda,sum(ls.Regularization.TrainedWeights>0,1)); hold; loglog(ls.Regularization.Lambda,nlearn,'r--'); hold off; xlabel('Lambda'); ylabel('Number of learners'); legend('resubstitution','cross-validation','Location','NE'); line([1e-3 1e-3],... [sum(ls.Regularization.TrainedWeights(:,1)>0) ... sum(ls.Regularization.TrainedWeights(:,1)>0)],... 'Marker','x','Markersize',12,'Color','b'); line([1e-3 1e-3],[nlearn(1) nlearn(1)],'marker','o',... 'Markersize',12,'Color','r','LineStyle','--'); ```
```Warning: Some folds do not have any trained weak learners. Warning: Some folds do not have any trained weak learners. Warning: Some folds do not have any trained weak learners. Current plot held Current plot held ```

Examining the cross-validated error shows that the cross-validation MSE is almost flat for `Lambda` up to a bit over `1e-2`.

Examine `ls.Regularization.Lambda` to find the highest value that gives MSE in the flat region (up to a bit over `1e-2`):

```jj = 1:length(ls.Regularization.Lambda); [jj;ls.Regularization.Lambda] ```
```ans = Columns 1 through 7 1.0000 2.0000 3.0000 4.0000 5.0000 6.0000 7.0000 0 0.0014 0.0033 0.0077 0.0183 0.0435 0.1031 Columns 8 through 10 8.0000 9.0000 10.0000 0.2446 0.5800 1.3754 ```

Element `5` of `ls.Regularization.Lambda` has value `0.0183`, the largest in the flat range.

Reduce the ensemble size using the `shrink` method. `shrink` returns a compact ensemble with no training data. The generalization error for the new compact ensemble was already estimated by cross validation in `mse(5)`.

```cmp = shrink(ls,'weightcolumn',5) ```
```cmp = classreg.learning.regr.CompactRegressionEnsemble PredictorNames: {1x25 cell} ResponseName: 'Symboling' ResponseTransform: 'none' NumTrained: 15 ```

There are only 15 trees in the new ensemble, notably reduced from the 300 in `ls`.

Compare the sizes of the ensembles:

```sz(1) = whos('cmp'); sz(2) = whos('ls'); [sz(1).bytes sz(2).bytes] ```
```ans = 84544 1775699 ```

The reduced ensemble is about 5% the size of the original.

Compare the MSE of the reduced ensemble to that of the original ensemble:

```figure; plot(kfoldLoss(cv,'mode','cumulative')); hold on plot(cmp.NTrained,mse(5),'ro','MarkerSize',12); xlabel('Number of trees'); ylabel('Cross-validated MSE'); legend('unregularized ensemble','regularized ensemble',... 'Location','NE'); hold off ```

The reduced ensemble gives low loss while using many fewer trees.

### Tune RobustBoost

The `RobustBoost` algorithm can make good classification predictions even when the training data has noise. However, the default `RobustBoost` parameters can produce an ensemble that does not predict well. This example shows one way of tuning the parameters for better predictive accuracy.

Note that `RobustBoost` requires an Optimization Toolbox™ license.

Generate data with label noise. This example has twenty uniform random numbers per observation, and classifies the observation as `1` if the sum of the first five numbers exceeds 2.5 (so is larger than average), and `0` otherwise:

```rng(0,'twister') % for reproducibility Xtrain = rand(2000,20); Ytrain = sum(Xtrain(:,1:5),2) > 2.5; ```

To add noise, randomly switch 10% of the classifications:

```idx = randsample(2000,200); Ytrain(idx) = ~Ytrain(idx); ```

Create an ensemble with `AdaBoostM1` for comparison purposes:

```ada = fitensemble(Xtrain,Ytrain,'AdaBoostM1',... 300,'Tree','LearnRate',0.1); ```

Create an ensemble with `RobustBoost`. Because the data has 10% incorrect classification, perhaps an error goal of 15% is reasonable.

```rb1 = fitensemble(Xtrain,Ytrain,'RobustBoost',300,... 'Tree','RobustErrorGoal',0.15,'RobustMaxMargin',1); ```

Note that if you set the error goal to a high enough value, then the software returns an error.

Create an ensemble with very optimistic error goal, `0.01`:

```rb2 = fitensemble(Xtrain,Ytrain,'RobustBoost',300,... 'Tree','RobustErrorGoal',0.01); ```

Compare the resubstitution error of the four ensembles:

```figure plot(resubLoss(rb1,'Mode','Cumulative')); hold on plot(resubLoss(rb2,'Mode','Cumulative'),'r--'); plot(resubLoss(ada,'Mode','Cumulative'),'g.'); hold off; xlabel('Number of trees'); ylabel('Resubstitution error'); legend('ErrorGoal=0.15','ErrorGoal=0.01',... 'AdaBoostM1','Location','NE'); ```

All the `RobustBoost` curves show lower resubstitution error than the `AdaBoostM1` curve. The error goal of `0.15` curve shows the lowest resubstitution error over most of the range.

```Xtest = rand(2000,20); Ytest = sum(Xtest(:,1:5),2) > 2.5; idx = randsample(2000,200); Ytest(idx) = ~Ytest(idx); figure; plot(loss(rb1,Xtest,Ytest,'Mode','Cumulative')); hold on plot(loss(rb2,Xtest,Ytest,'Mode','Cumulative'),'r--'); plot(loss(ada,Xtest,Ytest,'Mode','Cumulative'),'g.'); hold off; xlabel('Number of trees'); ylabel('Test error'); legend('ErrorGoal=0.15','ErrorGoal=0.01',... 'AdaBoostM1','Location','NE'); ```

The error curve for error goal 0.15 is lowest (best) in the plotted range. `AdaBoostM1` has higher error than the curve for error goal 0.15. The curve for the too-optimistic error goal 0.01 remains substantially higher (worse) than the other algorithms for most of the plotted range.

### Random Subspace Classification

This example shows how to use a random subspace ensemble to increase the accuracy of classification. It also shows how to use cross validation to determine good parameters for both the weak learner template and the ensemble.

Load the `ionosphere` data. This data has 351 binary responses to 34 predictors.

```load ionosphere; [N,D] = size(X) resp = unique(Y) ```
```N = 351 D = 34 resp = 'b' 'g' ```

Choose the number of nearest neighbors

Find a good choice for `k`, the number of nearest neighbors in the classifier, by cross validation. Choose the number of neighbors approximately evenly spaced on a logarithmic scale.

```rng(8000,'twister') % for reproducibility K = round(logspace(0,log10(N),10)); % number of neighbors cvloss = zeros(numel(K),1); for k=1:numel(K) knn = fitcknn(X,Y,... 'NumNeighbors',K(k),'CrossVal','On'); cvloss(k) = kfoldLoss(knn); end figure; % Plot the accuracy versus k semilogx(K,cvloss); xlabel('Number of nearest neighbors'); ylabel('10 fold classification error'); title('k-NN classification'); ```

The lowest cross-validation error occurs for `k = 2`.

Create the ensembles

Create ensembles for `2`-nearest neighbor classification with various numbers of dimensions, and examine the cross-validated loss of the resulting ensembles.

This step takes a long time. To keep track of the progress, print a message as each dimension finishes.

```NPredToSample = round(linspace(1,D,10)); % linear spacing of dimensions cvloss = zeros(numel(NPredToSample),1); learner = templateKNN('NumNeighbors',2); for npred=1:numel(NPredToSample) subspace = fitensemble(X,Y,'Subspace',100,learner,... 'NPredToSample',NPredToSample(npred),'CrossVal','On'); cvloss(npred) = kfoldLoss(subspace); fprintf('Random Subspace %i done.\n',npred); end figure; % plot the accuracy versus dimension plot(NPredToSample,cvloss); xlabel('Number of predictors selected at random'); ylabel('10 fold classification error'); title('k-NN classification with Random Subspace'); ```
```Random Subspace 1 done. Random Subspace 2 done. Random Subspace 3 done. Random Subspace 4 done. Random Subspace 5 done. Random Subspace 6 done. Random Subspace 7 done. Random Subspace 8 done. Random Subspace 9 done. Random Subspace 10 done. ```

The ensembles that use five and eight predictors per learner have the lowest cross-validated error. The error rate for these ensembles is about 0.06, while the other ensembles have cross-validated error rates that are approximately 0.1 or more.

Find a good ensemble size

Find the smallest number of learners in the ensemble that still give good classification.

```ens = fitensemble(X,Y,'Subspace',100,learner,... 'NPredToSample',5,'CrossVal','on'); figure; % Plot the accuracy versus number in ensemble plot(kfoldLoss(ens,'Mode','Cumulative')) xlabel('Number of learners in ensemble'); ylabel('10 fold classification error'); title('k-NN classification with Random Subspace'); ```

There seems to be no advantage in an ensemble with more than 50 or so learners. It is possible that 25 learners gives good predictions.

Create a final ensemble

Construct a final ensemble with 50 learners. Compact the ensemble and see if the compacted version saves an appreciable amount of memory.

```ens = fitensemble(X,Y,'Subspace',50,learner,... 'NPredToSample',5); cens = compact(ens); s1 = whos('ens'); s2 = whos('cens'); [s1.bytes s2.bytes] % si.bytes = size in bytes ```
```ans = 1756230 1525678 ```

The compact ensemble is about 10% smaller than the full ensemble. Both give the same predictions.

### TreeBagger Examples

`TreeBagger` ensembles have more functionality than those constructed with `fitensemble`; see TreeBagger Features Not in fitensemble. Also, some property and method names differ from their `fitensemble` counterparts. This section contains examples of workflow for regression and classification that use this extra `TreeBagger` functionality.

#### Regression of Insurance Risk Rating for Car Imports Using TreeBagger

In this example, use a database of 1985 car imports with 205 observations, 25 predictors, and 1 response, which is insurance risk rating, or "symboling." The first 15 variables are numeric and the last 10 are categorical. The symboling index takes integer values from -3 to 3.

Load the data set and split it into predictor and response arrays.

```load imports-85; Y = X(:,1); X = X(:,2:end); isCategorical = [zeros(15,1);ones(size(X,2)-15,1)]; % Categorical variable flag ```

Because bagging uses randomized data drawings, its exact outcome depends on the initial random seed. To reproduce the results in this example, use the random stream settings.

```rng(1945,'twister') ```

Finding the Optimal Leaf Size

For regression, the general rule is to the set leaf size to 5 and select one third of the input features for decision splits at random. In the following step, verify the optimal leaf size by comparing mean squared errors obtained by regression for various leaf sizes. `oobError` computes MSE versus the number of grown trees. You must set `OOBPred` to `'On'` to obtain out-of-bag predictions later.

```leaf = [5 10 20 50 100]; col = 'rbcmy'; figure for i=1:length(leaf) b = TreeBagger(50,X,Y,'Method','R','OOBPred','On',... 'CategoricalPredictors',find(isCategorical == 1),'MinLeaf',leaf(i)); plot(oobError(b),col(i)); hold on; end xlabel 'Number of Grown Trees'; ylabel 'Mean Squared Error' ; legend({'5' '10' '20' '50' '100'},'Location','NorthEast'); hold off; ```

The red curve (leaf size 5) yields the lowest MSE values.

Estimating Feature Importance

In practical applications, you typically grow ensembles with hundreds of trees. For example, the previous code block uses 50 trees for faster processing. Now that you have estimated the optimal leaf size, grow a larger ensemble with 100 trees and use it to estimate feature importance.

```b = TreeBagger(100,X,Y,'Method','R','OOBVarImp','On',... 'CategoricalPredictors',find(isCategorical == 1),... 'MinLeaf',5); ```

Inspect the error curve again to make sure nothing went wrong during training.

```figure plot(oobError(b)); xlabel 'Number of Grown Trees'; ylabel 'Out-of-Bag Mean Squared Error'; ```

Prediction ability should depend more on important features than unimportant features. You can use this idea to measure feature importance.

For each feature, permute the values of this feature across every observation in the data set and measure how much worse the MSE becomes after the permutation. You can repeat this for each feature.

Using the following code, plot the increase in MSE due to permuting out-of-bag observations across each input variable. The `OOBPermutedVarDeltaError` array stores the increase in MSE averaged over all trees in the ensemble and divided by the standard deviation taken over the trees, for each variable. The larger this value, the more important the variable. Imposing an arbitrary cutoff at 0.7, you can select the five most important features.

```figure bar(b.OOBPermutedVarDeltaError); xlabel 'Feature Number' ; ylabel 'Out-of-Bag Feature Importance'; idxvar = find(b.OOBPermutedVarDeltaError>0.7) idxCategorical = find(isCategorical(idxvar)==1); ```
```idxvar = 1 16 19 ```

The `OOBIndices` property of `TreeBagger` tracks which observations are out of bag for what trees. Using this property, you can monitor the fraction of observations in the training data that are in bag for all trees. The curve starts at approximately 2/3, which is the fraction of unique observations selected by one bootstrap replica, and goes down to 0 at approximately 10 trees.

```finbag = zeros(1,b.NTrees); for t=1:b.NTrees finbag(t) = sum(all(~b.OOBIndices(:,1:t),2)); end finbag = finbag / size(X,1); figure plot(finbag); xlabel 'Number of Grown Trees'; ylabel 'Fraction of In-Bag Observations'; ```

Growing Trees on a Reduced Set of Features

Using just the five most powerful features, determine if it is possible to obtain a similar predictive power. To begin, grow 100 trees on these features only. The first three of the five selected features are numeric and the last two are categorical.

```b5v = TreeBagger(100,X(:,idxvar),Y,'Method','R',... 'OOBVarImp','On','CategoricalPredictors',idxCategorical,... 'MinLeaf',5); figure plot(oobError(b5v)); xlabel 'Number of Grown Trees'; ylabel 'Out-of-Bag Mean Squared Error'; figure bar(b5v.OOBPermutedVarDeltaError); xlabel 'Feature Index'; ylabel 'Out-of-Bag Feature Importance'; ```

These five most powerful features give the same MSE as the full set, and the ensemble trained on the reduced set ranks these features similarly to each other. If you remove features 1 and 2 from the reduced set, then the predictive power of the algorithm might not decrease significantly.

Finding Outliers

To find outliers in the training data, compute the proximity matrix using `fillProximities`.

```b5v = fillProximities(b5v); ```

The method normalizes this measure by subtracting the mean outlier measure for the entire sample. Then it takes the magnitude of this difference and divides the result by the median absolute deviation for the entire sample.

```figure histogram(b5v.OutlierMeasure); xlabel 'Outlier Measure'; ylabel 'Number of Observations'; ```

Discovering Clusters in the Data

By applying multidimensional scaling to the computed matrix of proximities, you can inspect the structure of the input data and look for possible clusters of observations. The `mdsProx` method returns scaled coordinates and eigenvalues for the computed proximity matrix. If you run it with the `Colors` name-value-pair argument, then this method creates a scatter plot of two scaled coordinates.

```figure(8); [~,e] = mdsProx(b5v,'Colors','K'); xlabel 'First Scaled Coordinate'; ylabel 'Second Scaled Coordinate'; ```

Assess the relative importance of the scaled axes by plotting the first 20 eigenvalues.

```figure bar(e(1:20)); xlabel 'Scaled Coordinate Index'; ylabel 'Eigenvalue'; ```

Saving the Ensemble Configuration for Future Use

To use the trained ensemble for predicting the response on unseen data, store the ensemble to disk and retrieve it later. If you do not want to compute predictions for out-of-bag data or reuse training data in any other way, there is no need to store the ensemble object itself. Saving the compact version of the ensemble is enough in this case. Extract the compact object from the ensemble.

```c = compact(b5v) ```
```c = CompactTreeBagger Ensemble with 100 bagged decision trees: Method: regression Nvars: 3 ```

You can save the resulting `CompactTreeBagger` model in a `*.mat` file.

#### Classifying Radar Returns for Ionosphere Data Using TreeBagger

You can also use ensembles of decision trees for classification. For this example, use ionosphere data with 351 observations and 34 real-valued predictors. The response variable is categorical with two levels:

• 'g' represents good radar returns.

The goal is to predict good or bad returns using a set of 34 measurements.

Fix the initial random seed, grow 50 trees, inspect how the ensemble error changes with accumulation of trees, and estimate feature importance. For classification, it is best to set the minimal leaf size to 1 and select the square root of the total number of features for each decision split at random. These settings are defaults for `TreeBagger` used for classification.

```load ionosphere; rng(1945,'twister') b = TreeBagger(50,X,Y,'OOBVarImp','On'); figure plot(oobError(b)); xlabel('Number of Grown Trees'); ylabel('Out-of-Bag Classification Error'); ```

The method trains ensembles with few trees on observations that are in bag for all trees. For such observations, it is impossible to compute the true out-of-bag prediction, and `TreeBagger` returns the most probable class for classification and the sample mean for regression. You can change the default value returned for in-bag observations using the `DefaultYfit` property. If you set the default value to an empty string for classification, the method excludes in-bag observations from computation of the out-of-bag error. In this case, the curve is more variable when the number of trees is small, either because some observations are never out of bag (and are therefore excluded) or because their predictions are based on few trees.

```b.DefaultYfit = ''; figure plot(oobError(b)); xlabel('Number of Grown Trees'); ylabel('Out-of-Bag Error Excluding In-Bag Observations'); ```

The `OOBIndices` property of `TreeBagger` tracks which observations are out of bag for what trees. Using this property, you can monitor the fraction of observations in the training data that are in bag for all trees. The curve starts at approximately 2/3, which is the fraction of unique observations selected by one bootstrap replica, and goes down to 0 at approximately 10 trees.

```finbag = zeros(1,b.NTrees); for t=1:b.NTrees finbag(t) = sum(all(~b.OOBIndices(:,1:t),2)); end finbag = finbag / size(X,1); figure plot(finbag); xlabel('Number of Grown Trees'); ylabel('Fraction of In-Bag Observations'); ```

Estimate feature importance.

```figure bar(b.OOBPermutedVarDeltaError); xlabel('Feature Index'); ylabel('Out-of-Bag Feature Importance'); idxvar = find(b.OOBPermutedVarDeltaError>0.75) ```
```idxvar = 3 5 8 27 ```

Having selected the five most important features, grow a larger ensemble on the reduced feature set. Save time by not permuting out-of-bag observations to obtain new estimates of feature importance for the reduced feature set (set `OOBVarImp` to `'off'`). You would still be interested in obtaining out-of-bag estimates of classification error (set `OOBPred` to `'on'`).

```b5v = TreeBagger(100,X(:,idxvar),Y,'OOBVarImp','off','OOBPred','on'); figure plot(oobError(b5v)); xlabel('Number of Grown Trees'); ylabel('Out-of-Bag Classification Error'); ```

For classification ensembles, in addition to classification error (fraction of misclassified observations), you can also monitor the average classification margin. For each observation, the margin is defined as the difference between the score for the true class and the maximal score for other classes predicted by this tree. The cumulative classification margin uses the scores averaged over all trees and the mean cumulative classification margin is the cumulative margin averaged over all observations. The `oobMeanMargin` method with the `'mode'` argument set to `'cumulative'` (default) shows how the mean cumulative margin changes as the ensemble grows: every new element in the returned array represents the cumulative margin obtained by including a new tree in the ensemble. If training is successful, you would expect to see a gradual increase in the mean classification margin.

The method trains ensembles with few trees on observations that are in bag for all trees. For such observations, it is impossible to compute the true out-of-bag prediction, and `TreeBagger` returns the most probable class for classification and the sample mean for regression.

For decision trees, a classification score is the probability of observing an instance of this class in this tree leaf. For example, if the leaf of a grown decision tree has five `'good'` and three `'bad'` training observations in it, the scores returned by this decision tree for any observation fallen on this leaf are 5/8 for the `'good'` class and 3/8 for the `'bad'` class. These probabilities are called `'scores'` for consistency with other classifiers that might not have an obvious interpretation for numeric values of returned predictions.

```figure plot(oobMeanMargin(b5v)); xlabel('Number of Grown Trees'); ylabel('Out-of-Bag Mean Classification Margin'); ```

Compute the matrix of proximities and examine the distribution of outlier measures. Unlike regression, outlier measures for classification ensembles are computed within each class separately.

```b5v = fillProximities(b5v); figure histogram(b5v.OutlierMeasure); xlabel('Outlier Measure'); ylabel('Number of Observations'); ```

Slightly more than half of the extreme outliers are labeled `'bad'`.

```extremeOutliers = b5v.Y(b5v.OutlierMeasure>40) percentBad = 100*sum(strcmp(extremeOutliers,'b'))/numel(extremeOutliers) ```
```extremeOutliers = 'g' 'g' 'g' 'g' 'g' 'g' percentBad = 0 ```

As for regression, you can plot scaled coordinates, displaying the two classes in different colors using the 'Colors' name-value pair argument of `mdsProx`. This argument takes a string in which every character represents a color. The software does not rank class names. Therefore, it is best practice to determine the position of the classes in the `ClassNames` property of the ensemble.

```gPosition = find(strcmp('g',b5v.ClassNames)) ```
```gPosition = 2 ```

The `'bad'` class is first and the `'good'` class is second. Display scaled coordinates using red for the `'bad'` class and blue for the `'good'` class observations.

```figure [s,e] = mdsProx(b5v,'Colors','rb'); xlabel('First Scaled Coordinate'); ylabel('Second Scaled Coordinate'); ```

Plot the first 20 eigenvalues obtained by scaling. The first eigenvalue clearly dominates and the first scaled coordinate is most important.

```figure bar(e(1:20)); xlabel('Scaled Coordinate Index'); ylabel('Eigenvalue'); ```

Another way of exploring the performance of a classification ensemble is to plot its Receiver Operating Characteristic (ROC) curve or another performance curve suitable for the current problem. Obtain predictions for out-of-bag observations. For a classification ensemble, the `oobPredict` method returns a cell array of classification labels as the first output argument and a numeric array of scores as the second output argument. The returned array of scores has two columns, one for each class. In this case, the first column is for the `'bad'` class and the second column is for the `'good'` class. One column in the score matrix is redundant because the scores represent class probabilities in tree leaves and by definition add up to 1.

```[Yfit,Sfit] = oobPredict(b5v); ```

Use `perfcurve` to compute a performance curve. By default, `perfcurve` returns the standard ROC curve, which is the true positive rate versus the false positive rate. `perfcurve` requires true class labels, scores, and the positive class label for input. In this case, choose the `'good'` class as positive.

```[fpr,tpr] = perfcurve(b5v.Y,Sfit(:,gPosition),'g'); figure plot(fpr,tpr); xlabel('False Positive Rate'); ylabel('True Positive Rate'); ```

Instead of the standard ROC curve, you might want to plot, for example, ensemble accuracy versus threshold on the score for the `'good'` class. The `ycrit` input argument of `perfcurve` lets you specify the criterion for the `y`-axis, and the third output argument of `perfcurve` returns an array of thresholds for the positive class score. Accuracy is the fraction of correctly classified observations, or equivalently, 1 minus the classification error.

```[fpr,accu,thre] = perfcurve(b5v.Y,Sfit(:,gPosition),'g','YCrit','Accu'); figure(20); plot(thre,accu); xlabel('Threshold for ''good'' Returns'); ylabel('Classification Accuracy'); ```

The curve shows a flat region indicating that any threshold from 0.2 to 0.6 is a reasonable choice. By default, the `perfcurve` assigns classification labels using 0.5 as the boundary between the two classes. You can find exactly what accuracy this corresponds to.

```accu(abs(thre-0.5)<eps) ```
```ans = Empty matrix: 0-by-1 ```

The maximal accuracy is a little higher than the default one.

```[maxaccu,iaccu] = max(accu) ```
```maxaccu = 0.9316 iaccu = 104 ```

The optimal threshold is therefore.

```thre(iaccu) ```
```ans = 0.4570 ```

### Ensemble Algorithms

`AdaBoostM1` is a very popular boosting algorithm for binary classification. The algorithm trains learners sequentially. For every learner with index t, `AdaBoostM1` computes the weighted classification error

${\epsilon }_{t}=\sum _{n=1}^{N}{d}_{n}^{\left(t\right)}\mathbb{I}\left({y}_{n}\ne {h}_{t}\left({x}_{n}\right)\right),$

where

• xn is a vector of predictor values for observation n.

• yn is the true class label.

• ht is the prediction of learner (hypothesis) with index t.

• $\mathbb{I}$ is the indicator function.

• ${d}_{n}^{\left(t\right)}$ is the weight of observation n at step t.

`AdaBoostM1` then increases weights for observations misclassified by learner t and reduces weights for observations correctly classified by learner t. The next learner t + 1 is then trained on the data with updated weights ${d}_{n}^{\left(t+1\right)}$.

After training finishes, `AdaBoostM1` computes prediction for new data using

$f\left(x\right)=\sum _{t=1}^{T}{\alpha }_{t}{h}_{t}\left(x\right),$

where

${\alpha }_{t}=\frac{1}{2}\mathrm{log}\frac{1-{\epsilon }_{t}}{{\epsilon }_{t}}$

are weights of the weak hypotheses in the ensemble.

Training by `AdaBoostM1` can be viewed as stagewise minimization of the exponential loss

$\sum _{n=1}^{N}{w}_{n}\mathrm{exp}\left(-{y}_{n}f\left({x}_{n}\right)\right),$

where

• yn ∊ {–1,+1} is the true class label.

• wn are observation weights normalized to add up to 1.

• f(xn) ∊ (–∞,+∞) is the predicted classification score.

The observation weights wn are the original observation weights you passed to `fitensemble`.

The second output from the `predict` method of an `AdaBoostM1` classification ensemble is an N-by-2 matrix of classification scores for the two classes and N observations. The second column in this matrix is always equal to minus the first column. `predict` returns two scores to be consistent with multiclass models, though this is redundant because the second column is always the negative of the first.

Most often `AdaBoostM1` is used with decision stumps (default) or shallow trees. If boosted stumps give poor performance, try setting the minimal parent node size to one quarter of the training data.

By default, the learning rate for boosting algorithms is `1`. If you set the learning rate to a lower number, the ensemble learns at a slower rate, but can converge to a better solution. `0.1` is a popular choice for the learning rate. Learning at a rate less than `1` is often called "shrinkage".

For examples using `AdaBoostM1`, see Tune RobustBoost.

For references related to `AdaBoostM1`, see Freund and Schapire [16], Schapire et al. [26], Friedman, Hastie, and Tibshirani [18], and Friedman [17].

`AdaBoostM2` is an extension of `AdaBoostM1` for multiple classes. Instead of weighted classification error, `AdaBoostM2` uses weighted pseudo-loss for N observations and K classes

${\epsilon }_{t}=\frac{1}{2}\sum _{n=1}^{N}\sum _{k\ne {y}_{n}}{d}_{n,k}^{\left(t\right)}\left(1-{h}_{t}\left({x}_{n},{y}_{n}\right)+{h}_{t}\left({x}_{n},k\right)\right),$

where

• ht(xn,k) is the confidence of prediction by learner at step t into class k ranging from 0 (not at all confident) to 1 (highly confident).

• ${d}_{n,k}^{\left(t\right)}$ are observation weights at step t for class k.

• yn is the true class label taking one of the K values.

• The second sum is over all classes other than the true class yn.

Interpreting the pseudo-loss is harder than classification error, but the idea is the same. Pseudo-loss can be used as a measure of the classification accuracy from any learner in an ensemble. Pseudo-loss typically exhibits the same behavior as a weighted classification error for `AdaBoostM1`: the first few learners in a boosted ensemble give low pseudo-loss values. After the first few training steps, the ensemble begins to learn at a slower pace, and the pseudo-loss value approaches 0.5 from below.

For examples using `AdaBoostM2`, see Train a Classification Ensemble.

For references related to `AdaBoostM2`, see Freund and Schapire [16].

#### Bag

Bagging, which stands for "bootstrap aggregation," is a type of ensemble learning. To bag a weak learner such as a decision tree on a dataset, generate many bootstrap replicas of this dataset and grow decision trees on these replicas. Obtain each bootstrap replica by randomly selecting `N` observations out of `N` with replacement, where `N` is the dataset size. To find the predicted response of a trained ensemble, take an average over predictions from individual trees.

Bagged decision trees were introduced in MATLAB R2009a as `TreeBagger`. The `fitensemble` function lets you bag in a manner consistent with boosting. An ensemble of bagged trees, either `ClassificationBaggedEnsemble` or `RegressionBaggedEnsemble`, returned by `fitensemble` offers almost the same functionally as `TreeBagger`. Discrepancies between `TreeBagger` and the new framework are described in detail in TreeBagger Features Not in fitensemble.

Bagging works by training learners on resampled versions of the data. This resampling is usually done by bootstrapping observations, that is, selecting N out of N observations with replacement for every new learner. In addition, every tree in the ensemble can randomly select predictors for decision splits—a technique known to improve the accuracy of bagged trees.

By default, the minimal leaf sizes for bagged trees are set to `1` for classification and `5` for regression. Trees grown with the default leaf size are usually very deep. These settings are close to optimal for the predictive power of an ensemble. Often you can grow trees with larger leaves without losing predictive power. Doing so reduces training and prediction time, as well as memory usage for the trained ensemble.

Another important parameter is the number of predictors selected at random for every decision split. This random selection is made for every split, and every deep tree involves many splits. By default, this parameter is set to a square root of the number of predictors for classification, and one third of predictors for regression.

Several features of bagged decision trees make them a unique algorithm. Drawing `N` out of `N` observations with replacement omits on average 37% of observations for each decision tree. These are "out-of-bag" observations. You can use them to estimate the predictive power and feature importance. For each observation, you can estimate the out-of-bag prediction by averaging over predictions from all trees in the ensemble for which this observation is out of bag. You can then compare the computed prediction against the observed response for this observation. By comparing the out-of-bag predicted responses against the observed responses for all observations used for training, you can estimate the average out-of-bag error. This out-of-bag average is an unbiased estimator of the true ensemble error. You can also obtain out-of-bag estimates of feature importance by randomly permuting out-of-bag data across one variable or column at a time and estimating the increase in the out-of-bag error due to this permutation. The larger the increase, the more important the feature. Thus, you need not supply test data for bagged ensembles because you obtain reliable estimates of the predictive power and feature importance in the process of training, which is an attractive feature of bagging.

Another attractive feature of bagged decision trees is the proximity matrix. Every time two observations land on the same leaf of a tree, their proximity increases by 1. For normalization, sum these proximities over all trees in the ensemble and divide by the number of trees. The resulting matrix is symmetric with diagonal elements equal to 1 and off-diagonal elements ranging from 0 to 1. You can use this matrix for finding outlier observations and discovering clusters in the data through multidimensional scaling.

For examples using bagging, see:

For references related to bagging, see Breiman [7], [8], and [9].

Comparison of TreeBagger and Bagged Ensembles.  `fitensemble` produces bagged ensembles that have most, but not all, of the functionality of `TreeBagger` objects. Additionally, some functionality has different names in the new bagged ensembles.

TreeBagger Features Not in fitensemble

FeatureTreeBagger PropertyTreeBagger Method
Computation of proximity matrix`Proximity``fillProximities`, `mdsProx`
Computation of outliers`OutlierMeasure`N/A
Out-of-bag estimates of predictor importance`OOBPermutedVarDeltaError`, `OOBPermutedVarDeltaMeanMargin`, `OOBPermutedVarCountRaiseMargin`N/A
Merging two ensembles trained separatelyN/A`append`
Parallel computation for creating ensembleSet the `UseParallel` name-value pair to `true`N/A

Differing Names Between TreeBagger and Bagged Ensembles

FeatureTreeBaggerBagged Ensembles
Split criterion contributions for each predictor`DeltaCritDecisionSplit` propertyFirst output of `predictorImportance` (classification) or `predictorImportance` (regression)
Predictor associations`VarAssoc` propertySecond output of `predictorImportance` (classification) or `predictorImportance` (regression)
Error (misclassification probability or mean-squared error)`error` and `oobError` methods`loss` and `oobLoss` methods (classification); `loss` and `oobLoss` methods (regression)
Train additional trees and add to ensemble`growTrees` method`resume` method (classification); `resume` method (regression)
Mean classification margin per tree`meanMargin` and `oobMeanMargin` methods`edge` and `oobEdge` methods (classification)

In addition, two important changes were made to training and prediction for bagged classification ensembles:

• If you pass a misclassification cost matrix to `TreeBagger`, it passes this matrix along to the trees. If you pass a misclassification cost matrix to `fitensemble`, it uses this matrix to adjust the class prior probabilities. `fitensemble` then passes the adjusted prior probabilities and the default cost matrix to the trees. The default cost matrix is `ones(K)-eye(K)` for `K` classes.

• Unlike the `loss` and `edge` methods in the new framework, the `TreeBagger` `error` and `meanMargin` methods do not normalize input observation weights of the prior probabilities in the respective class.

#### GentleBoost

`GentleBoost` (also known as Gentle AdaBoost) combines features of `AdaBoostM1` and `LogitBoost`. Like `AdaBoostM1`, `GentleBoost` minimizes the exponential loss. But its numeric optimization is set up differently. Like `LogitBoost`, every weak learner fits a regression model to response values yn ∊ {–1,+1}. This makes `GentleBoost` another good candidate for binary classification of data with multilevel categorical predictors.

`fitensemble` computes and stores the mean-squared error in the `FitInfo` property of the ensemble object. The mean-squared error is

$\sum _{n=1}^{N}{d}_{n}^{\left(t\right)}{\left({\stackrel{˜}{y}}_{n}-{h}_{t}\left({x}_{n}\right)\right)}^{2},$

where

• ${d}_{n}^{\left(t\right)}$ are observation weights at step t (the weights add up to 1).

• ht(xn) are predictions of the regression model ht fitted to response values yn.

As the strength of individual learners weakens, the weighted mean-squared error approaches 1.

For examples using `GentleBoost`, see Example: Unequal Classification Costs and Classification with Many Categorical Levels.

For references related to `GentleBoost`, see Friedman, Hastie, and Tibshirani [18].

#### LogitBoost

`LogitBoost` is another popular algorithm for binary classification. `LogitBoost` works similarly to `AdaBoostM1`, except it minimizes binomial deviance

$\sum _{n=1}^{N}{w}_{n}\mathrm{log}\left(1+\mathrm{exp}\left(-2{y}_{n}f\left({x}_{n}\right)\right)\right),$

where

• yn ∊ {–1,+1} is the true class label.

• wn are observation weights normalized to add up to 1.

• f(xn) ∊ (–∞,+∞) is the predicted classification score.

Binomial deviance assigns less weight to badly misclassified observations (observations with large negative values of ynf(xn)). `LogitBoost` can give better average accuracy than `AdaBoostM1` for data with poorly separable classes.

Learner t in a `LogitBoost` ensemble fits a regression model to response values

${\stackrel{˜}{y}}_{n}=\frac{{y}_{n}^{*}-{p}_{t}\left({x}_{n}\right)}{{p}_{t}\left({x}_{n}\right)\left(1-{p}_{t}\left({x}_{n}\right)\right)},$

where

• y*n ∊ {0,+1} are relabeled classes (0 instead of –1).

• pt(xn) is the current ensemble estimate of the probability for observation xn to be of class 1.

Fitting a regression model at each boosting step turns into a great computational advantage for data with multilevel categorical predictors. Take a categorical predictor with L levels. To find the optimal decision split on such a predictor, classification tree needs to consider 2L–1 – 1 splits. A regression tree needs to consider only L – 1 splits, so the processing time can be much shorter. `LogitBoost` is recommended for categorical predictors with many levels.

`fitensemble` computes and stores the mean-squared error in the `FitInfo` property of the ensemble object. The mean-squared error is

$\sum _{n=1}^{N}{d}_{n}^{\left(t\right)}{\left({\stackrel{˜}{y}}_{n}-{h}_{t}\left({x}_{n}\right)\right)}^{2},$

where

• ${d}_{n}^{\left(t\right)}$ are observation weights at step t (the weights add up to 1).

• ht(xn) are predictions of the regression model ht fitted to response values ${\stackrel{˜}{y}}_{n}$.

Values yn can range from –∞ to +∞, so the mean-squared error does not have well-defined bounds.

For examples using `LogitBoost`, see Classification with Many Categorical Levels.

For references related to `LogitBoost`, see Friedman, Hastie, and Tibshirani [18].

#### LPBoost

`LPBoost` (linear programming boost), like `TotalBoost`, performs multiclass classification by attempting to maximize the minimal margin in the training set. This attempt uses optimization algorithms, namely linear programming for `LPBoost`. So you need an Optimization Toolbox license to use `LPBoost` or `TotalBoost`.

The margin of a classification is the difference between the predicted soft classification score for the true class, and the largest score for the false classes. For trees, the score of a classification of a leaf node is the posterior probability of the classification at that node. The posterior probability of the classification at a node is the number of training sequences that lead to that node with the classification, divided by the number of training sequences that lead to that node. For more information, see Definitions in `margin`.

Why maximize the minimal margin? For one thing, the generalization error (the error on new data) is the probability of obtaining a negative margin. Schapire and Singer [27] establish this inequality on the probability of obtaining a negative margin:

${P}_{\text{test}}\left(m\le 0\right)\le {P}_{\text{train}}\left(m\le \theta \right)+O\left(\frac{1}{\sqrt{N}}\sqrt{\frac{V{\mathrm{log}}^{2}\left(N/V\right)}{{\theta }^{2}}+\mathrm{log}\left(1/\delta \right)}\right).$

Here m is the margin, θ is any positive number, V is the Vapnik-Chervonenkis dimension of the classifier space, N is the size of the training set, and δ is a small positive number. The inequality holds with probability 1–δ over many i.i.d. training and test sets. This inequality says: To obtain a low generalization error, minimize the number of observations below margin θ in the training set.

`LPBoost` iteratively maximizes the minimal margin through a sequence of linear programming problems. Equivalently, by duality, `LPBoost` minimizes the maximal edge, where edge is the weighted mean margin (see Definitions). At each iteration, there are more constraints in the problem. So, for large problems, the optimization problem becomes increasingly constrained, and slow to solve.

`LPBoost` typically creates ensembles with many learners having weights that are orders of magnitude smaller than those of other learners. Therefore, to better enable you to remove the unimportant ensemble members, the `compact` method reorders the members of an `LPBoost` ensemble from largest weight to smallest. Therefore, you can easily remove the least important members of the ensemble using the `removeLearners` method.

For examples using `LPBoost`, see LPBoost and TotalBoost for Small Ensembles.

For references related to `LPBoost`, see Warmuth, Liao, and Ratsch [29].

#### LSBoost

`LSBoost` (least squares boosting) fits regression ensembles. At every step, the ensemble fits a new learner to the difference between the observed response and the aggregated prediction of all learners grown previously. The ensemble fits to minimize mean-squared error.

You can use `LSBoost` with shrinkage by passing in the `LearnRate` parameter. By default this parameter is set to `1`, and the ensemble learns at the maximal speed. If you set `LearnRate` to a value from `0` to `1`, the ensemble fits every new learner to yn – ηf(xn), where

• yn is the observed response.

• f(xn) is the aggregated prediction from all weak learners grown so far for observation xn.

• η is the learning rate.

For examples using `LSBoost`, see Train a Regression Ensemble and Regularize a Regression Ensemble.

For references related to `LSBoost`, see Hastie, Tibshirani, and Friedman [19], Chapters 7 (Model Assessment and Selection) and 15 (Random Forests).

#### RobustBoost

Boosting algorithms such as `AdaBoostM1` and `LogitBoost` increase weights for misclassified observations at every boosting step. These weights can become very large. If this happens, the boosting algorithm sometimes concentrates on a few misclassified observations and neglects the majority of training data. Consequently the average classification accuracy suffers.

In this situation, you can try using `RobustBoost`. This algorithm does not assign almost the entire data weight to badly misclassified observations. It can produce better average classification accuracy.

Unlike `AdaBoostM1` and `LogitBoost`, `RobustBoost` does not minimize a specific loss function. Instead, it maximizes the number of observations with the classification margin above a certain threshold.

`RobustBoost` trains based on time evolution. The algorithm starts at t = 0. At every step, `RobustBoost` solves an optimization problem to find a positive step in time Δt and a corresponding positive change in the average margin for training data Δm. `RobustBoost` stops training and exits if at least one of these three conditions is true:

• Time t reaches 1.

• `RobustBoost` cannot find a solution to the optimization problem with positive updates Δt and Δm.

• `RobustBoost` grows as many learners as you requested.

Results from `RobustBoost` can be usable for any termination condition. Estimate the classification accuracy by cross validation or by using an independent test set.

To get better classification accuracy from `RobustBoost`, you can adjust three parameters in `fitensemble`: `RobustErrorGoal`, `RobustMaxMargin`, and `RobustMarginSigma`. Start by varying values for `RobustErrorGoal` from 0 to 1. The maximal allowed value for `RobustErrorGoal` depends on the two other parameters. If you pass a value that is too high, `fitensemble` produces an error message showing the allowed range for `RobustErrorGoal`.

For examples using `RobustBoost`, see Tune RobustBoost.

For references related to `RobustBoost`, see Freund [15].

#### RUSBoost

`RUSBoost` is especially effective at classifying imbalanced data, meaning some class in the training data has many fewer members than another. RUS stands for Random Under Sampling. The algorithm takes N, the number of members in the class with the fewest members in the training data, as the basic unit for sampling. Classes with more members are under sampled by taking only N observations of every class. In other words, if there are K classes, then, for each weak learner in the ensemble, `RUSBoost` takes a subset of the data with N observations from each of the K classes. The boosting procedure follows the procedure in AdaBoostM2 for reweighting and constructing the ensemble.

When you construct a `RUSBoost` ensemble, there is an optional name-value pair called `RatioToSmallest`. Give a vector of K values, each value representing the multiple of N to sample for the associated class. For example, if the smallest class has N = 100 members, then `RatioToSmallest` = `[2,3,4]` means each weak learner has 200 members in class 1, 300 in class 2, and 400 in class 3. If `RatioToSmallest` leads to a value that is larger than the number of members in a particular class, then `RUSBoost` samples the members with replacement. Otherwise, `RUSBoost` samples the members without replacement.

For examples using `RUSBoost`, see Classification with Imbalanced Data.

For references related to `RUSBoost`, see Seiffert et al. [28].

#### Subspace

Use random subspace ensembles (`Subspace`) to improve the accuracy of discriminant analysis (`ClassificationDiscriminant`) or k-nearest neighbor (`ClassificationKNN`) classifiers. `Subspace` ensembles also have the advantage of using less memory than ensembles with all predictors, and can handle missing values (`NaN`s).

The basic random subspace algorithm uses these parameters.

• m is the number of dimensions (variables) to sample in each learner. Set m using the `NPredToSample` name-value pair.

• d is the number of dimensions in the data, which is the number of columns (predictors) in the data matrix `X`.

• n is the number of learners in the ensemble. Set n using the `NLearn` input.

The basic random subspace algorithm performs the following steps:

1. Choose without replacement a random set of m predictors from the d possible values.

2. Train a weak learner using just the m chosen predictors.

3. Repeat steps 1 and 2 until there are n weak learners.

4. Predict by taking an average of the `score` prediction of the weak learners, and classify the category with the highest average `score`.

You can choose to create a weak learner for every possible set of m predictors from the d dimensions. To do so, set n, the number of learners, to `'AllPredictorCombinations'`. In this case, there are `nchoosek(size(X,2),NPredToSample)` weak learners in the ensemble.

`fitensemble` downweights predictors after choosing them for a learner, so subsequent learners have a lower chance of using a predictor that was previously used. This weighting tends to make predictors more evenly distributed among learners than in uniform weighting.

For examples using `Subspace`, see Random Subspace Classification.

For references related to random subspace ensembles, see Ho [20].

#### TotalBoost

`TotalBoost`, like linear programming boost (`LPBoost`), performs multiclass classification by attempting to maximize the minimal margin in the training set. This attempt uses optimization algorithms, namely quadratic programming for `TotalBoost`. So you need an Optimization Toolbox license to use `LPBoost` or `TotalBoost`.

The margin of a classification is the difference between the predicted soft classification score for the true class, and the largest score for the false classes. For trees, the score of a classification of a leaf node is the posterior probability of the classification at that node. The posterior probability of the classification at a node is the number of training sequences that lead to that node with the classification, divided by the number of training sequences that lead to that node. For more information, see Definitions in `margin`.

Why maximize the minimal margin? For one thing, the generalization error (the error on new data) is the probability of obtaining a negative margin. Schapire and Singer [27] establish this inequality on the probability of obtaining a negative margin:

${P}_{\text{test}}\left(m\le 0\right)\le {P}_{\text{train}}\left(m\le \theta \right)+O\left(\frac{1}{\sqrt{N}}\sqrt{\frac{V{\mathrm{log}}^{2}\left(N/V\right)}{{\theta }^{2}}+\mathrm{log}\left(1/\delta \right)}\right).$

Here m is the margin, θ is any positive number, V is the Vapnik-Chervonenkis dimension of the classifier space, N is the size of the training set, and δ is a small positive number. The inequality holds with probability 1–δ over many i.i.d. training and test sets. This inequality says: To obtain a low generalization error, minimize the number of observations below margin θ in the training set.

`TotalBoost` minimizes a proxy of the Kullback-Leibler divergence between the current weight distribution and the initial weight distribution, subject to the constraint that the edge (the weighted margin) is below a certain value. The proxy is a quadratic expansion of the divergence:

$D\left(W,{W}_{0}\right)=\sum _{n=1}^{N}\mathrm{log}\frac{W\left(n\right)}{{W}_{0}\left(n\right)}\approx \sum _{n=1}^{N}\left(1+\frac{W\left(n\right)}{{W}_{0}\left(n\right)}\right)\Delta +\frac{1}{2W\left(n\right)}{\Delta }^{2},$

where Δ is the difference between W(n), the weights at the current and next iteration, and W0, the initial weight distribution, which is uniform. This optimization formulation keeps weights from becoming zero. At each iteration, there are more constraints in the problem. So, for large problems, the optimization problem becomes increasingly constrained, and slow to solve.

`TotalBoost` typically creates ensembles with many learners having weights that are orders of magnitude smaller than those of other learners. Therefore, to better enable you to remove the unimportant ensemble members, the `compact` method reorders the members of a `TotalBoost` ensemble from largest weight to smallest. Therefore you can easily remove the least important members of the ensemble using the `removeLearners` method.

For examples using `TotalBoost`, see LPBoost and TotalBoost for Small Ensembles.

For references related to `TotalBoost`, see Warmuth, Liao, and Ratsch [29].