Automated Feature Engineering for Classification

The gencfeatures function enables you to automate the feature engineering process in the context of a machine learning workflow. Before passing tabular training data to a classifier, you can create new features from the predictors in the data by using gencfeatures. Use the returned data to train the classifier.

Choose the TargetLearner value that matches the classifier you plan to train.

  • To generate features for an interpretable binary classifier, use the default TargetLearner value of "linear" in the call to gencfeatures. You can then use the returned data to train a binary linear classifier. For an example, see Interpret Linear Model with Generated Features.

  • To generate features that can lead to better model accuracy, specify TargetLearner="bag" or TargetLearner="gaussian-svm" in the call to gencfeatures. You can then use the returned data to train a bagged ensemble classifier or a binary support vector machine (SVM) classifier with a Gaussian kernel, respectively. For an example, see Generate New Features to Improve Bagged Ensemble Accuracy.
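
As a minimal sketch of the second option, the following code generates features aimed at a Gaussian-kernel SVM and then trains one with fitcsvm. It reuses the patients data set from the example later on this page. The feature count of 10 and the variable names svmT, svmTbl, and svmMdl are arbitrary choices for illustration.

```matlab
% Sketch: generate features for a Gaussian-kernel SVM, then train one.
load patients
Tbl = table(Age,Diastolic,Gender,Height,SelfAssessedHealthStatus, ...
    Systolic,Weight,Smoker);

rng("default") % For reproducibility
% The number of features (10) is an illustrative choice, not a recommendation.
[svmT,svmTbl] = gencfeatures(Tbl,"Smoker",10,TargetLearner="gaussian-svm");

% Train a binary SVM with a Gaussian kernel on the generated features.
svmMdl = fitcsvm(svmTbl,"Smoker",KernelFunction="gaussian");
```

The TargetLearner value only steers which transformations gencfeatures considers. Training the matching classifier remains a separate step.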

To better understand the generated features, use the describe function of the FeatureTransformer object. To apply the same training set feature transformations to a test or validation set, use the transform function of the FeatureTransformer object.

Interpret Linear Model with Generated Features

Use automated feature engineering to generate new features. Train a linear classifier using the generated features. Interpret the relationship between the generated features and the trained model.

Load the patients data set. Create a table from a subset of the variables. Display the first few rows of the table.

load patients
Tbl = table(Age,Diastolic,Gender,Height,SelfAssessedHealthStatus, ...
    Systolic,Weight,Smoker);
head(Tbl)
    Age    Diastolic      Gender      Height    SelfAssessedHealthStatus    Systolic    Weight    Smoker
    ___    _________    __________    ______    ________________________    ________    ______    ______

    38        93        {'Male'  }      71           {'Excellent'}            124        176      true  
    43        77        {'Male'  }      69           {'Fair'     }            109        163      false 
    38        83        {'Female'}      64           {'Good'     }            125        131      false 
    40        75        {'Female'}      67           {'Fair'     }            117        133      false 
    49        80        {'Female'}      64           {'Good'     }            122        119      false 
    46        70        {'Female'}      68           {'Good'     }            121        142      false 
    33        88        {'Female'}      64           {'Good'     }            130        142      true  
    40        82        {'Male'  }      68           {'Good'     }            115        180      false 

Generate 10 new features from the variables in Tbl. Specify the Smoker variable as the response. By default, gencfeatures assumes that the new features will be used to train a binary linear classifier.

rng("default") % For reproducibility
[T,NewTbl] = gencfeatures(Tbl,"Smoker",10)
T = 
  FeatureTransformer with properties:

                     Type: 'classification'
            TargetLearner: 'linear'
    NumEngineeredFeatures: 10
      NumOriginalFeatures: 0
         TotalNumFeatures: 10

NewTbl=100×11 table
    zsc(Systolic.^2)    eb8(Diastolic)    q8(Systolic)    eb8(Systolic)    q8(Diastolic)    zsc(kmd9)    zsc(sin(Age))    zsc(sin(Weight))    zsc(Height-Systolic)    zsc(kmc1)    Smoker
    ________________    ______________    ____________    _____________    _____________    _________    _____________    ________________    ____________________    _________    ______

         0.15379              8                6                4                8           -1.7207        0.50027            0.19202               0.40418            0.76177    true  
         -1.9421              2                1                1                2          -0.22056        -1.1319            -0.4009                2.3431             1.1617    false 
         0.30311              4                6                5                5           0.57695        0.50027             -1.037              -0.78898            -1.4456    false 
        -0.85785              2                2                2                2           0.83391         1.1495             1.3039               0.85162          -0.010294    false 
        -0.14125              3                5                4                4             1.779        -1.3083           -0.42387              -0.34154            0.99368    false 
        -0.28697              1                4                3                1           0.67326         1.3761           -0.72529               0.40418             1.3755    false 
          1.0677              6                8                6                6          -0.42521         1.5181           -0.72529               -1.5347            -1.4456    true  
         -1.1361              4                2                2                5          -0.79995         1.1495            -1.0225                1.2991             1.1617    false 
         -1.1361              3                2                2                3          -0.80136        0.46343             1.0806                1.2991             -1.208    false 
        -0.71693              5                3                3                6           0.37961       -0.51304            0.16741               0.55333            -1.4456    false 
         -1.2734              2                1                1                2            1.2572         1.3025             1.0978                1.4482          -0.010294    false 
         -1.1361              1                2                2                1             1.001        -1.2545            -1.2194                1.0008          -0.010294    false 
         0.60534              1                6                5                1          -0.98493       -0.11998             -1.211             -0.043252             -1.208    false 
          1.0677              8                8                6                8          -0.27307         1.4659             1.2168              -0.34154            0.24706    true  
         -1.2734              3                1                1                4           0.93395        -1.3633           -0.17603                1.0008          -0.010294    false 
          1.0677              7                8                6                8          -0.91396          -1.04            -1.2109              -0.49069            0.24706    true  
      ⋮

T is a FeatureTransformer object that can be used to transform new data, and NewTbl contains the new features generated from the Tbl data.
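
The following sketch shows how T might be applied to new observations. Because the patients data set has no separate holdout set, the first five rows of Tbl stand in for new data here. The names newRows and newFeatures are hypothetical.

```matlab
% Recreate the transformer from the example.
load patients
Tbl = table(Age,Diastolic,Gender,Height,SelfAssessedHealthStatus, ...
    Systolic,Weight,Smoker);
rng("default") % For reproducibility
[T,NewTbl] = gencfeatures(Tbl,"Smoker",10);

% Stand-in for new observations: reuse the first five rows of Tbl.
newRows = Tbl(1:5,:);

% Apply the transformations stored in T to the stand-in data.
newFeatures = transform(T,newRows)
```

In a real workflow, newRows would be a test or validation table with the same predictor variables as the training table.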

To better understand the generated features, use the describe object function of the FeatureTransformer object. For example, inspect the first two generated features.

describe(T,1:2)
                           Type        IsOriginal    InputVariables                            Transformations
                        ___________    __________    ______________    _______________________________________________________________

    zsc(Systolic.^2)    Numeric          false         Systolic        power(  ,2)
                                                                       Standardization with z-score (mean = 15119.54, std = 1667.5858)
    eb8(Diastolic)      Categorical      false         Diastolic       Equal-width binning (number of bins = 8)

The first feature in NewTbl is a numeric variable, created by first squaring the values of the Systolic variable and then converting the results to z-scores. The second feature in NewTbl is a categorical variable, created by binning the values of the Diastolic variable into 8 bins of equal width.
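
The z-score transformation can be reproduced by hand as a quick check. This sketch copies the Systolic values of the first two patients from the head(Tbl) output and the mean and standard deviation from the describe output. The variable names are illustrative.

```matlab
% Values copied from the head(Tbl) and describe(T,1:2) outputs.
systolic = [124; 109];   % Systolic for the first two patients
mu = 15119.54;           % Mean of Systolic.^2 reported by describe
sigma = 1667.5858;       % Standard deviation reported by describe

% Square, then standardize.
z = (systolic.^2 - mu) ./ sigma   % Approximately 0.1538 and -1.9421
```

These values agree with the first two entries of the zsc(Systolic.^2) column in the table of generated features.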

Use the generated features to fit a linear classifier without any regularization.

Mdl = fitclinear(NewTbl,"Smoker",Lambda=0);

Plot the coefficients of the predictors used to train Mdl. Note that fitclinear expands categorical predictors before fitting a model.

p = length(Mdl.Beta);
[sortedCoefs,expandedIndex] = sort(Mdl.Beta,ComparisonMethod="abs");
sortedExpandedPreds = Mdl.ExpandedPredictorNames(expandedIndex);
bar(sortedCoefs,Horizontal="on")
yticks(1:2:p)
yticklabels(sortedExpandedPreds(1:2:end))
xlabel("Coefficient")
ylabel("Expanded Predictors")
title("Coefficients for Expanded Predictors")

[Figure: horizontal bar chart titled "Coefficients for Expanded Predictors". x-axis: Coefficient; y-axis: Expanded Predictors.]
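
To see the effect of the categorical expansion, you can compare the number of predictors passed to fitclinear with the number of expanded predictors in the fitted model. This is a sketch that refits the model from the example. The names numPredictors and numExpanded are hypothetical.

```matlab
% Recreate the generated features and the linear model.
load patients
Tbl = table(Age,Diastolic,Gender,Height,SelfAssessedHealthStatus, ...
    Systolic,Weight,Smoker);
rng("default") % For reproducibility
[T,NewTbl] = gencfeatures(Tbl,"Smoker",10);
Mdl = fitclinear(NewTbl,"Smoker",Lambda=0);

% Predictors as they appear in NewTbl.
numPredictors = numel(Mdl.PredictorNames)

% Predictors after categorical variables are expanded into levels.
% Each expanded predictor has one coefficient in Mdl.Beta.
numExpanded = numel(Mdl.ExpandedPredictorNames)
```

The difference between the two counts comes from the binned categorical features, each of which contributes several level indicators such as eb8(Systolic) >= 5.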

Identify the predictors whose coefficients have larger absolute values.

bigCoefs = abs(sortedCoefs) >= 4;
flip(sortedExpandedPreds(bigCoefs))
ans = 1x7 cell
    {'zsc(Systolic.^2)'}    {'eb8(Systolic) >= 5'}    {'eb8(Diastolic) >= 3'}    {'q8(Diastolic) >= 3'}    {'q8(Systolic) >= 6'}    {'q8(Diastolic) >= 6'}    {'zsc(Height-Systolic)'}

You can use partial dependence plots to analyze the categorical features whose levels have large coefficients in terms of absolute value. For example, inspect the partial dependence plot for the q8(Diastolic) variable, whose levels q8(Diastolic) >= 3 and q8(Diastolic) >= 6 have coefficients with large absolute values. These two levels correspond to noticeable changes in the predicted scores.

plotPartialDependence(Mdl,"q8(Diastolic)",Mdl.ClassNames,NewTbl);

[Figure: partial dependence plot titled "Partial Dependence Plot" with two lines, one for each class (0 and 1). x-axis: q8(Diastolic); y-axis: Scores.]

Generate New Features to Improve Bagged Ensemble Accuracy

Use gencfeatures to engineer new features before training a bagged ensemble classifier. Before making predictions on new data, apply the same feature transformations to the new data set. Compare the test set performance of the ensemble that uses the engineered features to the test set performance of the ensemble that uses the original features.

Read the sample file CreditRating_Historical.dat into a table. The predictor data consists of financial ratios and industry sector information for a list of corporate customers. The response variable consists of credit ratings assigned by a rating agency. Preview the first few rows of the data set.

creditrating = readtable("CreditRating_Historical.dat");
head(creditrating)
     ID      WC_TA     RE_TA     EBIT_TA    MVE_BVTD    S_TA     Industry    Rating 
    _____    ______    ______    _______    ________    _____    ________    _______

    62394     0.013     0.104     0.036      0.447      0.142        3       {'BB' }
    48608     0.232     0.335     0.062      1.969      0.281        8       {'A'  }
    42444     0.311     0.367     0.074      1.935      0.366        1       {'A'  }
    48631     0.194     0.263     0.062      1.017      0.228        4       {'BBB'}
    43768     0.121     0.413     0.057      3.647      0.466       12       {'AAA'}
    39255    -0.117    -0.799      0.01      0.179      0.082        4       {'CCC'}
    62236     0.087     0.158     0.049      0.816      0.324        2       {'BBB'}
    39354     0.005     0.181     0.034      2.597      0.388        7       {'AA' }

Each value in the ID variable is a unique customer ID; that is, length(unique(creditrating.ID)) equals the number of observations in creditrating. An identifier that is unique to each observation carries no information that generalizes to new customers, so the ID variable is a poor predictor. Remove the ID variable from the table, and convert the Industry variable to a categorical variable.

creditrating = removevars(creditrating,"ID");
creditrating.Industry = categorical(creditrating.Industry);
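
The uniqueness of the ID values can be confirmed directly on the raw table, before the variable is removed. This sketch rereads the file so that it runs on its own. The names rawCredit, numUniqueIDs, and isIDUnique are hypothetical.

```matlab
% Read the raw data, which still contains the ID variable.
rawCredit = readtable("CreditRating_Historical.dat");

% Compare the number of distinct IDs with the number of rows.
numUniqueIDs = numel(unique(rawCredit.ID));
isIDUnique = (numUniqueIDs == height(rawCredit))
```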

Convert the Rating response variable to a categorical variable.

creditrating.Rating = categorical(creditrating.Rating, ...
    ["AAA","AA","A","BBB","BB","B","CCC"]);

Partition the data into training and test sets. Use approximately 75% of the observations as training data, and 25% of the observations as test data. Partition the data using cvpartition.

rng("default") % For reproducibility of the partition
c = cvpartition(creditrating.Rating,Holdout=0.25);
trainingIndices = training(c); % Indices for the training set
testIndices = test(c); % Indices for the test set
creditTrain = creditrating(trainingIndices,:);
creditTest = creditrating(testIndices,:);
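
As a sanity check on the partition, the following sketch reports the fraction of observations held out and compares the class proportions in the two subsets. When you pass a grouping variable, cvpartition stratifies the split by default, so the proportions should be similar. The names testFraction, trainProps, and testProps are hypothetical.

```matlab
% Recreate the response variable and the partition from the example.
creditrating = readtable("CreditRating_Historical.dat");
creditrating.Rating = categorical(creditrating.Rating, ...
    ["AAA","AA","A","BBB","BB","B","CCC"]);
rng("default") % For reproducibility of the partition
c = cvpartition(creditrating.Rating,Holdout=0.25);

% Fraction of observations assigned to the test set.
testFraction = c.TestSize / c.NumObservations

% Proportion of each rating in the training and test sets.
trainProps = countcats(creditrating.Rating(training(c))) / c.TrainSize;
testProps = countcats(creditrating.Rating(test(c))) / c.TestSize;
table(categories(creditrating.Rating),trainProps,testProps, ...
    VariableNames=["Rating","TrainProportion","TestProportion"])
```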

Use the training data to generate 40 features for fitting a bagged ensemble. By default, this total can include original features that a bagged ensemble can use as predictors. In this example, gencfeatures returns 34 engineered features and 6 original features.

[T,newCreditTrain] = gencfeatures(creditTrain,"Rating",40, ...
    TargetLearner="bag");
T
T = 
  FeatureTransformer with properties:

                     Type: 'classification'
            TargetLearner: 'bag'
    NumEngineeredFeatures: 34
      NumOriginalFeatures: 6
         TotalNumFeatures: 40

Create newCreditTest by applying the transformations stored in the object T to the test data.

newCreditTest = transform(T,creditTest);
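
Before training, it can be worth confirming that the transformed training and test tables line up. This sketch repeats the preceding steps so that it runs on its own, then compares the variable names of the two tables. The name sameVariables is hypothetical.

```matlab
% Recreate the cleaned data and the partition from the example.
creditrating = readtable("CreditRating_Historical.dat");
creditrating = removevars(creditrating,"ID");
creditrating.Industry = categorical(creditrating.Industry);
creditrating.Rating = categorical(creditrating.Rating, ...
    ["AAA","AA","A","BBB","BB","B","CCC"]);
rng("default") % For reproducibility of the partition
c = cvpartition(creditrating.Rating,Holdout=0.25);
creditTrain = creditrating(training(c),:);
creditTest = creditrating(test(c),:);

% Generate features on the training set only, then transform the test set.
[T,newCreditTrain] = gencfeatures(creditTrain,"Rating",40, ...
    TargetLearner="bag");
newCreditTest = transform(T,creditTest);

% Check that both tables contain the same variables in the same order.
sameVariables = isequal(newCreditTrain.Properties.VariableNames, ...
    newCreditTest.Properties.VariableNames)
```

Generating features from the training set alone keeps information from the test set out of the transformations.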

Compare the test set performances of a bagged ensemble trained on the original features and a bagged ensemble trained on the new features.

Train a bagged ensemble using the original training set creditTrain. Compute the accuracy of the model on the original test set creditTest. Visualize the results using a confusion matrix.

originalMdl = fitcensemble(creditTrain,"Rating",Method="Bag");
originalTestAccuracy = 1 - loss(originalMdl,creditTest, ...
    "Rating",LossFun="classiferror")
originalTestAccuracy = 
0.7542
predictedTestLabels = predict(originalMdl,creditTest);
confusionchart(creditTest.Rating,predictedTestLabels);

[Figure: confusion matrix chart for the ensemble trained on the original features.]

Train a bagged ensemble using the transformed training set newCreditTrain. Compute the accuracy of the model on the transformed test set newCreditTest. Visualize the results using a confusion matrix.

newMdl = fitcensemble(newCreditTrain,"Rating",Method="Bag");
newTestAccuracy = 1 - loss(newMdl,newCreditTest, ...
    "Rating",LossFun="classiferror")
newTestAccuracy = 
0.7461
newPredictedTestLabels = predict(newMdl,newCreditTest);
confusionchart(newCreditTest.Rating,newPredictedTestLabels)

[Figure: confusion matrix chart for the ensemble trained on the transformed features.]

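In this run, the two ensembles perform comparably on the test set. The ensemble trained on the original data reaches a test accuracy of 0.7542, and the ensemble trained on the transformed data reaches 0.7461, so the generated features do not improve accuracy for this partition. Results can vary with the partition and the random number seed.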
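
A single holdout split gives a noisy comparison between two models. One way to check whether the difference in test accuracy is statistically meaningful is the testcholdout function, which compares the predicted labels of two classifiers against the true labels. The following sketch repeats the full workflow so that it runs on its own. Whether this test suits your data is an assumption to verify.

```matlab
% Recreate the data, the partition, and the generated features.
creditrating = readtable("CreditRating_Historical.dat");
creditrating = removevars(creditrating,"ID");
creditrating.Industry = categorical(creditrating.Industry);
creditrating.Rating = categorical(creditrating.Rating, ...
    ["AAA","AA","A","BBB","BB","B","CCC"]);
rng("default") % For reproducibility
c = cvpartition(creditrating.Rating,Holdout=0.25);
creditTrain = creditrating(training(c),:);
creditTest = creditrating(test(c),:);
[T,newCreditTrain] = gencfeatures(creditTrain,"Rating",40, ...
    TargetLearner="bag");
newCreditTest = transform(T,creditTest);

% Train both ensembles and predict labels for the test set.
originalMdl = fitcensemble(creditTrain,"Rating",Method="Bag");
newMdl = fitcensemble(newCreditTrain,"Rating",Method="Bag");
predictedTestLabels = predict(originalMdl,creditTest);
newPredictedTestLabels = predict(newMdl,newCreditTest);

% h = 1 rejects the null hypothesis of equal accuracy at the 5% level.
% e1 and e2 are the test set misclassification rates of the two models.
[h,p,e1,e2] = testcholdout(predictedTestLabels,newPredictedTestLabels, ...
    creditTest.Rating)
```

If h is 0, the data does not provide enough evidence to conclude that the two ensembles differ in accuracy.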
