Screen Risk Factors by Custom Criteria

This example shows how to use the Screen Risk Factors task to automatically exclude risk factors from a table based on their predictive power.

This example also shows how to set up the screening criteria.

Feature selection is an important step in the development of a statistical model. Input data can have hundreds or thousands of variables, and discarding some variables often improves model interpretability, training times, and other important attributes.

This example loads the Screen Risk Factors data set, which contains a table of customer information such as age, income, and employment status. It uses pre-defined metrics to assess each risk factor individually, analyzing the predictive power of each variable relative to a given binary response variable. The example then shows how to select variables automatically or semi-automatically using the Screen Risk Factors task, and how to customize the screening criteria used to assess the risk factors.

Load Data and Pre-defined Screening Criteria

Load the example data from ScreenRiskFactorsData.mat.

load ScreenRiskFactorsData.mat

Construct pre-defined screening criteria in your workspace. Use ExampleScreeningCriteria to generate the myCriteria object. This function returns a ScreeningCriteria object defined by an mrm.data.selection.TestSuiteFactory.

import mrm.data.selection.*
myCriteria = ExampleScreeningCriteria();

These criteria have been set up as follows:

  1. For each variable, the Information Value and Chi-squared p-value are calculated.

  2. These values are compared against certain thresholds that assign the metric a Pass, Fail or Undecided classification. In this case, the thresholds for the metrics are hard-coded but you can obtain the thresholds from the appropriate data in the development environment.

  3. The overall classification works on a 'worst-of' basis. If the status of either the Information Value or the Chi-squared p-value is Fail, the overall status is Fail. Otherwise, if either status is Undecided, the overall status is Undecided.
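The three steps above can be sketched in base MATLAB on toy data. This is a minimal illustration, not the Modelscape implementation: the data, the binning, and the thresholds (0.1 and 0.3 for Information Value, 0.05 and 0.10 for the p-value) are assumptions made for this sketch and are not the values used by ExampleScreeningCriteria.

```matlab
% Hypothetical toy data: a predictor already grouped into three bins,
% and a binary response (1 = default, 0 = no default).
bins     = [1 1 1 1 2 2 2 2 3 3 3 3]';
response = [0 0 0 1 0 0 1 1 0 1 1 1]';

% Step 1a: Information Value from per-bin counts of non-defaults and defaults.
goods = accumarray(bins, double(response == 0));
bads  = accumarray(bins, double(response == 1));
pGood = goods / sum(goods);
pBad  = bads  / sum(bads);
infoValue = sum((pGood - pBad) .* log(pGood ./ pBad));

% Step 1b: Chi-squared p-value from the bin-by-response contingency table.
observed = [goods bads];
expected = sum(observed, 2) * sum(observed, 1) / sum(observed(:));
chi2     = sum((observed(:) - expected(:)).^2 ./ expected(:));
dof      = (size(observed, 1) - 1) * (size(observed, 2) - 1);
pValue   = gammainc(chi2 / 2, dof / 2, 'upper');   % upper tail of chi-squared

% Step 2: compare against (assumed) thresholds; 1 = Pass, 2 = Undecided, 3 = Fail.
statusNames = ["Pass", "Undecided", "Fail"];
ivStatus = 1 + (infoValue < 0.3) + (infoValue < 0.1);   % higher IV is better
pStatus  = 1 + (pValue > 0.05) + (pValue > 0.10);       % lower p-value is better

% Step 3: worst-of rule, so the overall status is the more severe of the two.
overallStatus = statusNames(max(ivStatus, pStatus));
```

In this toy case the Information Value passes but the p-value fails because the sample is small, so the worst-of rule yields an overall Fail. A bin with zero defaults or zero non-defaults would make the logarithm infinite, so real data needs a guard for empty cells.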

The TestSuiteFactory sets the StatusInterpreter of the metrics handler to overallScreeningStatus. This is where the auto-generated exclusions and comments are set. For the exclusions, the function must assign to each MetricsHandler state an mrm.data.selection.ScreeningStatus object (or an ErrorTestStatus or NullStatus) to ensure that the Screen Risk Factors task automatically marks the variable for exclusion.

In addition, the percentage of missing entries is displayed. This value does not affect the overall rating.
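A percentage of missing entries per variable can be computed in base MATLAB as sketched below. The table and its variable names are hypothetical, and this is not necessarily how the task computes the figure it displays.

```matlab
% Hypothetical toy table with some missing entries.
toyData = table([35; NaN; 52; 41], ["employed"; "retired"; missing; missing], ...
    'VariableNames', {'age', 'employmentStatus'});

% ismissing returns a logical array with one column per table variable;
% the column means give the fraction of missing entries per variable.
pctMissing = 100 * mean(ismissing(toyData), 1);
```

Here pctMissing holds one value per variable, in the order of the table variables.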

Launch Screen Risk Factors

Open a new live script and launch the Screen Risk Factors task. This can be done in two ways:

1) Start typing 'Screen' and select the task from the drop-down menu.

ScreeningTabComplete.PNG

2) Search for Screen Risk Factors in the Live Task gallery.

The task opens in a reduced view until the required inputs are selected:

  • Input table must be a table or a timetable; the drop-down shows all such objects in the workspace. For this example, select data.

  • Response variable drop-down shows all the binary variables in the input table. For this example, select defaultIndicator.

  • Criteria should be the ScreeningCriteria object you wish to apply - in this case myCriteria.

Analyze and Remove Risk Factors

The task now expands.

ScreeningTaskInAction.PNG

The task calculates the screening metrics for each risk factor in the input table. The summary of the results is shown in the 'Analyze data variables' section. The table contains one row for each variable in the input table.

  • 'Status' shows the overall classification of the variable based on the screening metrics.

  • 'Exclude' shows whether the variable is to be removed from the data set.

  • 'Comment' contains the reasons for excluding the variable, or for leaving the variable included.

The live task auto-populates the 'Exclude' and 'Comment' columns based on the criteria. In this example, risk factors with a 'Fail' status are automatically excluded and those with a 'Pass' status are automatically included, each with an automatically generated comment. The 'Undecided' risk factors are left blank for you to analyze. You can overwrite these auto-completed values and sort the table by any of these columns.

The area underneath the table is specific to the risk-factor variable and displays the screening metrics, as well as a double histogram that demonstrates how well (or not) the variable discriminates between the two possible responses. To switch the view to another variable, click the variable name in the table.
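The idea behind such a double histogram can be sketched in base MATLAB by binning the predictor separately for each response value on common bin edges. The data, variable names, and bin edges below are hypothetical, and the plot produced by the task may be constructed differently.

```matlab
% Hypothetical toy data: one predictor and a binary response.
income           = [20 25 30 35 40 45 50 55 60 65]';
defaultIndicator = [1 1 1 0 1 0 0 0 0 0]';

% Use the same bin edges for both groups so the counts are comparable.
edges = 20:15:65;
countsNoDefault = histcounts(income(defaultIndicator == 0), edges);
countsDefault   = histcounts(income(defaultIndicator == 1), edges);

% Plot the two sets of counts side by side at the bin centers.
centers = edges(1:end-1) + diff(edges) / 2;
bar(centers, [countsNoDefault; countsDefault]');
legend('No default', 'Default');
xlabel('income');
ylabel('Count');
```

When the two groups concentrate in different bins, as they do here, the variable separates the two responses well. Heavily overlapping histograms suggest weak discriminatory power.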

Document with Modelscape Reporting

The live task dynamically produces two outputs:

  • filteredTable: This is a subtable of the input table without the excluded risk factors. Use this subtable in the next step of the model development process.

  • exclusionTable: This table includes all the data of the input table together with the exclusion flags and comments from the live task. The flags and comments are stored in the exclusionTable.Properties.CustomProperties metadata. To view this information, tick the 'Preview summary tables' box in the 'Display results' section.
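As a rough illustration of how per-variable metadata can live in a table's custom properties, the sketch below uses a toy table. The property names Exclude and Comment are assumptions made for this example. To see the names the task actually stores, display exclusionTable.Properties.CustomProperties in your own session.

```matlab
% Hypothetical toy table standing in for the input data.
toyTable = table([35; 52], [40; 75], 'VariableNames', {'age', 'income'});

% Attach per-variable custom properties (property names are assumed).
toyTable = addprop(toyTable, {'Exclude', 'Comment'}, {'variable', 'variable'});
toyTable.Properties.CustomProperties.Exclude = [false true];
toyTable.Properties.CustomProperties.Comment = ["Passes screening", "Low Information Value"];

% Read the flags back and derive a filtered table without excluded variables.
excluded = toyTable.Properties.VariableNames(toyTable.Properties.CustomProperties.Exclude);
filtered = removevars(toyTable, excluded);
```

The relationship between filtered and toyTable here mirrors the one between filteredTable and exclusionTable described above.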

ScreeningSummaryTablePreview.PNG

You can insert the above tables into model documentation using the Modelscape Reporting feature. To achieve this, create document holes with titles, say ExclusionSummary and ProgressSummary, in the Word document.

To create document holes in a Word document, view the Developer tab and click the 'Rich Text Content Control' symbol Aa in the Controls area. Then click 'Properties' and fill in the Title fields.

Next, create workspace variables named after the holes by summarizing the exclusion table:

import mrm.data.filter.*
[ExclusionSummary, ProgressSummary] = summarizeExclusionTable(exclusionTable)

After you have created the holes and the workspace variables, insert the variables into the model document using fillReportFromWorkspace.

For examples of creating document holes and for more details on the use of fillReportFromWorkspace, see Model Documentation in Modelscape.

Set Up Custom Criteria

To learn about the test metrics, thresholds, and handlers used by the screening criteria object, refer to the Test Metrics in Modelscape and Metrics Handlers examples.

You can customize the criteria used to screen variables in the Screen Risk Factors Live Task. The criteria must be in an mrm.data.selection.ScreeningCriteria object. For the class definition, run:

edit mrm.data.selection.ScreeningCriteria

This class is a holder for a handle to a function f, which is called as follows:

f(inputData, 'PredictorVar', varName, 'ResponseVar', respVar)

The call to f must be well-defined and produce an mrm.data.validation.MetricsHandler object for any table or timetable inputData, any predictor variable varName, and a given binary response variable respVar. TestSuiteFactory has this signature for the function call.
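The calling convention alone can be illustrated with a stand-in function handle. This sketch only shows how the name-value arguments are passed: the handle below returns a plain struct recording its inputs, whereas a real criteria function must return an mrm.data.validation.MetricsHandler object. The table and variable names are hypothetical.

```matlab
% Hypothetical toy input table.
toyData = table([35; 41; 52; 29], [0; 1; 0; 0], ...
    'VariableNames', {'age', 'defaultIndicator'});

% Stand-in for f: accepts the input data followed by name-value pairs and
% records them in a struct (NOT a MetricsHandler; for illustration only).
f = @(inputData, varargin) struct(varargin{:}, 'NumRows', height(inputData));

% Call f with the same argument pattern as shown above.
result = f(toyData, 'PredictorVar', 'age', 'ResponseVar', 'defaultIndicator');
```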

To see examples of these functions in the Modelscape package, run:

edit mrm.data.selection.ExampleScreeningCriteria;
edit mrm.data.selection.TestSuiteFactory;
edit mrm.data.selection.overallScreeningStatus;