Screen Risk Factors by Custom Criteria
This example shows how to use the Screen Risk Factors task to automatically exclude risk factors from a table based on their predictive power.
This example also shows how to set up the screening criteria.
Feature selection is an important step in the development of a statistical model. Input data can have hundreds or thousands of variables, and discarding some variables often improves model interpretability, training times, and other important attributes.
This example loads the Screen Risk Factors data set, which contains a table of customer information such as age, income, and employment status. It uses pre-defined metrics to assess each risk factor individually, measuring the predictive power of each variable relative to a given binary response variable. It then shows how to select variables automatically or semi-automatically using the Screen Risk Factors task, and how to customize the screening criteria used to assess the risk factors.
Load Data and Pre-defined Screening Criteria
Load the example data from ScreenRiskFactorsData.mat.
load ScreenRiskFactorsData.mat
Construct pre-defined screening criteria in your workspace. Use ExampleScreeningCriteria to generate the myCriteria object. This function returns a ScreeningCriteria object defined by an mrm.data.selection.TestSuiteFactory.
import mrm.data.selection.*
myCriteria = ExampleScreeningCriteria();
These criteria have been set up as follows:
For each variable, the Information Value and Chi-squared p-value are calculated.
These values are compared against certain thresholds that assign the metric a Pass, Fail or Undecided classification. In this case, the thresholds for the metrics are hard-coded but you can obtain the thresholds from the appropriate data in the development environment.
The overall classification works on a 'worst-of' basis. If the status for either the Information Value or the Chi-squared p-value is Fail, the overall status is Fail. Otherwise, if either status is Undecided, the overall status is Undecided.
The TestSuiteFactory sets the StatusInterpreter of the metrics handler to overallScreeningStatus. This is where the auto-generated exclusions and comments are set. For the exclusions, the function must assign to each MetricsHandler state an mrm.data.selection.ScreeningStatus object (or an ErrorTestStatus or NullStatus) to ensure that the Screen Risk Factors task automatically marks the variable for exclusion.
In addition, the percentage of missing entries is displayed. This value does not affect the overall rating.
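The following is a minimal sketch of the logic described above, written in base MATLAB rather than with the Modelscape classes. The bin counts, the thresholds, and the numeric status encoding are all hypothetical and chosen for illustration only; ExampleScreeningCriteria may bin the data and set its thresholds differently.

```matlab
% Hypothetical counts for one binned risk factor (one row per bin).
% Column 1: observations with response 0, column 2: observations with response 1.
counts = [400 20;
          300 40;
          100 40];

% Information Value: sum over bins of (p0 - p1) * log(p0 / p1),
% where p0 and p1 are the bin shares within each response class.
p0 = counts(:,1) / sum(counts(:,1));
p1 = counts(:,2) / sum(counts(:,2));
infoValue = sum((p0 - p1) .* log(p0 ./ p1));

% Pearson chi-squared test of independence between bin and response.
expected = sum(counts, 2) * sum(counts, 1) / sum(counts, 'all');
chi2 = sum((counts - expected).^2 ./ expected, 'all');
dof = (size(counts, 1) - 1) * (size(counts, 2) - 1);
pValue = gammainc(chi2/2, dof/2, 'upper');  % upper tail of the chi-squared distribution

% Hypothetical thresholds. Statuses are encoded so that a larger number is worse:
% 1 = Pass, 2 = Undecided, 3 = Fail.
if infoValue >= 0.1
    ivStatus = 1;
elseif infoValue >= 0.02
    ivStatus = 2;
else
    ivStatus = 3;
end

if pValue <= 0.01
    pStatus = 1;
elseif pValue <= 0.05
    pStatus = 2;
else
    pStatus = 3;
end

% Worst-of rule: the overall status is the worse of the two metric statuses.
labels = ["Pass", "Undecided", "Fail"];
overallStatus = labels(max(ivStatus, pStatus));
disp(overallStatus)
```

Encoding the statuses as ordered numbers reduces the worst-of rule to a single call to max, which extends naturally if further metrics are added.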
Launch Screen Risk Factors
Open a new live script and launch the Screen Risk Factors task. This can be done in two ways:
1) Start typing 'Screen' and select the task from the drop-down menu
2) Search for Screen Risk Factors in the Live Task gallery
The task opens in a reduced view until the required inputs are selected:
Input table must be a table or a timetable; the drop-down shows all such objects in the workspace. For this example, select data.
Response variable drop-down shows all the binary variables in the input table. For this example, select defaultIndicator.
Criteria should be the ScreeningCriteria object you wish to apply, in this case myCriteria.
Analyze and Remove Risk Factors
The task now expands.
The task calculates the screening metrics for each risk factor in the input table. The summary of the results is shown in the 'Analyze data variables' section. The table contains one row for each variable in the input table.
'Status' shows the overall classification of the variable based on the screening metrics.
'Exclude' shows whether the variable is to be removed from the data set.
'Comment' contains the reasons for excluding the variable, or for leaving the variable included.
The live task auto-populates the 'Exclude' and 'Comment' columns based on the criteria. In this example, variables with a 'Fail' status are excluded and variables with a 'Pass' status are included, each with an automatically generated comment. The 'Undecided' risk factors are left blank for you to analyze. You can overwrite these auto-completed values and sort the table according to any of these columns.
The area underneath the table is specific to the risk-factor variable and displays the screening metrics, as well as a double histogram that demonstrates how well (or not) the variable discriminates between the two possible responses. To switch the view to another variable, click the variable name in the table.
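To illustrate what a double histogram conveys, the sketch below overlays the distribution of one predictor for each of the two response values, using synthetic data and base MATLAB plotting. The variable names and the data are hypothetical, and the plot produced by the live task may be styled differently.

```matlab
rng(0);                                      % reproducible synthetic data
response = rand(1000, 1) < 0.2;              % hypothetical binary response
predictor = randn(1000, 1) + 1.5*response;   % shifted when the response is 1

% Use shared bin edges so that the two histograms are directly comparable.
edges = linspace(min(predictor), max(predictor), 21);

histogram(predictor(~response), edges, 'Normalization', 'probability');
hold on
histogram(predictor(response), edges, 'Normalization', 'probability');
hold off
legend('response = 0', 'response = 1');
xlabel('predictor');
ylabel('Fraction of observations');
```

The less the two histograms overlap, the better the variable separates the two response classes.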
Document with Modelscape Reporting
The live task dynamically produces two outputs:
filteredTable: This is a subtable of the input table without the excluded risk factors. Use this subtable in the next step of the model development process.
exclusionTable: This table includes all the data of the input table together with the exclusion flags and comments in the Live Task. To view this information, tick the 'Preview summary tables' box in the 'Display results' section. This information is stored in the exclusionTable.Properties.CustomProperties metadata.
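As a rough illustration of what these two outputs amount to, the sketch below removes flagged variables from a small table and attaches per-variable flags and comments as custom table properties. The table contents and the property names Exclude and Comment are assumptions made for this sketch; the property names that the live task actually uses may differ.

```matlab
% Hypothetical input table with three risk factors and a response variable.
data = table([25; 40; 31], [30; 52; 47], ["a"; "b"; "a"], [0; 1; 0], ...
    'VariableNames', {'age', 'income', 'segment', 'defaultIndicator'});

% Hypothetical exclusion flags and comments, one entry per table variable.
excludeFlags = [false true false false];
comments = ["", "Low information value", "", ""];

% Analogue of filteredTable: the input table without the excluded variables.
filtered = removevars(data, data.Properties.VariableNames(excludeFlags));

% Analogue of exclusionTable: all of the data, with the flags and comments
% stored as per-variable custom properties.
annotated = addprop(data, {'Exclude', 'Comment'}, {'variable', 'variable'});
annotated.Properties.CustomProperties.Exclude = excludeFlags;
annotated.Properties.CustomProperties.Comment = comments;

disp(filtered.Properties.VariableNames)
disp(annotated.Properties.CustomProperties)
```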
You can insert the above tables into model documentation using the Modelscape Reporting feature. To achieve this, create document holes with titles, say ExclusionSummary and ProgressSummary, in the Word document.
To create document holes in a Word document, view the Developer tab, and click the 'Rich Text Content Control' symbol Aa in the Controls area. Then click 'Properties', and fill in the Title fields.
Then create workspace variables with matching names by summarizing exclusionTable:
import mrm.data.filter.*
[ExclusionSummary, ProgressSummary] = summarizeExclusionTable(exclusionTable)
After you have created holes, pick up the new variables from the MATLAB workspace and insert them into the model document using fillReportFromWorkspace.
For examples of creating document holes and for more details on the use of fillReportFromWorkspace, see Model Documentation in Modelscape.
Set Up Custom Criteria
To learn about the test metrics, thresholds, and handlers used by screening criteria objects, refer to the Test Metrics in Modelscape and Metrics Handlers examples.
You can customize the criteria used to screen variables in the Screen Risk Factors Live Task. The criteria must be in an mrm.data.selection.ScreeningCriteria object. For the class definition, run:
edit mrm.data.selection.ScreeningCriteria
This class holds a handle to a function f, which is called as follows:
f(inputData, 'PredictorVar', varName, 'ResponseVar', respVar)
The call to f must be well-defined and produce an mrm.data.validation.MetricsHandler object for any table or timetable inputData, any predictor variable varName, and a given binary response variable respVar. TestSuiteFactory has this signature for the function call.
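The sketch below illustrates only the calling convention, using a MATLAB arguments block to accept the PredictorVar and ResponseVar name-value arguments. The function name missingFractionSketch is hypothetical, and the function returns a plain struct as a stand-in: a real screening function must return an mrm.data.validation.MetricsHandler object, for which TestSuiteFactory is the reference.

```matlab
function result = missingFractionSketch(inputData, options)
% Hypothetical illustration of the calling convention only.
% A real screening function must return an mrm.data.validation.MetricsHandler;
% this sketch returns a plain struct instead.
%
% Example call, with data being a table that contains both variables:
%   f = @missingFractionSketch;
%   r = f(data, 'PredictorVar', 'age', 'ResponseVar', 'defaultIndicator');
    arguments
        inputData {mustBeA(inputData, ["table", "timetable"])}
        options.PredictorVar (1,1) string
        options.ResponseVar (1,1) string
    end

    % Look up the predictor column by name.
    predictor = inputData.(options.PredictorVar);

    % Record which variables were used and one simple metric for the predictor.
    result.Predictor = options.PredictorVar;
    result.Response = options.ResponseVar;
    result.MissingFraction = mean(ismissing(predictor));
end
```

Because the name-value arguments have no default values, this sketch raises an error if either one is omitted from the call.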
To see examples of these functions in the Modelscape package, run:
edit mrm.data.selection.ExampleScreeningCriteria
edit mrm.data.selection.TestSuiteFactory
edit mrm.data.selection.overallScreeningStatus