screenpredictors
This example shows how to perform predictor screening using screenpredictors. Predictor screening is a type of univariate analysis performed as an early step in the Credit Scorecard Modeling Workflow. Predictor screening is an important preprocessing step when you work with credit scorecards, as data sets can be prohibitively large and have dozens or hundreds of potential predictors.
The goal of screening predictors is to pare down the set of predictors to a subset that is more useful in predicting the response variable based on the calculated metrics. Screening enables you to select the top predictors as ranked by a given metric to train your credit scorecards.
The credit card data table contains a customer ID (CustID), nine predictors, and the response variable (status). Some of the risk factors are more useful in predicting the probability of a loan default, whereas others are less useful. The screening process helps you select the best subset of predictors.
Although the data set in this example contains only a few predictors, credit scorecard data sets in practice can be very large, and that is when predictor screening matters most.
% Load credit card data tables.
load CreditCardData

% Use the dataMissing data set, which contains some missing values.
data = dataMissing;

% Identify the ID and response variables.
idvar = 'CustID';
responsevar = 'status';

% Examine the structure of the table.
disp(head(data));
CustID  CustAge  TmAtAddress  ResStatus    EmpStatus  CustIncome  TmWBank  OtherCC  AMBalance  UtilRate  status
______  _______  ___________  ___________  _________  __________  _______  _______  _________  ________  ______
1       53       62           <undefined>  Unknown    50000       55       Yes      1055.9     0.22      0
2       61       22           Home Owner   Employed   52000       25       Yes      1161.6     0.24      0
3       47       30           Tenant       Employed   37000       61       No       877.23     0.29      0
4       NaN      75           Home Owner   Employed   53000       20       Yes      157.37     0.08      0
5       68       56           Home Owner   Employed   53000       14       Yes      561.84     0.11      0
6       65       13           Home Owner   Employed   48000       59       Yes      968.18     0.15      0
7       34       32           Home Owner   Unknown    32000       26       Yes      717.82     0.02      1
8       50       57           Other        Employed   51000       33       No       3041.2     0.13      0
Derived predictors can often capture additional information or produce better metric results. Examples include the ratio of two predictors, or a transformation of a predictor x such as x^2 or log(x). To demonstrate this, create a few derived predictors and add them to the data set.
data.BalanceUtilRatio = data.AMBalance ./ data.UtilRate;
data.BalanceIncomeRatio = data.AMBalance ./ data.CustIncome;
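The ratios above are one kind of derived predictor; the text also mentions transformations such as x^2 and log(x). The following is a minimal sketch of those two transformations on a toy table built from the first three rows displayed earlier. The table name toy and the variable names LogIncome and AgeSquared are hypothetical and are not part of the original example.

```matlab
% Toy table with values copied from the first three rows shown above
% (illustration only; the example itself uses the full data table).
toy = table([53; 61; 47], [50000; 52000; 37000], ...
    'VariableNames', {'CustAge', 'CustIncome'});

% Hypothetical derived predictors: a log transform and a square.
toy.LogIncome  = log(toy.CustIncome);   % natural logarithm
toy.AgeSquared = toy.CustAge .^ 2;      % element-wise square

disp(toy)
```

Applied to the real table, the same pattern would add the new columns to data before calling screenpredictors, so that they are ranked alongside the original predictors.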
Use screenpredictors to compute several measures of risk factor predictiveness. The columns of the output table contain the metric values for the predictors. The table is sorted by information value, from highest to lowest.
T = screenpredictors(data,'IDVar',idvar,'ResponseVar',responsevar)
T=11×7 table
InfoValue AccuracyRatio AUROC Entropy Gini Chi2PValue PercentMissing
_________ _____________ _______ _______ _______ __________ ______________
CustAge 0.17698 0.1672 0.5836 0.88795 0.42645 0.0020599 0.025
TmWBank 0.15719 0.13612 0.56806 0.89167 0.42864 0.0054591 0
CustIncome 0.15572 0.17758 0.58879 0.891 0.42731 0.0018428 0
BalanceIncomeRatio 0.097073 0.1278 0.5639 0.90024 0.43303 0.11966 0
TmAtAddress 0.094574 0.010421 0.50521 0.90089 0.43377 0.182 0
UtilRate 0.075086 0.035914 0.51796 0.90405 0.43575 0.45546 0
AMBalance 0.07159 0.087142 0.54357 0.90446 0.43592 0.48528 0
BalanceUtilRatio 0.068955 0.026538 0.51327 0.90486 0.43614 0.52517 0
EmpStatus 0.048038 0.10886 0.55443 0.90814 0.4381 0.00037823 0
OtherCC 0.014301 0.044459 0.52223 0.91347 0.44132 0.047616 0
ResStatus 0.0095558 0.049855 0.52493 0.91446 0.44198 0.29879 0.033333
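To make the InfoValue column more concrete, here is a minimal sketch of the standard information value calculation for a single binned predictor, using weight of evidence (WOE). The bin counts are made up for illustration, and the sketch does not reproduce how screenpredictors bins each predictor internally.

```matlab
% Hypothetical counts of "good" and "bad" customers in three bins of one
% predictor (made-up numbers, not taken from CreditCardData).
good = [200; 300; 300];
bad  = [100; 100;  50];

% Share of all goods and of all bads that falls in each bin.
pGood = good / sum(good);
pBad  = bad  / sum(bad);

% Weight of evidence per bin, then information value as the weighted sum.
woe = log(pGood ./ pBad);
iv  = sum((pGood - pBad) .* woe);

fprintf('Information value: %.4f\n', iv);
```

Each term (pGood - pBad) * woe is nonnegative, so a predictor whose bins separate goods from bads more strongly gets a larger information value.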
Set thresholds for the predictors based on several metrics. For each metric, choose a threshold value that defines the range of passing values. In the plot, green bars indicate predictors that pass the threshold, and red bars indicate predictors that do not. You can omit predictors that do not pass the threshold from the final data set.
First, select predictors based on their information value.
infovalueThresh = 0.08;
Visualize the thresholds on the metric values for each predictor using the local function thresholdPlot, defined at the end of this example.
thresholdPlot(T, infovalueThresh, 'InfoValue')
Select predictors based on their accuracy ratio.
arThresh = 0.08;
thresholdPlot(T, arThresh, 'AccuracyRatio')
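The accuracy ratio and AUROC are linked by the standard relationship AR = 2*AUROC - 1, so a threshold on one implies a threshold on the other. The sketch below checks this against three rows copied from the screenpredictors output above; the variable names are chosen only for this illustration.

```matlab
% AUROC and AccuracyRatio values copied from the output table above
% (rows CustAge, TmWBank, CustIncome).
auroc = [0.5836; 0.56806; 0.58879];
ar    = [0.1672; 0.13612; 0.17758];

% Accuracy ratio implied by AUROC.
arFromAuroc = 2*auroc - 1;

% Any difference is at the level of display rounding.
maxDiff = max(abs(arFromAuroc - ar));
fprintf('Largest difference: %.2g\n', maxDiff);
```

Under this relationship, an accuracy ratio threshold of 0.08 corresponds to an AUROC threshold of 0.54, so thresholding on both metrics would not add information.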
Summarize the thresholding results in table form. The last column indicates which of the predictors passed both of the threshold tests and can be included in the final data set to create the credit scorecard. summaryTable and displaySummaryTable are local functions, defined at the end of this example.
metrics = {'InfoValue', 'AccuracyRatio'};
thresholds = [infovalueThresh arThresh];
S = summaryTable(T, metrics, thresholds);
displaySummaryTable(S)
                    InfoValue  AccuracyRatio  PassedAll
                    _________  _____________  _________
CustAge             ✔          ✔              ✔
TmWBank             ✔          ✔              ✔
CustIncome          ✔          ✔              ✔
BalanceIncomeRatio  ✔          ✔              ✔
TmAtAddress         ✔          ✘              ✘
UtilRate            ✘          ✘              ✘
AMBalance           ✘          ✔              ✘
BalanceUtilRatio    ✘          ✘              ✘
EmpStatus           ✘          ✔              ✘
OtherCC             ✘          ✘              ✘
ResStatus           ✘          ✘              ✘
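The same pass/fail logic can be written directly with logical indexing, which also makes the direction of each comparison explicit. The sketch below uses four rows copied from the screenpredictors output and adds a Chi2PValue cutoff of 0.05; that cutoff, and the names Tsmall, passed, and selected, are hypothetical choices for illustration and are not used elsewhere in this example.

```matlab
% Four rows copied from the screenpredictors output above.
names         = {'CustAge'; 'TmAtAddress'; 'AMBalance'; 'EmpStatus'};
InfoValue     = [0.17698; 0.094574; 0.07159; 0.048038];
AccuracyRatio = [0.1672; 0.010421; 0.087142; 0.10886];
Chi2PValue    = [0.0020599; 0.182; 0.48528; 0.00037823];
Tsmall = table(InfoValue, AccuracyRatio, Chi2PValue, 'RowNames', names);

% Higher is better for InfoValue and AccuracyRatio; lower is better for
% Chi2PValue. The 0.05 cutoff is a hypothetical choice.
passed = Tsmall.InfoValue     >= 0.08 & ...
         Tsmall.AccuracyRatio >= 0.08 & ...
         Tsmall.Chi2PValue    <= 0.05;

% Names of the predictors that pass every test.
selected = Tsmall.Properties.RowNames(passed);
disp(selected)
```

Note that EmpStatus has the smallest chi-square p-value of the four rows but still fails on information value, which shows how the choice of metrics changes the selected subset.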
Create a reduced data set that contains only the predictors that pass both of the threshold tests. The credit scorecard you create using the reduced data set requires less memory.
% Get a list of all passing predictors.
predictor_list = T.Row;
top_predictors = predictor_list(S.PassedAll);

% Trim the data table to contain only the ID, passing predictors, and
% response.
top_predictor_table = data(:,[idvar; top_predictors; responsevar]);

% Create the credit scorecard using the screened predictors.
sc = creditscorecard(top_predictor_table,'IDVar',idvar,'ResponseVar',responsevar,...
    'BinMissingData', true)
sc =
  creditscorecard with properties:

                GoodLabel: 0
              ResponseVar: 'status'
               WeightsVar: ''
                 VarNames: {1x6 cell}
        NumericPredictors: {1x4 cell}
    CategoricalPredictors: {1x0 cell}
           BinMissingData: 1
                    IDVar: 'CustID'
            PredictorVars: {1x4 cell}
                     Data: [1200x6 table]
function passed = thresholdPredictor(T, threshold, metric)
% Threshold a predictor and return a logical vector to indicate passing
% predictors.

% Check which predictors pass the threshold.
switch metric
    case {'InfoValue', 'AccuracyRatio', 'AUROC'}
        passed = T.(metric) >= threshold;
    case {'Entropy', 'Gini', 'Chi2PValue', 'PercentMissing'}
        passed = T.(metric) <= threshold;
end
end

function thresholdPlot(T, threshold, metric)
% Plot bar charts to summarize predictor selection based on metrics thresholds.

% Threshold the predictors.
passed = thresholdPredictor(T, threshold, metric);

% Get all predictors.
predictorNames = T.Row;
nPredictors = length(predictorNames);

% Create the bar charts.
f = figure;
ax = axes('parent',f);
bAR = bar(ax, 1:nPredictors, T.(metric), 'FaceColor', 'flat');
bAR.CData(passed,:) = repmat([0,1,0],sum(passed),1);
bAR.CData(~passed, :) = repmat([1,0,0],sum(~passed),1);
ax.TickLabelInterpreter = 'none';
xticks(ax, 1:nPredictors)
xticklabels(ax, predictorNames)
xtickangle(ax, 45)

% Scale the YLim.
delta = max(T.(metric)) - min(T.(metric));
d10 = 0.1 * delta;
ylim = [min(T.(metric)) - d10 max(T.(metric)) + d10];
set(ax,'YLim',ylim);

% Add threshold lines.
hold on
plot(xlim, [threshold threshold],'k--');
xlabel('Predictor')
ylabel(metric)
title(sprintf('Predictor Performance by %s',metric));
hold off
end

function S = summaryTable(T, metrics, thresholds)
% Create table summarizing all thresholds.
S = T;

% Remove metrics that are not thresholded.
unthresholded = setdiff(S.Properties.VariableNames, metrics);
S(:,unthresholded) = [];

% Show thresholding summary.
passed_all = true(numel(T.Row),1);
for i = 1:numel(metrics)
    metrici = metrics{i};
    thresholdi = thresholds(i);
    passed = thresholdPredictor(T, thresholdi, metrici);
    S.(metrici) = passed;
    passed_all = passed_all & passed;
end

% Add summary column.
S.PassedAll = passed_all;
end

function displaySummaryTable(S)
% Display a summary table with check marks for passed thresholds.
cols = S.Properties.VariableNames;

% Convert each column to check marks and X marks.
for i = 1:numel(cols)
    coli = cols{i};
    charvec = repmat(char(10008),size(S,1),1); % Initialize as 'X'.
    charvec(S.(coli)) = char(10004); % Check if it passes the threshold.
    S.(coli) = charvec;
end
disp(S);
end
See also: autobinning | bindata | bininfo | creditscorecard | displaypoints | fitmodel | formatpoints | modifybins | modifypredictor | plotbins | predictorinfo | probdefault | score | screenpredictors | setmodel | validatemodel