This example shows how to handle missing values when working with creditscorecard objects. First, the example shows how to use the creditscorecard functionality to create an explicit bin for missing data with corresponding points. Then, the example shows how to "treat" the missing data to get a final scorecard with no explicit bins for missing values. To develop a scorecard without explicit bins for missing data, you must treat the training data, and you must apply the same treatment to any new data set before scoring it.
The data used to create a creditscorecard object may contain missing values. When you use creditscorecard to create the object, you can set the name-value argument 'BinMissingData' to true. In this case, the missing data for numeric predictors (NaN values) and for categorical predictors (<undefined> values) is binned in a separate bin labeled <missing> that appears at the end of the bins. Predictors with no missing values in the training data have no <missing> bin. If you do not specify the 'BinMissingData' argument or if you set 'BinMissingData' to false, the creditscorecard function discards missing observations when computing frequencies of Good and Bad, and neither the bininfo nor the plotbins function reports such observations.
The <missing> bin remains in place throughout the scorecard modeling process. The final scorecard explicitly indicates the points to be assigned to missing values for predictors that have a <missing> bin. These points are determined from the Weight-of-Evidence (WOE) of the <missing> bin and the predictor's coefficient in the logistic model. For predictors without an explicit <missing> bin, you can assign points to missing values by using the 'Missing' argument of formatpoints.
The dataMissing table in the CreditCardData.mat file has two predictors, CustAge and ResStatus, with missing values.
load CreditCardData.mat
head(dataMissing,5)
ans=5×11 table
CustID CustAge TmAtAddress ResStatus EmpStatus CustIncome TmWBank OtherCC AMBalance UtilRate status
______ _______ ___________ ___________ _________ __________ _______ _______ _________ ________ ______
1 53 62 <undefined> Unknown 50000 55 Yes 1055.9 0.22 0
2 61 22 Home Owner Employed 52000 25 Yes 1161.6 0.24 0
3 47 30 Tenant Employed 37000 61 No 877.23 0.29 0
4 NaN 75 Home Owner Employed 53000 20 Yes 157.37 0.08 0
5 68 56 Home Owner Employed 53000 14 Yes 561.84 0.11 0
Create a creditscorecard object from the dataMissing table, which contains missing values. Set the 'BinMissingData' argument to true, and then apply automatic binning.
sc = creditscorecard(dataMissing,'IDVar','CustID','BinMissingData',true);
sc = autobinning(sc);
The bin information and bin plots for the predictors that have missing data both show a <missing> bin at the end.
bi = bininfo(sc,'CustAge');
disp(bi)
Bin            Good   Bad   Odds     WOE        InfoValue
_____________  ____   ___   ______   ________   __________
{'[-Inf,33)'}    69    52   1.3269   -0.42156   0.018993
{'[33,37)'  }    63    45   1.4      -0.36795   0.012839
{'[37,40)'  }    72    47   1.5319   -0.2779    0.0079824
{'[40,46)'  }   172    89   1.9326   -0.04556   0.0004549
{'[46,48)'  }    59    25   2.36     0.15424    0.0016199
{'[48,51)'  }    99    41   2.4146   0.17713    0.0035449
{'[51,58)'  }   157    62   2.5323   0.22469    0.0088407
{'[58,Inf]' }    93    25   3.72     0.60931    0.032198
{'<missing>'}    19    11   1.7273   -0.15787   0.00063885
{'Totals'   }   803   397   2.0227   NaN        0.087112
plotbins(sc,'CustAge')
bi = bininfo(sc,'ResStatus');
disp(bi)
Bin             Good   Bad   Odds     WOE         InfoValue
______________  ____   ___   ______   _________   __________
{'Tenant'    }   296   161   1.8385   -0.095463   0.0035249
{'Home Owner'}   352   171   2.0585   0.017549    0.00013382
{'Other'     }   128    52   2.4615   0.19637     0.0055808
{'<missing>' }    27    13   2.0769   0.026469    2.3248e-05
{'Totals'    }   803   397   2.0227   NaN         0.0092627
plotbins(sc,'ResStatus')
The training data for the 'CustAge' and 'ResStatus' predictors has missing data (NaNs and <undefined> values). The binning process estimates WOE values of -0.15787 and 0.026469, respectively, for the missing data in these predictors.
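As a check on these numbers, the WOE and information value of each <missing> bin can be reproduced by hand from the Good and Bad counts in the bin tables. The following is a minimal sketch in base MATLAB. It assumes the usual definitions, WOE = log((Good/TotalGood)/(Bad/TotalBad)) and InfoValue = (Good/TotalGood - Bad/TotalBad)*WOE, and the variable names are illustrative only.

```matlab
% Counts copied from the bininfo tables above (illustrative variable names).
totalGood = 803;
totalBad  = 397;
good = [19 27];   % <missing> bin counts: CustAge, ResStatus
bad  = [11 13];

% Assumed definitions of WOE and information value for a single bin.
distGood = good ./ totalGood;
distBad  = bad  ./ totalBad;
woe = log(distGood ./ distBad);
iv  = (distGood - distBad) .* woe;

fprintf('WOE:       %.5f  %.6f\n', woe);
fprintf('InfoValue: %.5g  %.5g\n', iv);
```

Under these assumptions, the computed values agree with the WOE and InfoValue columns reported by bininfo for the two <missing> bins, up to rounding.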
The EmpStatus and CustIncome predictors have no explicit <missing> bin because the training data has no missing values for these predictors.
bi = bininfo(sc,'EmpStatus');
disp(bi)
Bin           Good   Bad   Odds     WOE        InfoValue
____________  ____   ___   ______   ________   _________
{'Unknown' }   396   239   1.6569   -0.19947   0.021715
{'Employed'}   407   158   2.5759   0.2418     0.026323
{'Totals'  }   803   397   2.0227   NaN        0.048038
bi = bininfo(sc,'CustIncome');
disp(bi)
Bin                Good   Bad   Odds      WOE         InfoValue
_________________  ____   ___   _______   _________   __________
{'[-Inf,29000)' }    53    58   0.91379   -0.79457    0.06364
{'[29000,33000)'}    74    49   1.5102    -0.29217    0.0091366
{'[33000,35000)'}    68    36   1.8889    -0.06843    0.00041042
{'[35000,40000)'}   193    98   1.9694    -0.026696   0.00017359
{'[40000,42000)'}    68    34   2         -0.011271   1.0819e-05
{'[42000,47000)'}   164    66   2.4848    0.20579     0.0078175
{'[47000,Inf]'  }   183    56   3.2679    0.47972     0.041657
{'Totals'       }   803   397   2.0227    NaN         0.12285
Use fitmodel to fit a logistic regression model using Weight of Evidence (WOE) values. fitmodel internally transforms all the predictor variables into WOE values, using the bins found during the automatic binning process. By default, fitmodel then fits a logistic regression model using a stepwise method. For predictors that have missing data, there is an explicit <missing> bin with a corresponding WOE value computed from the data. When using fitmodel, the corresponding WOE value for the <missing> bin is applied when performing the WOE transformation.
[sc,mdl] = fitmodel(sc);
1. Adding CustIncome, Deviance = 1490.8527, Chi2Stat = 32.588614, PValue = 1.1387992e-08
2. Adding TmWBank, Deviance = 1467.1415, Chi2Stat = 23.711203, PValue = 1.1192909e-06
3. Adding AMBalance, Deviance = 1455.5715, Chi2Stat = 11.569967, PValue = 0.00067025601
4. Adding EmpStatus, Deviance = 1447.3451, Chi2Stat = 8.2264038, PValue = 0.0041285257
5. Adding CustAge, Deviance = 1442.8477, Chi2Stat = 4.4974731, PValue = 0.033944979
6. Adding ResStatus, Deviance = 1438.9783, Chi2Stat = 3.86941, PValue = 0.049173805
7. Adding OtherCC, Deviance = 1434.9751, Chi2Stat = 4.0031966, PValue = 0.045414057
Generalized linear regression model:
    status ~ [Linear formula with 8 terms in 7 predictors]
    Distribution = Binomial
Estimated Coefficients:
               Estimate   SE         tStat    pValue
               ________   ________   ______   __________
(Intercept)    0.70229    0.063959   10.98    4.7498e-28
CustAge        0.57421    0.25708    2.2335   0.025513
ResStatus      1.3629     0.66952    2.0356   0.04179
EmpStatus      0.88373    0.2929     3.0172   0.002551
CustIncome     0.73535    0.2159     3.406    0.00065929
TmWBank        1.1065     0.23267    4.7556   1.9783e-06
OtherCC        1.0648     0.52826    2.0156   0.043841
AMBalance      1.0446     0.32197    3.2443   0.0011775
1200 observations, 1192 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 88.5, p-value = 2.55e-16
Scale the scorecard points by the points-to-double-the-odds (PDO) method using the 'PointsOddsAndPDO' argument of formatpoints. Suppose that you want a score of 500 points to have odds of 2 (twice as likely to be good as to be bad) and that the odds double every 50 points (so that 550 points would have odds of 4).
Display the scorecard showing the scaled points for predictors retained in the fitting model.
sc = formatpoints(sc,'PointsOddsAndPDO',[500 2 50]);
PointsInfo = displaypoints(sc)
PointsInfo=38×3 table
Predictors Bin Points
_____________ ______________ ______
{'CustAge' } {'[-Inf,33)' } 54.062
{'CustAge' } {'[33,37)' } 56.282
{'CustAge' } {'[37,40)' } 60.012
{'CustAge' } {'[40,46)' } 69.636
{'CustAge' } {'[46,48)' } 77.912
{'CustAge' } {'[48,51)' } 78.86
{'CustAge' } {'[51,58)' } 80.83
{'CustAge' } {'[58,Inf]' } 96.76
{'CustAge' } {'<missing>' } 64.984
{'ResStatus'} {'Tenant' } 62.138
{'ResStatus'} {'Home Owner'} 73.248
{'ResStatus'} {'Other' } 90.828
{'ResStatus'} {'<missing>' } 74.125
{'EmpStatus'} {'Unknown' } 58.807
{'EmpStatus'} {'Employed' } 86.937
{'EmpStatus'} {'<missing>' } NaN
⋮
Notice that points for the <missing> bin for CustAge and ResStatus are explicitly shown (as 64.9836 and 74.1250, respectively). These points are computed from the WOE value for the <missing> bin and the logistic model coefficients.
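To make this computation concrete, the following sketch reproduces the <missing> points by hand. It assumes the standard PDO scaling, Slope = PDO/log(2) and Shift = TargetPoints - Slope*log(TargetOdds), and it assumes that the intercept contribution is split evenly across the seven predictors in the model. These assumptions are not stated in this example, although they do reproduce the displayed values. The variable names are illustrative.

```matlab
% Scaling parameters from 'PointsOddsAndPDO',[500 2 50].
targetPoints = 500;
targetOdds   = 2;
pdo          = 50;
slope = pdo / log(2);
shift = targetPoints - slope * log(targetOdds);

% Coefficients and WOE values copied from the outputs above.
intercept     = 0.70229;
numPredictors = 7;
coef          = [0.57421 1.3629];      % CustAge, ResStatus
woeMissing    = [-0.15787 0.026469];   % WOE of each <missing> bin

% Assumed form: an equal share of the scaled intercept per predictor,
% plus the scaled product of coefficient and WOE.
pointsMissing = (shift + slope * intercept) / numPredictors ...
    + slope .* coef .* woeMissing;
fprintf('%.3f\n', pointsMissing);
```

With these inputs, the sketch yields values that match the <missing> rows for CustAge and ResStatus in the displayed scorecard to within rounding of the copied coefficients.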
Predictors that have no missing data in the training set have no explicit <missing> bin. By default, the points are set to NaN for missing data and they lead to a score of NaN when running score. For predictors that have no explicit <missing> bin, use the name-value argument 'Missing' in formatpoints to indicate how missing data should be treated for scoring purposes.
The scorecard is ready for scoring new data sets. You can also use the scorecard to compute probabilities of default or perform model validation. For details, see score, probdefault, and validatemodel. To further explore the handling of missing data, take a few rows from the original data as test data and introduce some missing data.
tdata = dataMissing(11:14,mdl.PredictorNames); % Keep only the predictors retained in the model
% Set some missing values
tdata.CustAge(1) = NaN;
tdata.ResStatus(2) = '<undefined>';
tdata.EmpStatus(3) = '<undefined>';
tdata.CustIncome(4) = NaN;
disp(tdata)
CustAge   ResStatus     EmpStatus     CustIncome   TmWBank   OtherCC   AMBalance
_______   ___________   ___________   __________   _______   _______   _________
NaN       Tenant        Unknown       34000        44        Yes       119.8
48        <undefined>   Unknown       44000        14        Yes       403.62
65        Home Owner    <undefined>   48000        6         No        111.88
44        Other         Unknown       NaN          35        No        436.41
Score the new data and compare how points are assigned to the missing values. CustAge and ResStatus have an explicit <missing> bin, so their missing values receive the points of that bin. EmpStatus and CustIncome have no <missing> bin, so the score function sets their points to NaN, which in turn makes the total score NaN.
[Scores,Points] = score(sc,tdata);
disp(Scores)
481.2231
520.8353
NaN
NaN
disp(Points)
CustAge   ResStatus   EmpStatus   CustIncome   TmWBank   OtherCC   AMBalance
_______   _________   _________   __________   _______   _______   _________
64.984    62.138      58.807      67.893       61.858    75.622    89.922
78.86     74.125      58.807      82.439       61.061    75.622    89.922
96.76     73.248      NaN         96.969       51.132    50.914    89.922
69.636    90.828      58.807      NaN          61.858    50.914    89.922
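The relationship between the two outputs can be checked directly: each score is the sum of that row's points, so a single NaN in a row produces a NaN score. The following sketch uses only base MATLAB and the rounded point values copied from the display above. The names P and rowScores are illustrative.

```matlab
% Points copied from the display above; rows are the four test observations.
P = [64.984 62.138 58.807 67.893 61.858 75.622 89.922
     78.86  74.125 58.807 82.439 61.061 75.622 89.922
     96.76  73.248 NaN    96.969 51.132 50.914 89.922
     69.636 90.828 58.807 NaN    61.858 50.914 89.922];

% A plain row sum propagates NaN, which mirrors the NaN scores
% reported for the third and fourth observations.
rowScores = sum(P, 2);
disp(rowScores)
```

The first two sums agree with the reported scores up to rounding of the displayed points, and the last two are NaN.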
Use the name-value argument 'Missing' in formatpoints to choose how to assign points to missing values for predictors that do not have an explicit <missing> bin. For this example, use the 'MinPoints' option for the 'Missing' argument. For EmpStatus and CustIncome, the minimum numbers of points in the scorecard are 58.8072 and 29.3753, respectively.
sc = formatpoints(sc,'Missing','MinPoints');
[Scores,Points] = score(sc,tdata);
disp(Scores)
481.2231
520.8353
517.7532
451.3405
disp(Points)
CustAge   ResStatus   EmpStatus   CustIncome   TmWBank   OtherCC   AMBalance
_______   _________   _________   __________   _______   _______   _________
64.984    62.138      58.807      67.893       61.858    75.622    89.922
78.86     74.125      58.807      82.439       61.061    75.622    89.922
96.76     73.248      58.807      96.969       51.132    50.914    89.922
69.636    90.828      58.807      29.375       61.858    50.914    89.922
You can use one of two alternative workflows to develop a scorecard without explicit bins for missing data.
The first alternative is to discard the missing data during the analysis. If the creditscorecard is created with the 'BinMissingData' argument set to false (the default if the argument is not specified), the missing observations are discarded when computing frequencies of Good and Bad and are not reported by bininfo or plotbins. For the fitting of the logistic model, rows with missing values are also discarded. With this approach, the missing data indirectly influences the results because the total number of observations used to compute bin statistics such as Weight-of-Evidence (WOE), or the total number of rows used to fit a logistic model, is reduced by the number of missing observations. For more information on this workflow, see Credit Scorecard Modeling Workflow.
The second alternative is to first gather information about the missing values, then treat or replace the missing values so that the training data has no missing observations, and then create a creditscorecard object with the treated data set. This approach modifies the training data, allowing the reporting of missing observations in the bin counts and the inclusion of missing observations for fitting the logistic model. However, in this approach, the treatment of the training data and the treatment of any new data set that requires scoring must be the same.
The following example explains the second alternative workflow, which gathers missing data, treats the training data, develops a new creditscorecard, and treats new data before scoring.
The dataMissing table in the CreditCardData.mat file has two predictors, CustAge and ResStatus, with missing values.
load CreditCardData.mat
head(dataMissing,5)
ans=5×11 table
CustID CustAge TmAtAddress ResStatus EmpStatus CustIncome TmWBank OtherCC AMBalance UtilRate status
______ _______ ___________ ___________ _________ __________ _______ _______ _________ ________ ______
1 53 62 <undefined> Unknown 50000 55 Yes 1055.9 0.22 0
2 61 22 Home Owner Employed 52000 25 Yes 1161.6 0.24 0
3 47 30 Tenant Employed 37000 61 No 877.23 0.29 0
4 NaN 75 Home Owner Employed 53000 20 Yes 157.37 0.08 0
5 68 56 Home Owner Employed 53000 14 Yes 561.84 0.11 0
First, use the untreated training data to analyze the missing data information.
Create a creditscorecard object from the dataMissing table, which contains missing values, and set the 'BinMissingData' argument to true to explicitly report information on missing values. Apply automatic binning.
sc = creditscorecard(dataMissing,'IDVar','CustID','BinMissingData',true);
sc = autobinning(sc);
The bin information and bin plots for predictors that have missing data both show a <missing> bin at the end. The two predictors with missing values in this data set are CustAge and ResStatus.
bi = bininfo(sc,'CustAge');
disp(bi)
Bin            Good   Bad   Odds     WOE        InfoValue
_____________  ____   ___   ______   ________   __________
{'[-Inf,33)'}    69    52   1.3269   -0.42156   0.018993
{'[33,37)'  }    63    45   1.4      -0.36795   0.012839
{'[37,40)'  }    72    47   1.5319   -0.2779    0.0079824
{'[40,46)'  }   172    89   1.9326   -0.04556   0.0004549
{'[46,48)'  }    59    25   2.36     0.15424    0.0016199
{'[48,51)'  }    99    41   2.4146   0.17713    0.0035449
{'[51,58)'  }   157    62   2.5323   0.22469    0.0088407
{'[58,Inf]' }    93    25   3.72     0.60931    0.032198
{'<missing>'}    19    11   1.7273   -0.15787   0.00063885
{'Totals'   }   803   397   2.0227   NaN        0.087112
plotbins(sc,'CustAge')
bi = bininfo(sc,'ResStatus');
disp(bi)
Bin             Good   Bad   Odds     WOE         InfoValue
______________  ____   ___   ______   _________   __________
{'Tenant'    }   296   161   1.8385   -0.095463   0.0035249
{'Home Owner'}   352   171   2.0585   0.017549    0.00013382
{'Other'     }   128    52   2.4615   0.19637     0.0055808
{'<missing>' }    27    13   2.0769   0.026469    2.3248e-05
{'Totals'    }   803   397   2.0227   NaN         0.0092627
plotbins(sc,'ResStatus')
To treat missing values, you can apply different criteria. This example follows a straightforward approach: replace missing observations with the most common value in the data distribution, that is, the mode of the data. For this example, the mode happens to have a WOE value similar to that of the original <missing> bin. This similarity is favorable because similar WOE values mean similar points in a scorecard.
For CustAge, bin 4 is the bin with the most observations and the mode value of the original data is 43.
modeCustAge = mode(dataMissing.CustAge);
disp(modeCustAge)
43
The WOE value of the <missing> bin (-0.15787) is similar to the WOE value of bin 4 (-0.04556). Therefore, replacing the missing values in CustAge with the value of mode is reasonable.
To treat the data, create a copy of the data and fill the missing values.
dataTreated = dataMissing;
dataTreated.CustAge = fillmissing(dataTreated.CustAge,'constant',modeCustAge);
For ResStatus, the value of 'Home Owner' is the value of the mode of the data, and the WOE value of the <missing> bin is closest to that of the 'Home Owner' bin.
modeResStatus = mode(dataMissing.ResStatus);
disp(modeResStatus)
Home Owner
Replace the missing data with 'Home Owner'. Replacing the missing values preserves both the observed WOE values and the typical characteristics observed in the data set.
dataTreated.ResStatus = fillmissing(dataTreated.ResStatus,'constant',string(modeResStatus));
The treated data set now has no missing values.
disp(any(any(ismissing(dataTreated))))
0
Using the treated data set, apply the typical creditscorecard workflow. Create a creditscorecard object with the treated data and apply automatic binning.
scTreated = creditscorecard(dataTreated,'IDVar','CustID');
scTreated = autobinning(scTreated);
Compare the bin information of the untreated data for CustAge with the bin information of the treated data for CustAge.
bi = bininfo(sc,'CustAge');
disp(bi)
Bin            Good   Bad   Odds     WOE        InfoValue
_____________  ____   ___   ______   ________   __________
{'[-Inf,33)'}    69    52   1.3269   -0.42156   0.018993
{'[33,37)'  }    63    45   1.4      -0.36795   0.012839
{'[37,40)'  }    72    47   1.5319   -0.2779    0.0079824
{'[40,46)'  }   172    89   1.9326   -0.04556   0.0004549
{'[46,48)'  }    59    25   2.36     0.15424    0.0016199
{'[48,51)'  }    99    41   2.4146   0.17713    0.0035449
{'[51,58)'  }   157    62   2.5323   0.22469    0.0088407
{'[58,Inf]' }    93    25   3.72     0.60931    0.032198
{'<missing>'}    19    11   1.7273   -0.15787   0.00063885
{'Totals'   }   803   397   2.0227   NaN        0.087112
biTreated = bininfo(scTreated,'CustAge');
disp(biTreated)
Bin            Good   Bad   Odds     WOE        InfoValue
_____________  ____   ___   ______   ________   _________
{'[-Inf,33)'}    69    52   1.3269   -0.42156   0.018993
{'[33,37)'  }    63    45   1.4      -0.36795   0.012839
{'[37,40)'  }    72    47   1.5319   -0.2779    0.0079824
{'[40,45)'  }   156    86   1.814    -0.10891   0.0024345
{'[45,48)'  }    94    39   2.4103   0.17531    0.0033002
{'[48,58)'  }   256   103   2.4854   0.20603    0.01223
{'[58,Inf]' }    93    25   3.72     0.60931    0.032198
{'Totals'   }   803   397   2.0227   NaN        0.089977
The first few bins are the same, but the treatment of missing values influences the binning results, starting with the bin where the missing data is placed. You can further explore your binning results using autobinning with a different algorithm or you can manually modify the bins using modifybins.
For ResStatus, the results for the treated data look similar to the initial results, except for the higher counts in the 'Home Owner' bin due to the treatment. For a categorical variable with more categories (or levels), an automatic algorithm may find category groups and the results may show more differences before and after the treatment. You can further explore your binning results using autobinning with a different algorithm or you can manually modify the bins using modifybins.
bi = bininfo(sc,'ResStatus');
disp(bi)
Bin             Good   Bad   Odds     WOE         InfoValue
______________  ____   ___   ______   _________   __________
{'Tenant'    }   296   161   1.8385   -0.095463   0.0035249
{'Home Owner'}   352   171   2.0585   0.017549    0.00013382
{'Other'     }   128    52   2.4615   0.19637     0.0055808
{'<missing>' }    27    13   2.0769   0.026469    2.3248e-05
{'Totals'    }   803   397   2.0227   NaN         0.0092627
biTreated = bininfo(scTreated,'ResStatus');
disp(biTreated)
Bin             Good   Bad   Odds     WOE         InfoValue
______________  ____   ___   ______   _________   __________
{'Tenant'    }   296   161   1.8385   -0.095463   0.0035249
{'Home Owner'}   379   184   2.0598   0.018182    0.00015462
{'Other'     }   128    52   2.4615   0.19637     0.0055808
{'Totals'    }   803   397   2.0227   NaN         0.0092603
Fit the logistic model, scale the points, and display the final scorecard.
[scTreated,mdl] = fitmodel(scTreated,'Display','off'); % Also return the model; mdl.PredictorNames is used when building the test data
scTreated = formatpoints(scTreated,'PointsOddsAndPDO',[500 2 50]);
ScPoints = displaypoints(scTreated);
disp(ScPoints)
Predictors      Bin                    Points
______________  _____________________  ______
{'CustAge'   }  {'[-Inf,33)'        }  53.507
{'CustAge'   }  {'[33,37)'          }  55.798
{'CustAge'   }  {'[37,40)'          }  59.646
{'CustAge'   }  {'[40,45)'          }  66.868
{'CustAge'   }  {'[45,48)'          }  79.013
{'CustAge'   }  {'[48,58)'          }  80.326
{'CustAge'   }  {'[58,Inf]'         }  97.559
{'CustAge'   }  {'<missing>'        }  NaN
{'ResStatus' }  {'Tenant'           }  62.161
{'ResStatus' }  {'Home Owner'       }  73.305
{'ResStatus' }  {'Other'            }  90.777
{'ResStatus' }  {'<missing>'        }  NaN
{'EmpStatus' }  {'Unknown'          }  58.846
{'EmpStatus' }  {'Employed'         }  86.887
{'EmpStatus' }  {'<missing>'        }  NaN
{'CustIncome'}  {'[-Inf,29000)'     }  29.906
{'CustIncome'}  {'[29000,33000)'    }  56.219
{'CustIncome'}  {'[33000,35000)'    }  67.938
{'CustIncome'}  {'[35000,40000)'    }  70.123
{'CustIncome'}  {'[40000,42000)'    }  70.931
{'CustIncome'}  {'[42000,47000)'    }  82.3
{'CustIncome'}  {'[47000,Inf]'      }  96.647
{'CustIncome'}  {'<missing>'        }  NaN
{'TmWBank'   }  {'[-Inf,12)'        }  51.05
{'TmWBank'   }  {'[12,23)'          }  61.018
{'TmWBank'   }  {'[23,45)'          }  61.818
{'TmWBank'   }  {'[45,71)'          }  92.921
{'TmWBank'   }  {'[71,Inf]'         }  133.14
{'TmWBank'   }  {'<missing>'        }  NaN
{'OtherCC'   }  {'No'               }  50.806
{'OtherCC'   }  {'Yes'              }  75.642
{'OtherCC'   }  {'<missing>'        }  NaN
{'AMBalance' }  {'[-Inf,558.88)'    }  89.788
{'AMBalance' }  {'[558.88,1254.28)' }  63.088
{'AMBalance' }  {'[1254.28,1597.44)'}  59.711
{'AMBalance' }  {'[1597.44,Inf]'    }  49.157
{'AMBalance' }  {'<missing>'        }  NaN
The final scorecard has no explicit <missing> bins with assigned points: every <missing> row shows NaN. If you need to score a new data set and it contains missing data, by default the score function sets the points to NaN. To further explore the handling of missing data, take a few rows from the original data as test data and introduce some missing data.
tdata = dataTreated(11:14,mdl.PredictorNames); % Keep only the predictors retained in the model
% Set some missing values
tdata.CustAge(1) = NaN;
tdata.ResStatus(2) = '<undefined>';
tdata.EmpStatus(3) = '<undefined>';
tdata.CustIncome(4) = NaN;
disp(tdata)
CustAge   ResStatus     EmpStatus     CustIncome   TmWBank   OtherCC   AMBalance
_______   ___________   ___________   __________   _______   _______   _________
NaN       Tenant        Unknown       34000        44        Yes       119.8
48        <undefined>   Unknown       44000        14        Yes       403.62
65        Home Owner    <undefined>   48000        6         No        111.88
44        Other         Unknown       NaN          35        No        436.41
Score the new data and see how points are set to NaN, which leads to NaN scores.
[Scores,Points] = score(scTreated,tdata);
disp(Scores)
NaN
NaN
NaN
NaN
disp(Points)
CustAge   ResStatus   EmpStatus   CustIncome   TmWBank   OtherCC   AMBalance
_______   _________   _________   __________   _______   _______   _________
NaN       62.161      58.846      67.938       61.818    75.642    89.788
80.326    NaN         58.846      82.3         61.018    75.642    89.788
97.559    73.305      NaN         96.647       51.05     50.806    89.788
66.868    90.777      58.846      NaN          61.818    50.806    89.788
For untreated predictors, such as EmpStatus or CustIncome, you can use the name-value argument 'Missing' in formatpoints to choose how to assign points to missing values.
Use the 'MinPoints' option for the 'Missing' argument. This option assigns to missing data the minimum number of possible points for the corresponding predictor in the scorecard. In this example, the minimum number of possible points for CustIncome is 29.906, so the last row in the table gets 29.906 points for the missing CustIncome value.
scTreated = formatpoints(scTreated,'Missing','MinPoints');
[Scores,Points] = score(scTreated,tdata);
disp(Scores)
469.7003
510.0812
518.0013
448.8099
disp(Points)
CustAge   ResStatus   EmpStatus   CustIncome   TmWBank   OtherCC   AMBalance
_______   _________   _________   __________   _______   _______   _________
53.507    62.161      58.846      67.938       61.818    75.642    89.788
80.326    62.161      58.846      82.3         61.018    75.642    89.788
97.559    73.305      58.846      96.647       51.05     50.806    89.788
66.868    90.777      58.846      29.906       61.818    50.806    89.788
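The effect of 'MinPoints' on the fourth observation can be reproduced by hand. The following base MATLAB sketch copies the CustIncome points from the treated scorecard and takes 'MinPoints' to mean the smallest non-NaN points among that predictor's bins, which is an interpretation based on the description above rather than a statement about the internal implementation. The variable names are illustrative.

```matlab
% CustIncome points copied from the displaypoints output of scTreated;
% the trailing NaN is the <missing> row before 'Missing' is set.
custIncomePoints = [29.906 56.219 67.938 70.123 70.931 82.3 96.647 NaN];

% Assumed meaning of 'MinPoints': smallest non-NaN points for the predictor.
minPoints = min(custIncomePoints, [], 'omitnan');

% Points of the fourth test row, with NaN where CustIncome is missing.
row4 = [66.868 90.777 58.846 NaN 61.818 50.806 89.788];
row4(isnan(row4)) = minPoints;   % only the CustIncome entry is NaN here
fprintf('%.3f\n', sum(row4));
```

The resulting sum agrees with the score reported for the fourth observation, up to rounding of the displayed points.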
However, for predictors that were treated in the training data, such as CustAge, the effect of the 'Missing' argument is inconsistent with the treatment of the training data. For example, the first observation gets 53.507 points for the missing CustAge value. If instead the new data were "treated" and the missing CustAge value were replaced with the mode of the training data (an age of 43), this observation would fall in the [40,45) bin and receive 66.868 points.
Therefore, before scoring, you must treat new data sets the same way that the training data was treated. The 'Missing' argument remains useful for assigning points to untreated predictors, while the treated predictors receive points in a way that is consistent with how the model was developed.
tdataTreated = tdata;
tdataTreated.CustAge = fillmissing(tdataTreated.CustAge,'constant',modeCustAge);
tdataTreated.ResStatus = fillmissing(tdataTreated.ResStatus,'constant',string(modeResStatus));
disp(tdataTreated)
CustAge   ResStatus    EmpStatus     CustIncome   TmWBank   OtherCC   AMBalance
_______   __________   ___________   __________   _______   _______   _________
43        Tenant       Unknown       34000        44        Yes       119.8
48        Home Owner   Unknown       44000        14        Yes       403.62
65        Home Owner   <undefined>   48000        6         No        111.88
44        Other        Unknown       NaN          35        No        436.41
[Scores,Points] = score(scTreated,tdataTreated);
disp(Scores)
483.0606
521.2249
518.0013
448.8099
disp(Points)
CustAge   ResStatus   EmpStatus   CustIncome   TmWBank   OtherCC   AMBalance
_______   _________   _________   __________   _______   _______   _________
66.868    62.161      58.846      67.938       61.818    75.642    89.788
80.326    73.305      58.846      82.3         61.018    75.642    89.788
97.559    73.305      58.846      96.647       51.05     50.806    89.788
66.868    90.777      58.846      29.906       61.818    50.806    89.788
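One way to keep the treatment of training data and new data in step is to record the fill values once and apply them through a single loop. The following is only a sketch under stated assumptions: the fillValues struct and the toy newData table are hypothetical and are not part of the toolbox or of this example's data. The sketch relies only on fillmissing with the 'constant' method, as used above.

```matlab
% Hypothetical record of the fill values learned from the training data.
fillValues = struct('CustAge', 43, 'ResStatus', "Home Owner");

% Toy new data (illustrative values only) with one missing entry per column.
custAge   = [NaN; 48];
resStatus = categorical({'Tenant'; 'Other'}, {'Tenant','Home Owner','Other'});
resStatus(2) = '<undefined>';   % same idiom used above to inject a missing value
newData = table(custAge, resStatus, 'VariableNames', {'CustAge','ResStatus'});

% Apply the recorded treatment to every treated predictor.
names = fieldnames(fillValues);
for k = 1:numel(names)
    v = names{k};
    newData.(v) = fillmissing(newData.(v), 'constant', fillValues.(v));
end
disp(newData)
```

Keeping the fill values in one place means that the training-time treatment and the scoring-time treatment cannot drift apart. Predictors that are not listed in fillValues are left untouched, so the 'Missing' argument of formatpoints still governs them.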
See also: bininfo | creditscorecard | plotbins