
modelCalibration

Compute RMSE of predicted and observed PDs on grouped data

Since R2023a

Description

CalMeasure = modelCalibration(pdModel,data,GroupBy) computes the root mean squared error (RMSE) of the observed compared to the predicted probabilities of default (PD). GroupBy is required and can be any column in the data input (not necessarily a model variable). The modelCalibration function computes the observed PD as the default rate of each group and the predicted PD as the average PD for each group. modelCalibration supports comparison against a reference model.


[CalMeasure,CalData] = modelCalibration(___,Name,Value) specifies options using one or more name-value pair arguments in addition to the input arguments in the previous syntax.


Examples


This example shows how to use fitLifetimePDModel to fit data with a Logistic model and then use modelCalibration to compute the root mean squared error (RMSE) of the observed probabilities of default (PDs) with respect to the predicted PDs.

Load Data

Load the credit portfolio data.

load RetailCreditPanelData.mat
disp(head(data))
    ID    ScoreGroup    YOB    Default    Year
    __    __________    ___    _______    ____

    1      Low Risk      1        0       1997
    1      Low Risk      2        0       1998
    1      Low Risk      3        0       1999
    1      Low Risk      4        0       2000
    1      Low Risk      5        0       2001
    1      Low Risk      6        0       2002
    1      Low Risk      7        0       2003
    1      Low Risk      8        0       2004
disp(head(dataMacro))
    Year     GDP     Market
    ____    _____    ______

    1997     2.72      7.61
    1998     3.57     26.24
    1999     2.86      18.1
    2000     2.43      3.19
    2001     1.26    -10.51
    2002    -0.59    -22.95
    2003     0.63      2.78
    2004     1.85      9.48

Join the two data components into a single data set.

data = join(data,dataMacro);
disp(head(data))
    ID    ScoreGroup    YOB    Default    Year     GDP     Market
    __    __________    ___    _______    ____    _____    ______

    1      Low Risk      1        0       1997     2.72      7.61
    1      Low Risk      2        0       1998     3.57     26.24
    1      Low Risk      3        0       1999     2.86      18.1
    1      Low Risk      4        0       2000     2.43      3.19
    1      Low Risk      5        0       2001     1.26    -10.51
    1      Low Risk      6        0       2002    -0.59    -22.95
    1      Low Risk      7        0       2003     0.63      2.78
    1      Low Risk      8        0       2004     1.85      9.48

Partition Data

Separate the data into training and test partitions.

nIDs = max(data.ID);
uniqueIDs = unique(data.ID);

rng('default'); % For reproducibility
c = cvpartition(nIDs,'HoldOut',0.4);

TrainIDInd = training(c);
TestIDInd = test(c);

TrainDataInd = ismember(data.ID,uniqueIDs(TrainIDInd));
TestDataInd = ismember(data.ID,uniqueIDs(TestIDInd));

Create Logistic Lifetime PD Model

Use fitLifetimePDModel to create a Logistic model using the training data.

pdModel = fitLifetimePDModel(data(TrainDataInd,:),"Logistic",...
    'AgeVar','YOB',...
    'IDVar','ID',...
    'LoanVars','ScoreGroup',...
    'MacroVars',{'GDP','Market'},...
    'ResponseVar','Default');
disp(pdModel)
  Logistic with properties:

            ModelID: "Logistic"
        Description: ""
    UnderlyingModel: [1x1 classreg.regr.CompactGeneralizedLinearModel]
              IDVar: "ID"
             AgeVar: "YOB"
           LoanVars: "ScoreGroup"
          MacroVars: ["GDP"    "Market"]
        ResponseVar: "Default"
         WeightsVar: ""
       TimeInterval: 1

Display the underlying model.

pdModel.UnderlyingModel
ans = 
Compact generalized linear regression model:
    logit(Default) ~ 1 + ScoreGroup + YOB + GDP + Market
    Distribution = Binomial

Estimated Coefficients:
                               Estimate        SE         tStat       pValue   
                              __________    _________    _______    ___________

    (Intercept)                  -2.7422      0.10136    -27.054     3.408e-161
    ScoreGroup_Medium Risk      -0.68968     0.037286    -18.497     2.1894e-76
    ScoreGroup_Low Risk          -1.2587     0.045451    -27.693    8.4736e-169
    YOB                         -0.30894     0.013587    -22.738    1.8738e-114
    GDP                         -0.11111     0.039673    -2.8006      0.0051008
    Market                    -0.0083659    0.0028358    -2.9502      0.0031761


388097 observations, 388091 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 1.85e+03, p-value = 0

Compute Model Calibration

Model calibration measures the accuracy of the predicted probabilities of default. For example, if the model predicts a 10% PD for a group, does the group end up showing an approximately 10% default rate, or is the eventual rate much higher or lower? Whereas model discrimination measures only the risk ranking, model calibration measures the accuracy of the predicted risk levels.

modelCalibration computes the root mean squared error (RMSE) of the observed PDs with respect to the predicted PDs. A grouping variable is required and it can be any column in the data input (not necessarily a model variable). The modelCalibration function computes the observed PD as the default rate of each group and the predicted PD as the average PD for each group.

DataSetChoice = "Training";
if DataSetChoice=="Training"
    Ind = TrainDataInd;
else
    Ind = TestDataInd;
end

GroupingVar = "YOB";
[CalMeasure,CalData] = modelCalibration(pdModel,data(Ind,:),GroupingVar,DataID=DataSetChoice)
CalMeasure=table
                                            RMSE   
                                          _________

    Logistic, grouped by YOB, Training    0.0004142

CalData=16×5 table
     ModelID      YOB       PD        GroupCount    WeightedCount
    __________    ___    _________    __________    _____________

    "Observed"     1      0.017421      58092           58092    
    "Observed"     2      0.012305      56723           56723    
    "Observed"     3      0.011382      55524           55524    
    "Observed"     4      0.010741      54650           54650    
    "Observed"     5       0.00809      53770           53770    
    "Observed"     6     0.0066747      53186           53186    
    "Observed"     7     0.0032198      36959           36959    
    "Observed"     8     0.0018757      19193           19193    
    "Logistic"     1      0.017185      58092           58092    
    "Logistic"     2      0.012791      56723           56723    
    "Logistic"     3       0.01131      55524           55524    
    "Logistic"     4      0.010615      54650           54650    
    "Logistic"     5     0.0083982      53770           53770    
    "Logistic"     6     0.0058744      53186           53186    
    "Logistic"     7     0.0035872      36959           36959    
    "Logistic"     8     0.0023689      19193           19193    

Visualize the model calibration using modelCalibrationPlot.

modelCalibrationPlot(pdModel,data(Ind,:),GroupingVar,DataID=DataSetChoice);

Figure: scatter plot of PD versus YOB, titled "Scatter Grouped by YOB, Training, Logistic, RMSE = 0.0004142", showing the Observed and Logistic series.

You can use more than one variable for grouping. For this example, group by the variables YOB and ScoreGroup.

CalMeasure = modelCalibration(pdModel,data(Ind,:),["YOB","ScoreGroup"],DataID=DataSetChoice);
disp(CalMeasure)
                                                         RMSE   
                                                      __________

    Logistic, grouped by YOB, ScoreGroup, Training    0.00066239

Now visualize the two grouping variables using modelCalibrationPlot.

modelCalibrationPlot(pdModel,data(Ind,:),["YOB","ScoreGroup"],DataID=DataSetChoice);

Figure: scatter plot of PD versus YOB, titled "Scatter Grouped by YOB and ScoreGroup, Training, Logistic, RMSE = 0.00066239", showing the Observed and Logistic series for the High Risk, Medium Risk, and Low Risk score groups.

Input Arguments


Probability of default model, specified as a previously created Logistic, Probit, or Cox object using fitLifetimePDModel. Alternatively, you can create a custom probability of default model using customLifetimePDModel.

Data Types: object

Data, specified as a NumRows-by-NumCols table with projected predictor values to make lifetime predictions. The predictor names and data types must be consistent with the underlying model.

Data Types: table

Name of a column in the data input used to group the data, specified as a string or character vector. GroupBy does not have to be a model variable name. For each group designated by GroupBy, the modelCalibration function computes the observed default rate and the average predicted PD, and then measures the RMSE across groups.

Data Types: string | char

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: [CalMeasure,CalData] = modelCalibration(pdModel,data(Ind,:),GroupBy=["YOB","ScoreGroup"],DataID="DataSetChoice")

Data set identifier, specified as DataID and a character vector or string. DataID is included in the modelCalibration output for reporting purposes.

Data Types: char | string

Conditional PD values predicted for data by the reference model, specified as ReferencePD and a NumRows-by-1 numeric vector. The function reports the modelCalibration output information for both the pdModel object and the reference model.

Data Types: double

Identifier for the reference model, specified as ReferenceID and a character vector or string. ReferenceID is used in the modelCalibration output for reporting purposes.

Data Types: char | string
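As an illustrative sketch of the ReferencePD and ReferenceID arguments, continuing the earlier example, a reference PD column can come from any alternative model scored on the same rows; here a simpler challenger model is fitted with fitLifetimePDModel (the names refModel and refPD are hypothetical, and the choice of a Probit challenger is only an example):

```matlab
% Hypothetical sketch: fit a simpler challenger model on the training data.
refModel = fitLifetimePDModel(data(TrainDataInd,:),"Probit", ...
    'AgeVar','YOB', ...
    'IDVar','ID', ...
    'ResponseVar','Default');
refPD = predict(refModel,data(Ind,:)); % one conditional PD per row

% CalMeasure now has one RMSE row per model; CalData stacks the observed
% PDs, the pdModel PDs, and the reference PDs.
[CalMeasure,CalData] = modelCalibration(pdModel,data(Ind,:),"YOB", ...
    ReferencePD=refPD,ReferenceID="Probit");
```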

Output Arguments


Calibration measure, returned as a single-column table of RMSE values.

This table has one row if only the pdModel accuracy is measured and it has two rows if reference model information is given. The row names of CalMeasure report the model IDs, grouping variables, and data ID.

Note

The reported RMSE values depend on the grouping variable for the required GroupBy argument.

Calibration data, returned as a table of observed and predicted PD values for each group.

The reported observed PD values correspond to the observed default rate for each group. The reported predicted PD values are the average PD values predicted by the pdModel object for each group, and similarly for the reference model. The modelCalibration function stacks the PD data, placing the observed values for all groups first, then the predicted PDs for the pdModel, and then the predicted PDs for the reference model, if given.

The column 'ModelID' identifies which rows correspond to the observed PD, the pdModel, or the reference model. The table also has one column for each grouping variable showing the unique combinations of grouping values. The 'PD' column of CalData contains the PD data, and the 'GroupCount' column contains the group counts. The last column of CalData is the WeightedCount.
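For example, given the CalData output from the earlier example, you can separate the observed and predicted rows using the ModelID column (a minimal sketch; the value "Logistic" is the model ID from that example):

```matlab
% Split CalData into observed and predicted rows by the ModelID column.
observed  = CalData(CalData.ModelID == "Observed",:);
predicted = CalData(CalData.ModelID == "Logistic",:);

% The rows are aligned by group, so the PD columns can be compared directly.
disp([observed.PD predicted.PD])
```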

More About


Model Calibration

Model calibration measures the accuracy of the predicted probability of default (PD) values.

To measure model calibration, you must compare the predicted PD values to the observed default rates. For example, if a group of customers is predicted to have an average PD of 5%, then is the observed default rate for that group close to 5%?

The modelCalibration function requires a grouping variable and computes the average predicted PD and the observed default rate within each group. modelCalibration then uses the root mean squared error (RMSE) to measure the deviations between the observed and predicted values across groups. For example, the grouping variable could be the calendar year, so that rows corresponding to the same calendar year are grouped together. For each year, the software computes the observed default rate and the average predicted PD, and then applies the RMSE formula to obtain a single measure of the prediction error across all years in the sample.

Suppose there are N observations in the data set, and there are M groups G_1, ..., G_M. The default rate for group G_i is

DR_i = D_i / N_i

where:

D_i is the number of defaults observed in group G_i.

N_i is the number of observations in group G_i.

The average predicted probability of default PD_i for group G_i is

PD_i = (1 / N_i) Σ_{j ∈ G_i} PD(j)

where PD(j) is the probability of default for observation j. In other words, PD_i is the average of the predicted PDs within group G_i.

Therefore, the RMSE is computed as

RMSE = sqrt( Σ_{i=1}^{M} (N_i / N) (DR_i − PD_i)^2 )

The RMSE, as defined, depends on the selected grouping variable. For example, grouping by calendar year and grouping by years-on-books might result in different RMSE values.
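The grouped computation above can be sketched directly in MATLAB with groupsummary (an illustrative check only, assuming the data table, Ind index, and fitted pdModel from the earlier example):

```matlab
% Observed default rate and average predicted PD per YOB group.
tbl = data(Ind,:);
tbl.PredPD = predict(pdModel,tbl);  % per-row conditional PD from the model
g = groupsummary(tbl,"YOB","mean",["Default","PredPD"]);
% g.mean_Default is DR_i, g.mean_PredPD is PD_i, g.GroupCount is N_i.

% Weighted RMSE across groups, with weights N_i / N.
w = g.GroupCount / sum(g.GroupCount);
RMSE = sqrt(sum(w .* (g.mean_Default - g.mean_PredPD).^2));
```

Under these assumptions, RMSE should match the value that modelCalibration reports when grouping by YOB.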

Use modelCalibrationPlot to visualize observed default rates and predicted PD values on grouped data.

References

[1] Baesens, Bart, Daniel Roesch, and Harald Scheule. Credit Risk Analytics: Measurement Techniques, Applications, and Examples in SAS. Wiley, 2016.

[2] Bellini, Tiziano. IFRS 9 and CECL Credit Risk Modelling and Validation: A Practical Guide with Examples Worked in R and SAS. San Diego, CA: Elsevier, 2019.

[3] Breeden, Joseph. Living with CECL: The Modeling Dictionary. Santa Fe, NM: Prescient Models LLC, 2018.

[4] Roesch, Daniel and Harald Scheule. Deep Credit Risk: Machine Learning with Python. Independently published, 2020.

Version History

Introduced in R2023a
