Main Content

glmfit

Fit generalized linear regression model

    Description

    b = glmfit(X,y,distr) returns a vector b of coefficient estimates for a generalized linear regression model of the responses in y on the predictors in X, using the distribution distr.

    b = glmfit(X,y,distr,Name,Value) specifies additional options using one or more name-value arguments. For example, you can specify 'Constant','off' to omit the constant term from the model.

    example

    [b,dev] = glmfit(___) also returns the value dev, the deviance of the fit.

    example

    [b,dev,stats] = glmfit(___) also returns the model statistics stats.

    Examples

    collapse all

    Fit a generalized linear regression model, and compute predicted (estimated) values for the predictor data using the fitted model.

    Create a sample data set.

    x = [2100 2300 2500 2700 2900 3100 ...
         3300 3500 3700 3900 4100 4300]';
    n = [48 42 31 34 31 21 23 23 21 16 17 21]';
    y = [1 2 0 3 8 8 14 17 19 15 17 21]';

    x contains the predictor variable values. Each y value is the number of successes in the corresponding number of trials in n.

    Fit a probit regression model for y on x.

    b = glmfit(x,[y n],'binomial','Link','probit');

    Compute the estimated number of successes.

    yfit = glmval(b,x,'probit','Size',n);

    Plot the observed success percent and estimated success percent versus the x values.

    plot(x,y./n,'o',x,yfit./n,'-')

    Figure contains an axes object. The axes object contains 2 objects of type line. One or more of the lines displays its values using only markers

    Define a custom link function and use it to fit a generalized linear regression model.

    Load the sample data.

    load fisheriris

    The column vector species contains iris flowers of three different species: setosa, versicolor, and virginica. The matrix meas contains four types of measurements for the flowers, the length and width of sepals and petals in centimeters.

    Define the predictor variables and response variable.

    X = meas(51:end,:);
    y = strcmp('versicolor',species(51:end));

    Define a custom link function for a logit link function. Create three function handles that define the link function, the derivative of the link function, and the inverse link function. Store them in a cell array.

    link = @(mu) log(mu./(1-mu));
    derlink = @(mu) 1./(mu.*(1-mu));
    invlink = @(resp) 1./(1+exp(-resp));
    F = {link,derlink,invlink};

    Fit a logistic regression model using glmfit with the custom link function.

    b = glmfit(X,y,'binomial','link',F)
    b = 5×1
    
       42.6378
        2.4652
        6.6809
       -9.4294
      -18.2861
    
    

    Fit a generalized linear model by using the built-in logit link function, and compare the results.

    b = glmfit(X,y,'binomial','link','logit')
    b = 5×1
    
       42.6378
        2.4652
        6.6809
       -9.4294
      -18.2861
    
    

    Fit a generalized linear regression model that contains an intercept and linear term for each predictor. Perform a deviance test that determines whether the model fits significantly better than a constant model.

    Generate sample data using Poisson random numbers with two underlying predictors X(:,1) and X(:,2).

    rng('default') % For reproducibility
    rndvars = randn(100,2);
    X = [2 + rndvars(:,1),rndvars(:,2)];
    mu = exp(1 + X*[1;2]);
    y = poissrnd(mu);

    Fit a generalized linear regression model that contains an intercept and linear term for each predictor.

    [b,dev] = glmfit(X,y,'poisson');

    The second output argument dev is a Deviance of the fit.

    Fit a generalized linear regression model that contains only an intercept. Specify the predictor variable as a column of 1s, and specify 'Constant' as 'off' so that glmfit does not include a constant term in the model.

    [~,dev_noconstant] = glmfit(ones(100,1),y,'poisson','Constant','off');

    Compute the difference between dev_constant and dev.

    D = dev_noconstant - dev
    D = 
    2.9533e+05
    

    D has a chi-square distribution with 2 degrees of freedom. The degrees of freedom equal the difference in the number of estimated parameters in the model corresponding to dev and the number of estimated parameters in the constant model. Find the p-value for a deviance test.

    p = 1 - chi2cdf(D,2)
    p = 
    0
    

    The small p-value indicates that the model differs significantly from a constant.

    Alternatively, you can create a generalized linear regression model of Poisson data by using the fitglm function. The model display includes the statistic (Chi^2-statistic vs. constant model) and p-value.

    mdl = fitglm(X,y,'y ~ x1 + x2','Distribution','poisson')
    mdl = 
    Generalized linear regression model:
        log(y) ~ 1 + x1 + x2
        Distribution = Poisson
    
    Estimated Coefficients:
                       Estimate       SE        tStat     pValue
                       ________    _________    ______    ______
    
        (Intercept)     1.0405      0.022122    47.034      0   
        x1              0.9968      0.003362    296.49      0   
        x2               1.987     0.0063433    313.24      0   
    
    
    100 observations, 97 error degrees of freedom
    Dispersion: 1
    Chi^2-statistic vs. constant model: 2.95e+05, p-value = 0
    

    You can also use the devianceTest function with the fitted model object.

    devianceTest(mdl)
    ans=2×4 table
                                 Deviance     DFE     chi2Stat     pValue
                                __________    ___    __________    ______
    
        log(y) ~ 1              2.9544e+05    99                         
        log(y) ~ 1 + x1 + x2         107.4    97     2.9533e+05       0  
    
    

    Input Arguments

    collapse all

    Predictor variables, specified as an n-by-p numeric matrix, where n is the number of observations and p is the number of predictor variables. Each column of X represents one variable, and each row represents one observation.

    By default, glmfit includes a constant term in the model. Do not add a column of 1s directly to X. You can change the default behavior of glmfit by specifying the 'Constant' name-value argument.

    Data Types: single | double

    Response variable, specified as a vector or matrix.

    • If distr is not 'binomial', then y must be an n-by-1 vector, where n is the number of observations. Each entry in y is the response for the corresponding row of X. The data type must be single or double.

    • If distr is 'binomial', then y is an n-by-1 vector indicating success or failure at each observation, or an n-by-2 matrix whose first column indicates the number of successes for each observation and second column indicates the number of trials for each observation.

    Data Types: single | double | logical | categorical

    Distribution of the response variable, specified as one of the values in this table.

    ValueDescription
    'normal'Normal distribution (default)
    'binomial'Binomial distribution
    'poisson'Poisson distribution
    'gamma'Gamma distribution
    'inverse gaussian'Inverse Gaussian distribution

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

    Example: b = glmfit(X,y,'normal','link','probit') specifies that the distribution of the response is normal and instructs glmfit to use the probit link function.

    Initial values for the coefficient estimates, specified as a numeric vector. The default values are initial fitted values derived from the input data.

    Data Types: single | double

    Indicator for the constant term (intercept) in the fit, specified as either 'on' to include the constant term or 'off' to remove it from the model.

    • 'on' (default) — glmfit includes a constant term in the model and returns a (p + 1)-by-1 vector of coefficient estimates b, where p is the number of predictors in X. The coefficient of the constant term is the first element of b.

    • 'off'glmfit omits the constant term and returns a p-by-1 vector of coefficient estimates b.

    Example: 'Constant','off'

    Indicator to compute a dispersion parameter for 'binomial' and 'poisson' distributions, specified as 'on' or 'off'.

    ValueDescription
    'on'Estimate a dispersion parameter when computing standard errors. The estimated dispersion parameter value is the sum of squared Pearson residuals divided by the degrees of freedom for error (DFE).
    'off'Use the theoretical value of 1 when computing standard errors (default).

    The fitting function always estimates the dispersion for other distributions.

    Example: 'EstDisp','on'

    Penalty for the likelihood estimate, specified as "none" or "jeffreys-prior".

    • "none"glmfit does not apply a penalty to the likelihood estimate.

    • "jeffreys-prior"glmfit uses Jeffreys prior to penalize the likelihood estimate.

    For logistic models, setting LikelihoodPenalty to "jeffreys-prior" is called Firth's regression. To reduce the coefficient estimate bias when you have a small number of samples, or when you are performing binomial (logistic) regression on a separable data set, set LikelihoodPenalty to "jeffreys-prior".

    Example: LikelihoodPenalty="jeffreys-prior"

    Data Types: char | string

    Link function to use in place of the canonical link function, specified as one of the built-in link functions in the following table or a custom link function.

    Link Function NameLink FunctionMean (Inverse) Function
    'identity' (default for 'normal' distribution)f(μ) = μμ = Xb
    'log' (default for 'poisson' distribution)f(μ) = log(μ)μ = exp(Xb)
    'logit' (default for 'binomial' distribution)f(μ) = log(μ/(1 – μ))μ = exp(Xb) / (1 + exp(Xb))
    'probit'f(μ) = Φ–1(μ), where Φ is the cumulative distribution function of the standard normal distributionμ = Φ(Xb)

    'loglog'

    f(μ) = log(–log(μ))μ = exp(–exp(Xb))
    'comploglog'f(μ) = log(–log(1 – μ))μ = 1 – exp(–exp(Xb))
    'reciprocal' (default for 'gamma' distribution)f(μ) = 1/μμ = 1/(Xb)
    p (a number, default for the 'inverse gaussian' distribution with p = –2)f(μ) = μpμ = Xb1/p

    The default 'Link' value is the canonical link function, which depends on the distribution of the response variable specified by the distr argument.

    You can specify a custom link function using a structure or cell array.

    • Structure with three fields. Each field of the structure (for example, S) holds a function handle that accepts a vector of inputs and returns a vector of the same size:

      • S.Link — Link function, f(μ) = S.Link(μ)

      • S.Derivative — Derivative of the link function

      • S.Inverse — Inverse link function, μ = S.Inverse(Xb)

    • Cell array of the form {FL FD FI} that defines the link function (FL(mu)), the derivative of the link function (FD = dFL(mu)/dmu), and the inverse link function (FI = FL^(–1)). Each entry holds a function handle that accepts a vector of inputs and returns a vector of the same size.

    The link function defines the relationship f(μ) = X*b between the mean response μ and the linear combination of predictors X*b.

    Example: 'Link','probit'

    Data Types: single | double | char | string | struct | cell

    Offset variable in the fit, specified as a numeric vector with the same length as the response y.

    glmfit uses Offset as an additional predictor with a coefficient value fixed at 1. In other words, the formula for fitting is

    f(μ) = Offset + X*b,

    where f is the link function, μ is the mean response, and X*b is the linear combination of predictors X. The Offset predictor has coefficient 1.

    For example, consider a Poisson regression model. Suppose, for theoretical reasons, the number of counts is to be proportional to a predictor A. By using the log link function and specifying log(A) as an offset, you can force the model to satisfy this theoretical constraint.

    Data Types: single | double

    Optimization options, specified as a structure. This argument determines the control parameters for the iterative algorithm that glmfit uses.

    Create the 'Options' value by using the function statset or by creating a structure array containing the fields and values described in this table.

    Field NameValueDefault Value
    Display

    Amount of information displayed by the algorithm

    • 'off' — Displays no information

    • 'final' — Displays the final output

    'off'
    MaxIter

    Maximum number of iterations allowed, specified as a positive integer

    100
    TolX

    Termination tolerance for the parameters, specified as a positive scalar

    1e-6

    You can also enter statset('glmfit') in the Command Window to see the names and default values of the fields that glmfit accepts in the 'Options' name-value argument.

    Example: 'Options',statset('Display','final','MaxIter',1000) specifies to display the final information of the iterative algorithm results, and change the maximum number of iterations allowed to 1000.

    Data Types: struct

    Observation weights, specified as an n-by-1 vector of nonnegative scalar values, where n is the number of observations.

    Data Types: single | double

    Output Arguments

    collapse all

    Coefficient estimates, returned as a numeric vector.

    • If 'Constant' is 'on' (default), then glmfit includes a constant term in the model and returns a (p + 1)-by-1 vector of coefficient estimates b, where p is the number of predictors in X. The coefficient of the constant term is the first element of b.

    • If 'Constant' is 'off', then glmfit omits the constant term and returns a p-by-1 vector of coefficient estimates b.

    Deviance of the fit, returned as a numeric value. The deviance is useful for comparing two models when one model is a special case of the other model. The difference between the deviance of the two models has a chi-square distribution with degrees of freedom equal to the difference in the number of estimated parameters between the two models.

    For more information, see Deviance.

    Model statistics, returned as a structure with the following fields:

    • beta — Coefficient estimates b

    • dfe — Degrees of freedom for error

    • sfit — Estimated dispersion parameter

    • s — Theoretical or estimated dispersion parameter

    • estdisp — 0 when 'EstDisp' is 'off' and 1 when 'EstDisp' is 'on'

    • covb — Estimated covariance matrix for b

    • se — Vector of standard errors of the coefficient estimates b

    • coeffcorr — Correlation matrix for b

    • tt statistics for b

    • pp-values for b

    • resid — Vector of residuals

    • residp — Vector of Pearson residuals

    • residd — Vector of deviance residuals

    • resida — Vector of Anscombe residuals

    If you estimate a dispersion parameter for the binomial or Poisson distribution, then stats.s is equal to stats.sfit. Also, the elements of stats.se differ by the factor stats.s from their theoretical values.

    More About

    collapse all

    Deviance

    Deviance is a generalization of the residual sum of squares. It measures the goodness of fit compared to a saturated model.

    The deviance of a model M1 is twice the difference between the loglikelihood of the model M1 and the saturated model Ms. A saturated model is a model with the maximum number of parameters that you can estimate.

    For example, if you have n observations (yi, i = 1, 2, ..., n) with potentially different values for XiTβ, then you can define a saturated model with n parameters. Let L(b,y) denote the maximum value of the likelihood function for a model with the parameters b. Then the deviance of the model M1 is

    2(logL(b1,y)logL(bS,y)),

    where b1 and bs contain the estimated parameters for the model M1 and the saturated model, respectively. The deviance has a chi-squared distribution with np degrees of freedom, where n is the number of parameters in the saturated model and p is the number of parameters in the model M1.

    Assume you have two different generalized linear regression models M1 and M2, and M1 has a subset of the terms in M2. You can assess the fit of the models by comparing their deviances D1 and D2. The difference of the deviances is

    D=D2D1=2(logL(b2,y)logL(bS,y))+2(logL(b1,y)logL(bS,y))=2(logL(b2,y)logL(b1,y)).

    Asymptotically, the difference D has a chi-squared distribution with degrees of freedom v equal to the difference in the number of parameters estimated in M1 and M2. You can obtain the p-value for this test by using 1  —  chi2cdf(D,v).

    Typically, you examine D using a model M2 with a constant term and no predictors. Therefore, D has a chi-squared distribution with p – 1 degrees of freedom. If the dispersion is estimated, the difference divided by the estimated dispersion has an F distribution with p – 1 numerator degrees of freedom and np denominator degrees of freedom.

    Tips

    • glmfit treats NaNs in X or y as missing values and ignores them.

    Alternative Functionality

    glmfit is useful when you simply need the output arguments of the function or when you want to repeat fitting a model multiple times in a loop. If you need to investigate a fitted model further, create a generalized linear regression model object GeneralizedLinearModel by using fitglm or stepwiseglm. A GeneralizedLinearModel object provides more features than glmfit.

    • Use the properties of GeneralizedLinearModel to investigate a fitted model. The object properties include information about the coefficient estimates, summary statistics, fitting method, and input data.

    • Use the object functions of GeneralizedLinearModel to predict responses and to modify, evaluate, and visualize the generalized linear regression model.

    • You can find the information in the output of glmfit using the properties and object functions of GeneralizedLinearModel.

      Output of glmfitEquivalent Values in GeneralizedLinearModel
      bSee the Estimate column of the Coefficients property.
      devSee the Deviance property.
      stats

      See the model display in the Command Window. You can find the statistics in the model properties (CoefficientCovariance, Coefficients, Dispersion, DispersionEstimated, and Residuals).

      The dispersion parameter in stats.s of glmfit is the scale factor for the standard errors of coefficients, whereas the dispersion parameter in the Dispersion property of a generalized linear model is the scale factor for the variance of the response. Therefore, stats.s is the square root of the Dispersion value.

    References

    [1] Dobson, A. J. An Introduction to Generalized Linear Models. New York: Chapman & Hall, 1990.

    [2] McCullagh, P., and J. A. Nelder. Generalized Linear Models. New York: Chapman & Hall, 1990.

    [3] Collett, D. Modeling Binary Data. New York: Chapman & Hall, 2002.

    Extended Capabilities

    Version History

    Introduced before R2006a