lbfgsState
Description
An lbfgsState object stores information about steps in the L-BFGS algorithm.
The L-BFGS algorithm [1] is a quasi-Newton method that approximates the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. Use the L-BFGS algorithm for small networks and data sets that you can process in a single batch.
Use lbfgsState objects in conjunction with the lbfgsupdate function to train a neural network using the L-BFGS algorithm.
Creation
Description
solverState = lbfgsState creates an L-BFGS state object with a history size of 10 and an initial inverse Hessian factor of 1.
solverState = lbfgsState(Name=Value) sets the HistorySize and InitialInverseHessianFactor properties using one or more name-value arguments.
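As a quick illustration of the name-value syntax, this minimal sketch creates a state object with non-default values and reads them back. It assumes that Deep Learning Toolbox is installed; the values 5 and 0.5 are arbitrary choices for illustration.

```matlab
% Create an L-BFGS state object with non-default settings.
% The values 5 and 0.5 are arbitrary illustration values.
solverState = lbfgsState( ...
    HistorySize=5, ...
    InitialInverseHessianFactor=0.5);

% Read the properties back from the object.
historySize = solverState.HistorySize;
initialFactor = solverState.InitialInverseHessianFactor;
```

Because these properties are read-only after creation, create a new state object if you need different values.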
Properties
L-BFGS State
HistorySize — Number of state updates to store
10 (default) | positive integer
Number of state updates to store, specified as a positive integer. Values between 3 and 20 suit most tasks.
The L-BFGS algorithm uses a history of gradient calculations to approximate the Hessian matrix recursively. For more information, see Limited-Memory BFGS.
After creating the lbfgsState object, this property is read-only.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
InitialInverseHessianFactor — Initial value that characterizes approximate inverse Hessian matrix
1 (default) | positive scalar
This property is read-only.
Initial value that characterizes the approximate inverse Hessian matrix, specified as a positive scalar.
To save memory, the L-BFGS algorithm does not store and invert the dense Hessian matrix B. Instead, the algorithm uses the approximation $B_{k-m}^{-1} \approx \lambda_k I$, where m is the history size, the inverse Hessian factor $\lambda_k$ is a scalar, and I is the identity matrix. The algorithm then stores the scalar inverse Hessian factor only. The algorithm updates the inverse Hessian factor at each step.
The initial inverse Hessian factor is the value of $\lambda_0$.
For more information, see Limited-Memory BFGS.
After creating the lbfgsState object, this property is read-only.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
InverseHessianFactor — Value that characterizes approximate inverse Hessian matrix
1 (default) | positive scalar
Value that characterizes the approximate inverse Hessian matrix, specified as a positive scalar.
To save memory, the L-BFGS algorithm does not store and invert the dense Hessian matrix B. Instead, the algorithm uses the approximation $B_{k-m}^{-1} \approx \lambda_k I$, where m is the history size, the inverse Hessian factor $\lambda_k$ is a scalar, and I is the identity matrix. The algorithm then stores the scalar inverse Hessian factor only. The algorithm updates the inverse Hessian factor at each step.
For more information, see Limited-Memory BFGS.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
InitialGradientsNorm — Norm of initial gradients
[] (default) | dlarray scalar
Since R2023b
This property is read-only.
Norm of the initial gradients, specified as a dlarray scalar or [].
If the state object is the output of the lbfgsupdate function, then InitialGradientsNorm is the first value that the GradientsNorm property takes. Otherwise, InitialGradientsNorm is [].
InitialStepSize — Initial step size
[] (default) | "auto" | real finite scalar
Since R2024b
Initial step size, specified as one of these values:
[] — Do not use an initial step size to determine the initial Hessian approximation.
"auto" — Determine the initial step size automatically. The software uses an initial step size of $\|s_0\|_\infty = \frac{1}{2}\|W_0\|_\infty + 0.1$, where $W_0$ are the initial learnable parameters of the network.
Positive real scalar — Use the specified value as the initial step size $\|s_0\|_\infty$.
If InitialStepSize is "auto" or a positive real scalar, then the software approximates the initial inverse Hessian using $\lambda_0 = \frac{\|s_0\|_\infty}{\|\nabla J(W_0)\|_\infty}$, where $\lambda_0$ is the initial inverse Hessian factor and $\nabla J(W_0)$ denotes the gradients of the loss with respect to the initial learnable parameters. For more information, see Limited-Memory BFGS.
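The following sketch works through the "auto" rule by hand on plain numeric arrays. It assumes that the automatic step size is $\|s_0\|_\infty = \frac{1}{2}\|W_0\|_\infty + 0.1$, that the initial inverse Hessian factor is $\lambda_0 = \|s_0\|_\infty / \|\nabla J(W_0)\|_\infty$, and that the infinity norm is taken as the largest absolute element over all parameters. The arrays W0 and g0 are made-up values, and the code does not call or reproduce the internals of lbfgsupdate.

```matlab
% Hypothetical initial learnable parameters and initial gradients.
W0 = [0.2 -1.5; 0.7 0.4];
g0 = [0.05 -0.2; 0.1 0.025];

% Assumed "auto" rule: half the largest absolute parameter, plus 0.1.
s0 = 0.5*max(abs(W0),[],"all") + 0.1;

% Assumed initial inverse Hessian factor: step size over the largest
% absolute gradient element.
lambda0 = s0/max(abs(g0),[],"all");
```

For these values, s0 is 0.85 and lambda0 is 4.25. Larger initial parameters or smaller initial gradients both lead to a larger initial inverse Hessian factor.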
StepHistory — Step history
{} (default) | cell array
Step history, specified as a cell array.
The L-BFGS algorithm uses a history of gradient calculations to approximate the Hessian matrix recursively. For more information, see Limited-Memory BFGS.
Data Types: cell
GradientsDifferenceHistory — Gradients difference history
{} (default) | cell array
Gradients difference history, specified as a cell array.
The L-BFGS algorithm uses a history of gradient calculations to approximate the Hessian matrix recursively. For more information, see Limited-Memory BFGS.
Data Types: cell
HistoryIndices — History indices
0-by-1 vector (default) | row vector
History indices, specified as a row vector.
HistoryIndices is a 1-by-HistorySize vector, where StepHistory(i) and GradientsDifferenceHistory(i) correspond to iteration HistoryIndices(i).
For more information, see Limited-Memory BFGS.
Data Types: double
Iteration Information
Loss — Loss
[] (default) | dlarray scalar | numeric scalar
This property is read-only.
Loss, specified as a dlarray scalar, a numeric scalar, or [].
If the state object is the output of the lbfgsupdate function, then Loss is the first output of the loss function that you pass to the lbfgsupdate function. Otherwise, Loss is [].
Gradients — Gradients
[] (default) | dlarray object | numeric array | cell array | structure | table
This property is read-only.
Gradients, specified as a dlarray object, a numeric array, a cell array, a structure, a table, or [].
If the state object is the output of the lbfgsupdate function, then Gradients is the second output of the loss function that you pass to the lbfgsupdate function. Otherwise, Gradients is [].
AdditionalLossFunctionOutputs — Additional loss function outputs
1-by-0 cell array (default) | cell array
This property is read-only.
Additional loss function outputs, specified as a cell array.
If the state object is the output of the lbfgsupdate function, then AdditionalLossFunctionOutputs is a cell array containing additional outputs of the loss function that you pass to the lbfgsupdate function. Otherwise, AdditionalLossFunctionOutputs is a 1-by-0 cell array.
Data Types: cell
StepNorm — Norm of step
[] (default) | dlarray scalar | numeric scalar
This property is read-only.
Norm of the step, specified as a dlarray scalar, a numeric scalar, or [].
If the state object is the output of the lbfgsupdate function, then StepNorm is the norm of the step that the lbfgsupdate function calculates. Otherwise, StepNorm is [].
GradientsNorm — Norm of gradients
[] (default) | dlarray scalar | numeric scalar
This property is read-only.
Norm of the gradients, specified as a dlarray scalar, a numeric scalar, or [].
If the state object is the output of the lbfgsupdate function, then GradientsNorm is the norm of the second output of the loss function that you pass to the lbfgsupdate function. Otherwise, GradientsNorm is [].
LineSearchStatus — Status of line search algorithm
"" (default) | "completed" | "failed"
This property is read-only.
Status of the line search algorithm, specified as "", "completed", or "failed".
If the state object is the output of the lbfgsupdate function, then LineSearchStatus is one of these values:
"completed" — The algorithm finds a learning rate that satisfies the LineSearchMethod and MaxNumLineSearchIterations options that the lbfgsupdate function uses.
"failed" — The algorithm fails to find a learning rate that satisfies the LineSearchMethod and MaxNumLineSearchIterations options that the lbfgsupdate function uses.
Otherwise, LineSearchStatus is "".
LineSearchMethod — Method solver uses to find suitable learning rate
"" (default) | "weak-wolfe" | "strong-wolfe" | "backtracking"
This property is read-only.
Method solver uses to find a suitable learning rate, specified as "weak-wolfe", "strong-wolfe", "backtracking", or "".
If the state object is the output of the lbfgsupdate function, then LineSearchMethod is the line search method that the lbfgsupdate function uses. Otherwise, LineSearchMethod is "".
MaxNumLineSearchIterations — Maximum number of line search iterations
0 (default) | nonnegative integer
This property is read-only.
Maximum number of line search iterations, specified as a nonnegative integer.
If the state object is the output of the lbfgsupdate function, then MaxNumLineSearchIterations is the maximum number of line search iterations that the lbfgsupdate function uses. Otherwise, MaxNumLineSearchIterations is 0.
Data Types: double
Examples
Create L-BFGS Solver State Object
Create an L-BFGS solver state object.
solverState = lbfgsState
solverState = 
  LBFGSState with properties:
             InverseHessianFactor: 1
                      StepHistory: {}
       GradientsDifferenceHistory: {}
                   HistoryIndices: [1x0 double]
   Iteration Information
                             Loss: []
                        Gradients: []
    AdditionalLossFunctionOutputs: {1x0 cell}
                    GradientsNorm: []
                         StepNorm: []
                 LineSearchStatus: ""
  Show all properties
Update Learnable Parameters in Neural Network
Read the transmission casing data from the CSV file "transmissionCasingData.csv".
filename = "transmissionCasingData.csv";
tbl = readtable(filename,TextType="String");
Convert the labels for prediction to categorical using the convertvars function.
labelName = "GearToothCondition";
tbl = convertvars(tbl,labelName,"categorical");
To train a network using categorical features, convert the categorical predictors to categorical using the convertvars function by specifying a string array containing the names of all the categorical input variables.
categoricalPredictorNames = ["SensorCondition" "ShaftCondition"];
tbl = convertvars(tbl,categoricalPredictorNames,"categorical");
Loop over the categorical input variables. For each variable, convert the categorical values to one-hot encoded vectors using the onehotencode function.
for i = 1:numel(categoricalPredictorNames)
    name = categoricalPredictorNames(i);
    tbl.(name) = onehotencode(tbl.(name),2);
end
View the first few rows of the table.
head(tbl)
SigMean   SigMedian  SigRMS  SigVar   SigPeak  SigPeak2Peak  SigSkewness  SigKurtosis  SigCrestFactor  SigMAD   SigRangeCumSum  SigCorrDimension  SigApproxEntropy  SigLyapExponent  PeakFreq  HighFreqPower  EnvPower  PeakSpecKurtosis  SensorCondition  ShaftCondition  GearToothCondition
________  _________  ______  _______  _______  ____________  ___________  ___________  ______________  _______  ______________  ________________  ________________  _______________  ________  _____________  ________  ________________  _______________  ______________  __________________
-0.94876  -0.9722    1.3726  0.98387  0.81571  3.6314        -0.041525    2.2666       2.0514          0.8081   28562           1.1429            0.031581          79.931           0         6.75e-06       3.23e-07  162.13            0    1           1    0          No Tooth Fault
-0.97537  -0.98958   1.3937  0.99105  0.81571  3.6314        -0.023777    2.2598       2.0203          0.81017  29418           1.1362            0.037835          70.325           0         5.08e-08       9.16e-08  226.12            0    1           1    0          No Tooth Fault
1.0502    1.0267     1.4449  0.98491  2.8157   3.6314        -0.04162     2.2658       1.9487          0.80853  31710           1.1479            0.031565          125.19           0         6.74e-06       2.85e-07  162.13            0    1           0    1          No Tooth Fault
1.0227    1.0045     1.4288  0.99553  2.8157   3.6314        -0.016356    2.2483       1.9707          0.81324  30984           1.1472            0.032088          112.5            0         4.99e-06       2.4e-07   162.13            0    1           0    1          No Tooth Fault
1.0123    1.0024     1.4202  0.99233  2.8157   3.6314        -0.014701    2.2542       1.9826          0.81156  30661           1.1469            0.03287           108.86           0         3.62e-06       2.28e-07  230.39            0    1           0    1          No Tooth Fault
1.0275    1.0102     1.4338  1.0001   2.8157   3.6314        -0.02659     2.2439       1.9638          0.81589  31102           1.0985            0.033427          64.576           0         2.55e-06       1.65e-07  230.39            0    1           0    1          No Tooth Fault
1.0464    1.0275     1.4477  1.0011   2.8157   3.6314        -0.042849    2.2455       1.9449          0.81595  31665           1.1417            0.034159          98.838           0         1.73e-06       1.55e-07  230.39            0    1           0    1          No Tooth Fault
1.0459    1.0257     1.4402  0.98047  2.8157   3.6314        -0.035405    2.2757       1.955           0.80583  31554           1.1345            0.0353            44.223           0         1.11e-06       1.39e-07  230.39            0    1           0    1          No Tooth Fault
Extract the training data.
predictorNames = ["SigMean" "SigMedian" "SigRMS" "SigVar" "SigPeak" "SigPeak2Peak" ...
    "SigSkewness" "SigKurtosis" "SigCrestFactor" "SigMAD" "SigRangeCumSum" ...
    "SigCorrDimension" "SigApproxEntropy" "SigLyapExponent" "PeakFreq" ...
    "HighFreqPower" "EnvPower" "PeakSpecKurtosis" "SensorCondition" "ShaftCondition"];
XTrain = table2array(tbl(:,predictorNames));
numInputFeatures = size(XTrain,2);
Extract the targets and convert them to one-hot encoded vectors.
TTrain = tbl.(labelName);
TTrain = onehotencode(TTrain,2);
numClasses = size(TTrain,2);
Convert the predictors and targets to dlarray objects with format "BC" (batch, channel).
XTrain = dlarray(XTrain,"BC");
TTrain = dlarray(TTrain,"BC");
Define the network architecture.
numHiddenUnits = 16;
layers = [
    featureInputLayer(numInputFeatures)
    fullyConnectedLayer(numHiddenUnits)
    layerNormalizationLayer
    reluLayer
    fullyConnectedLayer(numClasses)
    softmaxLayer];
net = dlnetwork(layers);
Define the modelLoss function, listed in the Model Loss Function section of the example. This function takes as input a neural network, input data, and targets. The function returns the loss and the gradients of the loss with respect to the network learnable parameters.
The lbfgsupdate function requires a loss function with the syntax [loss,gradients] = f(net). Create a variable that parameterizes the evaluated modelLoss function to take a single input argument.
lossFcn = @(net) dlfeval(@modelLoss,net,XTrain,TTrain);
Initialize an L-BFGS solver state object with a maximum history size of 3 and an initial inverse Hessian approximation factor of 1.1.
solverState = lbfgsState( ...
    HistorySize=3, ...
    InitialInverseHessianFactor=1.1);
Train the network for a maximum of 200 iterations. Stop training early when the norm of the gradients or the norm of the step is smaller than 0.00001, or when the line search fails. Print the training loss every 10 iterations.
maxIterations = 200;
gradientTolerance = 1e-5;
stepTolerance = 1e-5;
iteration = 0;
while iteration < maxIterations
    iteration = iteration + 1;
    [net, solverState] = lbfgsupdate(net,lossFcn,solverState);
    if iteration==1 || mod(iteration,10)==0
        fprintf("Iteration %d: Loss: %d\n",iteration,solverState.Loss);
    end
    if solverState.GradientsNorm < gradientTolerance || ...
            solverState.StepNorm < stepTolerance || ...
            solverState.LineSearchStatus == "failed"
        break
    end
end
Iteration 1: Loss: 9.343236e-01
Iteration 10: Loss: 4.721475e-01
Iteration 20: Loss: 4.678575e-01
Iteration 30: Loss: 4.666964e-01
Iteration 40: Loss: 4.665921e-01
Iteration 50: Loss: 4.663871e-01
Iteration 60: Loss: 4.662519e-01
Iteration 70: Loss: 4.660451e-01
Iteration 80: Loss: 4.645303e-01
Iteration 90: Loss: 4.591753e-01
Iteration 100: Loss: 4.562556e-01
Iteration 110: Loss: 4.531167e-01
Iteration 120: Loss: 4.489444e-01
Iteration 130: Loss: 4.392228e-01
Iteration 140: Loss: 4.347853e-01
Iteration 150: Loss: 4.341757e-01
Iteration 160: Loss: 4.325102e-01
Iteration 170: Loss: 4.321948e-01
Iteration 180: Loss: 4.318990e-01
Iteration 190: Loss: 4.313784e-01
Iteration 200: Loss: 4.311314e-01
Model Loss Function
The modelLoss function takes as input a neural network net, input data X, and targets T. The function returns the loss and the gradients of the loss with respect to the network learnable parameters.
function [loss, gradients] = modelLoss(net, X, T)
    % Forward pass, cross-entropy loss, and gradients with respect to the learnables.
    Y = forward(net,X);
    loss = crossentropy(Y,T);
    gradients = dlgradient(loss,net.Learnables);
end
Algorithms
Limited-Memory BFGS
The L-BFGS algorithm [1] is a quasi-Newton method that approximates the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. Use the L-BFGS algorithm for small networks and data sets that you can process in a single batch.
The algorithm updates learnable parameters W at iteration k+1 using the update step given by
$$W_{k+1} = W_k - \eta_k B_k^{-1} \nabla J(W_k),$$
where $W_k$ denotes the weights at iteration k, $\eta_k$ is the learning rate at iteration k, $B_k$ is an approximation of the Hessian matrix at iteration k, and $\nabla J(W_k)$ denotes the gradients of the loss with respect to the learnable parameters at iteration k.
The L-BFGS algorithm computes the matrix-vector product $B_k^{-1} \nabla J(W_k)$ directly. The algorithm does not require computing the inverse of $B_k$.
To save memory, the L-BFGS algorithm does not store and invert the dense Hessian matrix B. Instead, the algorithm uses the approximation $B_{k-m}^{-1} \approx \lambda_k I$, where m is the history size, the inverse Hessian factor $\lambda_k$ is a scalar, and I is the identity matrix. The algorithm then stores the scalar inverse Hessian factor only. The algorithm updates the inverse Hessian factor at each step.
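To compute the matrix-vector product $B_k^{-1} \nabla J(W_k)$ directly, the L-BFGS algorithm uses this recursive algorithm:
1. Set $r = B_{k-m}^{-1} \nabla J(W_k)$, where m is the history size.
2. For $i = m, \ldots, 1$:
    a. Let $\beta = \frac{y_{k-i}^\top r}{s_{k-i}^\top y_{k-i}}$, where $s_{k-i}$ and $y_{k-i}$ are the step and gradient differences for iteration $k-i$, respectively.
    b. Set $r = r + s_{k-i}(a_{k-i} - \beta)$, where $a$ is derived from $s$, $y$, and the gradients of the loss with respect to the learnable parameters. For more information, see [1].
3. Return $B_k^{-1} \nabla J(W_k) = r$.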
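The recursion above can be sketched in plain MATLAB for vector-valued parameters. This is an illustrative sketch of the standard two-loop recursion from [1], not the lbfgsupdate implementation. The variable names S, Y, g, lambda, rho, a, and b and the small hand-picked history are hypothetical. The first loop computes the coefficients a, which the steps above describe only as derived from the steps, the gradient differences, and the gradients.

```matlab
% Hypothetical history, oldest pair first: columns of S are steps,
% columns of Y are the matching gradient differences.
S = [1 0; 0 1; 0 0];
Y = [2 0; 0 3; 0 0];
g = [1; 1; 1];      % current gradient (made-up values)
lambda = 0.5;       % inverse Hessian factor (made-up value)

m = size(S,2);      % number of stored pairs
rho = zeros(m,1);
a = zeros(m,1);

% First loop, newest to oldest: compute the coefficients a.
q = g;
for j = m:-1:1
    rho(j) = 1/(Y(:,j)'*S(:,j));
    a(j) = rho(j)*(S(:,j)'*q);
    q = q - a(j)*Y(:,j);
end

% Apply the scalar approximation lambda*I of the inverse Hessian.
r = lambda*q;

% Second loop, oldest to newest: the update described in the steps above.
for j = 1:m
    b = rho(j)*(Y(:,j)'*r);
    r = r + S(:,j)*(a(j) - b);
end
```

For this history, r is [0.5; 1/3; 0.5]. The first two components reflect the curvature captured by the stored pairs (2 and 3 along the first two coordinates), and the third component, which the history never touches, is scaled by lambda only.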
References
[1] Liu, Dong C., and Jorge Nocedal. "On the limited memory BFGS method for large scale optimization." Mathematical programming 45, no. 1 (August 1989): 503-528. https://doi.org/10.1007/BF01589116.
Version History
Introduced in R2023a
R2024b: Specify initial step size for L-BFGS solver
Specify the initial step size for the L-BFGS solver using the InitialStepSize argument.
R2023b: Inspect norm of initial gradients
Inspect the norm of the initial gradients using the InitialGradientsNorm property.