Train Reinforcement Learning Policy Using Custom Training Loop

This example shows how to define a custom training loop for a reinforcement learning policy. You can use this workflow to train reinforcement learning policies with your own custom training algorithms rather than using one of the built-in agents from the Reinforcement Learning Toolbox™ software.

Using this workflow, you can train policies that use the policy and value function approximators provided by Reinforcement Learning Toolbox software.

In this example, a discrete actor policy with a discrete action space is trained using the REINFORCE algorithm (with no baseline). For more information on the REINFORCE algorithm, see Policy Gradient (PG) Agents.

Fix the random generator seed for reproducibility.


For more information on the functions you can use for custom training, see Functions for Custom Training.


For this example, a reinforcement learning policy is trained in a discrete cart-pole environment. The objective in this environment is to balance the pole by applying forces (actions) on the cart. Create the environment using the rlPredefinedEnv function.

env = rlPredefinedEnv("CartPole-Discrete");

Extract the observation and action specifications from the environment.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

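Obtain the dimension of the observation space (numObs) and the dimension of the action space (numAct). For this environment, the action is a scalar that takes one of the values listed in actInfo.Elements.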

numObs = obsInfo.Dimension(1);
numAct = actInfo.Dimension(1);

For more information on this environment, see Load Predefined Control System Environments.


The reinforcement learning policy in this example is a discrete-action stochastic policy. It is modeled by a deep neural network that contains fullyConnectedLayer, reluLayer, and softmaxLayer layers. This network outputs probabilities for each discrete action given the current observations. The softmaxLayer ensures that the actor outputs probability values in the range [0 1] and that all probabilities sum to 1.
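The effect of the softmax output can be checked by hand. The following is a minimal sketch in base MATLAB, not the toolbox implementation of softmaxLayer; the logits values and the variable names are hypothetical and chosen only for illustration.

```matlab
% Hypothetical raw network outputs (logits) for two discrete actions.
logits = [1.5; -0.3];

% Subtracting the maximum leaves the result unchanged but avoids overflow.
shifted = exp(logits - max(logits));
actionProb = shifted ./ sum(shifted);

disp(actionProb)       % each entry lies in the range [0 1]
disp(sum(actionProb))  % the entries sum to 1
```

The larger logit receives the larger probability, so the ordering of the raw outputs is preserved.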

Create the deep neural network for the actor.

actorNetwork = [

Convert to dlnetwork.

actorNetwork = dlnetwork(actorNetwork);

Create the actor using an rlDiscreteCategoricalActor object.

actor = rlDiscreteCategoricalActor(actorNetwork,obsInfo,actInfo);

Accelerate the gradient computation of the actor.

actor = accelerate(actor,true);

Evaluate the policy with a random observation as input.

policyEvalOutCell = evaluate(actor,{rand(obsInfo.Dimension)});
policyEvalOut = policyEvalOutCell{1}
policyEvalOut = 2x1 single column vector


Create the optimizer using the rlOptimizerOptions and rlOptimizer functions.

actorOpts = rlOptimizerOptions(LearnRate=1e-2);
actorOptimizer = rlOptimizer(actorOpts);

Training Setup

Configure the training to use the following options:

  • Set up the training to last at most 5000 episodes, with each episode lasting at most 250 steps.

  • To calculate the discounted reward, choose a discount factor of 0.995.

  • Terminate the training after the maximum number of episodes is reached or when the average reward across 100 episodes reaches the value of 220.

numEpisodes = 5000;
maxStepsPerEpisode = 250;
discountFactor = 0.995;
avgWindowSize = 100;
trainingTerminationValue = 220;
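An average over a window of episodes can be computed with the base MATLAB movmean function. How the window sits relative to the current episode depends on the syntax, which affects what "the average across 100 episodes" means at the most recent episode. The sketch below uses hypothetical reward values and a small window so that the results are easy to verify by hand; the variable names are illustrative only.

```matlab
% Hypothetical episode rewards, for illustration only.
episodeRewards = [10 20 30 40 50 60];
windowSize = 4;

% Default syntax: for an even window size, the window is centered about
% the current and previous elements and is truncated at the endpoints.
centeredAvg = movmean(episodeRewards,windowSize,2);

% Trailing window: the current episode and the windowSize-1 before it.
trailingAvg = movmean(episodeRewards,[windowSize-1 0],2);

disp(centeredAvg(end))  % mean of the last three rewards
disp(trailingAvg(end))  % mean of the last four rewards
```

With the default syntax, the value at the latest episode averages fewer episodes than the nominal window size. A trailing window is one option if the stopping test should use exactly the most recent episodes.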

Create a vector to store the cumulative reward for each training episode.

episodeCumulativeRewardVector = [];

Create a figure for training visualization using the hBuildFigure helper function.

[trainingPlot,lineReward,lineAveReward] = hBuildFigure;

Custom Training Loop

The algorithm for the custom training loop is as follows. For each episode:

  1. Reset the environment.

  2. Create buffers for storing experience information: observations, actions, and rewards.

  3. Generate experiences until a terminal condition occurs. To do so, evaluate the policy to get actions, apply those actions to the environment, and obtain the resulting observations and rewards. Store the actions, observations, and rewards in buffers.

  4. Collect the training data as a batch of experiences.

  5. Compute the episode Monte Carlo return, which is the discounted future reward.

  6. Compute the gradient of the loss function with respect to the policy parameters.

  7. Update the policy using the computed gradients.

  8. Update the training visualization.

  9. Terminate training if the policy is sufficiently trained.
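Step 5 can be written out explicitly. In the notation below, which is introduced here for illustration and does not appear in the example code, $r_k$ is the reward at step $k$, $\gamma$ is the discount factor (discountFactor), and $T$ is the number of steps in the episode (batchSize).

```latex
G_t = \sum_{k=t}^{T} \gamma^{\,k-t}\, r_k , \qquad t = 1,\dots,T
% Equivalent backward recursion, with G_{T+1} = 0:
G_t = r_t + \gamma\, G_{t+1}
```

The first form matches the nested loops over t and k in the training code. The recursion gives the same values in a single backward pass.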

% Enable the training visualization plot.
set(trainingPlot,Visible="on");

% Train the policy for the maximum number of episodes 
% or until the average reward indicates that the policy 
% is sufficiently trained.
for episodeCt = 1:numEpisodes
    % 1. Reset the environment at the start of the episode

    obs = reset(env);    
    episodeReward = zeros(maxStepsPerEpisode,1);
    % 2. Create buffers to store experiences. 
    % The dimensions for each buffer must be as follows.
    % For the observation buffer: 
    %  numberOfObservations x ...
    %  numberOfObservationChannels x ...
    %  batchSize
    % For action buffer: 
    %     numberOfActions x ...
    %     numberOfActionChannels x ...
    %     batchSize
    % For reward buffer: 
    %     1 x batchSize
    observationBuffer = zeros(numObs,1,maxStepsPerEpisode);
    actionBuffer = zeros(numAct,1,maxStepsPerEpisode);
    rewardBuffer = zeros(1,maxStepsPerEpisode);
    % 3. Generate experiences 
    %    for the maximum number of steps per episode 
    %    or until a terminal condition is reached.
    for stepCt = 1:maxStepsPerEpisode
        % Compute an action using the policy 
        % based on the current observation.
        action = getAction(actor,{obs});
        % Apply the action to the environment 
        % and obtain the resulting observation and reward.
        [nextObs,reward,isdone] = step(env,action{1});
        % Store the action, observation, 
        % and reward experiences in their buffers.
        observationBuffer(:,:,stepCt) = obs;
        actionBuffer(:,:,stepCt) = action{1};
        rewardBuffer(:,stepCt) = reward;
        episodeReward(stepCt) = reward;
        obs = nextObs;
        % Stop if a terminal condition is reached.
        if isdone
            break
        end
    end
    % 4. Create training data. 
    % Training is performed using batch data. 
    % The batch size cannot exceed the length of the episode.
    batchSize = min(stepCt,maxStepsPerEpisode);
    observationBatch = observationBuffer(:,:,1:batchSize);
    actionBatch = actionBuffer(:,:,1:batchSize);
    rewardBatch = rewardBuffer(:,1:batchSize);

    % Compute the discounted future reward.
    discountedReturn = zeros(1,batchSize);
    for t = 1:batchSize
        G = 0;
        for k = t:batchSize
            G = G + discountFactor ^ (k-t) * rewardBatch(k);
        end
        discountedReturn(t) = G;
    end

    % 5. Organize data to pass to the loss function.
    lossData.batchSize = batchSize;
    lossData.actInfo = actInfo;
    lossData.actionBatch = actionBatch;
    lossData.discountedReturn = discountedReturn;
    % 6. Compute the gradient of the loss 
    %    with respect to the policy parameters.
    actorGradient = gradient(actor,@actorLossFunction,...
        {observationBatch},lossData);
    % 7. Update the actor network using the computed gradients.
    % for more information, at the command line, type:
    % help rl.optimizer.AbstractOptimizer/update
    [actor,actorOptimizer] = update( ...
        actorOptimizer, ...
        actor, ...
        actorGradient);

    % 8. Update the training visualization.
    episodeCumulativeReward = sum(episodeReward);
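    episodeCumulativeRewardVector = cat(2,...
        episodeCumulativeRewardVector,episodeCumulativeReward);
    movingAvgReward = movmean(episodeCumulativeRewardVector,...
        avgWindowSize,2);
    addpoints(lineReward,episodeCt,episodeCumulativeReward);
    addpoints(lineAveReward,episodeCt,movingAvgReward(end));
    drawnow;
    % 9. Terminate training if the network is sufficiently trained.
    if max(movingAvgReward) > trainingTerminationValue
        break
    end
end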

Figure: Cart Pole Custom Training. Training Progress plot of Cumulative Reward and Average Reward (y-axis: Reward) against Episode (x-axis).
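The nested loops that compute the discounted return in the training loop do work proportional to the square of the episode length. A single backward pass produces the same values. The following is a sketch of that alternative in base MATLAB, not the code used in this example; the reward values are hypothetical, and runningReturn is a name introduced here.

```matlab
% Hypothetical rewards for a four-step episode, for illustration only.
rewardBatch = [1 1 1 1];
discountFactor = 0.995;
batchSize = numel(rewardBatch);

% Work backward in time: G(t) = r(t) + discountFactor * G(t+1).
discountedReturn = zeros(1,batchSize);
runningReturn = 0;
for t = batchSize:-1:1
    runningReturn = rewardBatch(t) + discountFactor * runningReturn;
    discountedReturn(t) = runningReturn;
end

disp(discountedReturn)
```

For episodes capped at 250 steps the difference in cost is small, so the nested form remains a reasonable choice when readability matters more.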


After training, simulate the trained policy.

Before simulation, reset the environment.

obs = reset(env);

Enable the environment visualization, which is updated each time the environment step function is called.

plot(env)

For each simulation step, perform the following actions.

  1. Get the action by sampling from the policy using the getAction function.

  2. Step the environment using the obtained action value.

  3. Terminate if a terminal condition is reached.

for stepCt = 1:maxStepsPerEpisode
    % Select action according to trained policy
    action = getAction(actor,{obs});
    % Step the environment
    [nextObs,reward,isdone] = step(env,action{1});
    % Check for terminal condition
    if isdone
        break
    end
    obs = nextObs;
end

Figure: Cart Pole Visualizer.

Functions for Custom Training

To obtain actions and value functions for given observations from Reinforcement Learning Toolbox policy and value function approximators, you can use the following functions.

  • getValue — Obtain the estimated state value or state-action value function.

  • getAction — Obtain the action from an actor based on the current observation.

  • getMaxQValue — Obtain the estimated maximum state-action value function for a discrete Q-value approximator.

If your policy or value function approximator is a recurrent neural network, that is, a neural network with at least one layer that has hidden state information, the preceding functions can return the current network state. You can use the following function syntaxes to get and set the state of your approximator.

  • state = getState(critic) — Obtain the state of approximator critic.

  • newCritic = setState(oldCritic,state) — Set the state of approximator oldCritic, and return the result in newCritic.

  • newCritic = resetState(oldCritic) — Reset all state values of oldCritic to zero and return the result in newCritic.

You can get and set the learnable parameters of your approximator using the getLearnableParameters and setLearnableParameters functions, respectively.

In addition to these functions, you can use the gradient, optimize, and syncParameters functions to set parameters and compute gradients for your policy and value function approximators.


The gradient function computes the gradients of the approximator loss function. You can compute several different gradients. For example, to compute the gradient of the sum of the approximator outputs with respect to its inputs, use the following syntax.

grad = gradient(actor,"output-input",inputData)

Here:

  • actor is a policy or value function approximator object.

  • inputData contains values for the input channels to the approximator (e.g. an observation).

  • grad contains the computed gradients.

For more information, see gradient.


The syncParameters function updates the learnable parameters of one policy or value function approximator based on those of another approximator. This function is useful for updating a target actor or critic approximator, as is done for DDPG agents. To synchronize parameter values between two approximators, use the following syntax.

newTargetApproximator = syncParameters( ...
   oldTargetApproximator, ...
   sourceApproximator, ...
   smoothFactor)

Here:

  • oldTargetApproximator is a policy or value function approximator object with parameters θold.

  • sourceApproximator is a policy or value function approximator object with the same structure as oldTargetApproximator, but with parameters θsource.

  • smoothFactor is a smoothing factor (τ) for the update.

  • newTargetApproximator has the same structure as oldTargetApproximator, but its parameters are θnew = τθsource + (1-τ)θold.

For more information, at the MATLAB command line, type help rl.function.AbstractFunction.syncParameters
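The smoothed update rule can be tried on plain numeric arrays. The following is a minimal sketch in base MATLAB that applies the formula to hypothetical parameter values stored in cell arrays. It illustrates the arithmetic only and does not call syncParameters or reproduce how the toolbox stores parameters internally.

```matlab
% Hypothetical parameter values, stored as cell arrays of numeric arrays.
thetaOld    = {[0 0; 0 0], [1; 1]};
thetaSource = {[2 4; 6 8], [3; 5]};
smoothFactor = 0.1;   % tau

% thetaNew = tau*thetaSource + (1 - tau)*thetaOld, applied array by array.
thetaNew = cellfun(@(src,old) smoothFactor*src + (1 - smoothFactor)*old, ...
    thetaSource,thetaOld,UniformOutput=false);

disp(thetaNew{1})
disp(thetaNew{2})
```

A smoothing factor of 1 copies the source parameters outright, while a small value moves the target only slightly toward the source at each update.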

Loss Function

The loss function in the REINFORCE algorithm is the product of the discounted return and the logarithm of the probability of the selected action (obtained by evaluating the policy for a given observation), summed across all time steps and negated so that minimizing the loss increases the expected return. The discounted return calculated in the custom training loop must be resized so that it can be multiplied with the logarithm of the action probabilities.

The first input argument of the function must be a cell array like the one returned from the evaluation of a function approximator object. For more information, see the description of outData in evaluate. The second, optional, input argument contains additional data that might be needed by the loss calculation function. For more information, see gradient.
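Written as a formula, the loss computed by actorLossFunction takes the following form. The symbols are introduced here for illustration: $\pi_\theta(a_t \mid s_t)$ is the probability that the policy with parameters $\theta$ assigns to the action $a_t$ taken at observation $s_t$, $G_t$ is the discounted return from step $t$, and $T$ is the batch size.

```latex
L(\theta) = -\sum_{t=1}^{T} G_t \, \log \pi_\theta(a_t \mid s_t)
```

Only the probability of the action that was actually taken contributes at each step, which is why the code masks the return with an action indication matrix before multiplying by the log probabilities.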

function loss = actorLossFunction(ActProbCell,lossFcnStruct)

    % Extract the matrix resulting from the policy evaluation 
    ActProb = ActProbCell{1};

    % Create the action indication matrix.
    batchSize = lossFcnStruct.batchSize;
    Z = repmat(lossFcnStruct.actInfo.Elements',1,batchSize);
    actionIndicationMatrix = (lossFcnStruct.actionBatch(:,:)==Z);
    % Resize the discounted return to the size of ActProb.
    G = actionIndicationMatrix .* lossFcnStruct.discountedReturn;
    G = reshape(G,size(ActProb));
    % Round any action probability values less than eps to eps.
    ActProb(ActProb < eps) = eps;
    % Compute the loss.
    loss = -sum(G .* log(ActProb),"all");
end
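The action indication matrix can be inspected on a small batch. The following sketch in base MATLAB repeats the masking steps from actorLossFunction on hypothetical data; actionElements stands in for actInfo.Elements, and the actions and returns are made up for illustration.

```matlab
% Hypothetical action set and a batch of three sampled actions.
actionElements = [-10 10];       % stands in for actInfo.Elements
batchSize = 3;
actionBatch = zeros(1,1,batchSize);
actionBatch(:,:,1) = 10;
actionBatch(:,:,2) = -10;
actionBatch(:,:,3) = 10;
discountedReturn = [2.9 1.9 1.0];

% One row per possible action, one column per time step.
Z = repmat(actionElements',1,batchSize);
% actionBatch(:,:) flattens the trailing dimensions to 1-by-batchSize,
% which expands against the 2-by-batchSize matrix Z.
actionIndicationMatrix = (actionBatch(:,:) == Z);

% Keep each return only in the row of the action that was taken.
G = actionIndicationMatrix .* discountedReturn;
disp(G)
```

Each column of G has a single nonzero entry, so the elementwise product with the log probabilities picks out the log probability of the chosen action at each step.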

Helper Function

The following helper function creates a figure for training visualization.

function [trainingPlt, lineRewd, lineAvgRwd] = hBuildFigure()
    plotRatio = 16/9;
    trainingPlt = figure(...
                HandleVisibility="off", ...
                Name="Cart Pole Custom Training");

    trainingPlt.Position(3) = ...
         plotRatio * trainingPlt.Position(4);
    ax = gca(trainingPlt);
    lineRewd = animatedline(ax);
    lineAvgRwd = animatedline(ax,Color="r",LineWidth=3);
    xlabel(ax,"Episode");
    ylabel(ax,"Reward");
    legend(ax,"Cumulative Reward","Average Reward");
    title(ax,"Training Progress");
end
