## Twin-Delayed Deep Deterministic (TD3) Policy Gradient Agent

The twin-delayed deep deterministic (TD3) policy gradient algorithm is an off-policy actor-critic method for environments with a continuous action space. A TD3 agent learns a deterministic policy while also using two Q-value function critics to estimate the value of the optimal policy. It features a target actor and target critics as well as an experience buffer. TD3 agents support offline training (training from saved data, without an environment).

The TD3 algorithm is an extension of the DDPG algorithm. DDPG agents can overestimate value functions, which can produce suboptimal policies. To reduce value function overestimation, the TD3 algorithm includes the following modifications of the DDPG algorithm.

1. A TD3 agent learns two Q-value functions and uses the minimum value function estimate during policy updates.
2. A TD3 agent updates the policy and targets less frequently than the Q functions.
3. When computing the value function target, a TD3 agent adds noise to the target action, which makes the policy less likely to exploit actions with high Q-value estimates.
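The effect of taking the minimum of two value estimates can be illustrated with a small simulation. The following is a toy Python sketch, not toolbox code: it assumes a finite set of actions that all have a true value of 0, two critics with independent Gaussian estimation noise, and a hypothetical helper named `overestimation_demo`. Any positive average is therefore pure overestimation bias.

```python
import random


def overestimation_demo(num_actions=10, trials=2000, noise_std=1.0, seed=0):
    """Compare a single noisy critic with the minimum of two noisy critics.

    Every action has a true value of 0 (an assumption of this toy setup),
    so the returned averages measure estimation bias only.
    """
    rng = random.Random(seed)
    single_total = 0.0
    clipped_total = 0.0
    for _ in range(trials):
        q1 = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        q2 = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        # Pick the action that looks best according to the first critic.
        best = max(range(num_actions), key=q1.__getitem__)
        single_total += q1[best]                  # single-critic estimate
        clipped_total += min(q1[best], q2[best])  # minimum over both critics
    return single_total / trials, clipped_total / trials


single_bias, clipped_bias = overestimation_demo()
print(f"single critic: {single_bias:.3f}, minimum of two: {clipped_bias:.3f}")
```

In this setup the single-critic average is positive even though every true value is 0, while the minimum over both critics is never larger than the single-critic estimate.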

You can use a TD3 agent to implement one of the following training algorithms, depending on the number of critics you specify.

TD3 — Train the agent with two Q-value functions. This algorithm implements all three of the preceding modifications.

Delayed DDPG — Train the agent with a single Q-value function. This algorithm trains a DDPG agent with target policy smoothing and delayed policy and target updates.

For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

In Reinforcement Learning Toolbox™, a TD3 agent is implemented by an `rlTD3Agent` object.

TD3 agents can be trained in environments with the following observation and action spaces.

Observation Space | Action Space |
---|---|
Continuous or discrete | Continuous |

TD3 agents use the following actor and critics.

Critics | Actor |
---|---|
One or more Q-value function critics | Deterministic policy actor |

During training, a TD3 agent:

- Updates the actor and critic learnable parameters at each time step during learning.
- Stores past experiences using a circular experience buffer. The agent updates the actor and critic using a mini-batch of experiences randomly sampled from the buffer.
- Perturbs the action chosen by the policy using a stochastic noise model at each training step.
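The circular experience buffer described above can be sketched in a few lines of Python. This is a minimal illustration rather than the toolbox implementation: the class name `CircularExperienceBuffer` and its methods are hypothetical, and it assumes uniform sampling without replacement.

```python
import random
from collections import deque


class CircularExperienceBuffer:
    """Fixed-capacity buffer; the oldest experience is dropped when full."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen discards the oldest item on overflow.
        self._data = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def __len__(self):
        return len(self._data)

    def append(self, obs, action, reward, next_obs, done):
        """Store one experience tuple (S, A, R, S', terminal flag)."""
        self._data.append((obs, action, reward, next_obs, done))

    def sample(self, mini_batch_size):
        """Uniformly sample a mini-batch without replacement."""
        return self._rng.sample(list(self._data), mini_batch_size)


buffer = CircularExperienceBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.append(obs=step, action=0.0, reward=1.0, next_obs=step + 1, done=False)
# Only the three most recent experiences (obs 2, 3 and 4) remain.
mini_batch = buffer.sample(2)
```

Sampling at random from the buffer, rather than using only the most recent experience, is what lets the agent learn off-policy from past data.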

### Actor and Critic Functions

To estimate the policy and value function, a TD3 agent maintains the following function approximators:

- Deterministic actor *π*(*S*;*θ*): The actor, with parameters *θ*, takes observation *S* and returns the corresponding action that maximizes the long-term reward.
- Target actor *π*_{t}(*S*;*θ*_{t}): To improve the stability of the optimization, the agent periodically updates the target actor learnable parameters *θ*_{t} using the latest actor parameter values.
- One or two Q-value critics *Q*_{k}(*S*,*A*;*ϕ*_{k}): The critics, each with different parameters *ϕ*_{k}, take observation *S* and action *A* as inputs and return the corresponding expectation of the long-term reward.
- One or two target critics *Q*_{tk}(*S*,*A*;*ϕ*_{tk}): To improve the stability of the optimization, the agent periodically updates the target critic learnable parameters *ϕ*_{tk} using the latest corresponding critic parameter values. The number of target critics matches the number of critics.

Both *π*(*S*;*θ*) and *π*_{t}(*S*;*θ*_{t}) have the same structure and parameterization.

For each critic, *Q*_{k}(*S*,*A*;*ϕ*_{k}) and *Q*_{tk}(*S*,*A*;*ϕ*_{tk}) have the same structure and parameterization.

When using two critics, *Q*_{1}(*S*,*A*;*ϕ*_{1}) and *Q*_{2}(*S*,*A*;*ϕ*_{2}), each critic can have a different structure, though TD3 works best when the critics have the same structure. When the critics have the same structure, they must have different initial parameter values.
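The requirement that same-structure critics start from different parameter values, while each target critic starts as a copy of its critic, can be sketched as follows. This is an illustrative Python fragment only: the helper `init_linear_params`, the four-input linear parameterization and the seeds are assumptions, not part of the toolbox.

```python
import copy
import random


def init_linear_params(num_inputs, seed):
    """Draw small random weights for a hypothetical linear critic."""
    rng = random.Random(seed)
    return [rng.uniform(-0.1, 0.1) for _ in range(num_inputs)]


# Two critics with the same structure (4 weights each, for example
# 3 observation inputs plus 1 action input) but different seeds,
# so their initial parameter values differ.
critic_params = [init_linear_params(4, seed=k) for k in (1, 2)]

# Each target critic starts as an independent copy of its critic.
target_critic_params = copy.deepcopy(critic_params)
```

If both critics started from identical parameters and were trained on the same mini-batches, they would remain identical, and taking the minimum of their outputs would have no effect.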

For more information on creating actors and critics for function approximation, see Create Policies and Value Functions.

During training, the agent tunes the parameter values in *θ*. After
training, the parameters remain at their tuned value and the trained actor function
approximator is stored in *π*(*S*).

### Agent Creation

You can create and train TD3 agents at the MATLAB® command line or using the Reinforcement Learning Designer app. For more information on creating agents using Reinforcement Learning Designer, see Create Agents Using Reinforcement Learning Designer.

At the command line, you can create a TD3 agent with default actor and critics based on the observation and action specifications from the environment. To do so, perform the following steps.

1. Create observation specifications for your environment. If you already have an environment object, you can obtain these specifications using `getObservationInfo`.
2. Create action specifications for your environment. If you already have an environment object, you can obtain these specifications using `getActionInfo`.
3. If needed, specify the number of neurons in each learnable layer of the default network or whether to use an LSTM layer. To do so, create an agent initialization option object using `rlAgentInitializationOptions`.
4. If needed, specify agent options using an `rlTD3AgentOptions` object.
5. Create the agent using an `rlTD3Agent` object.

Alternatively, you can create actor and critics and use these objects to create your agent. In this case, ensure that the input and output dimensions of the actor and critics match the corresponding action and observation specifications of the environment.

1. Create an actor using an `rlContinuousDeterministicActor` object.
2. Create one or two critics using `rlQValueFunction` objects.
3. Specify agent options using an `rlTD3AgentOptions` object (alternatively, you can skip this step and then modify the agent options later using dot notation).
4. Create the agent using an `rlTD3Agent` object.

For more information on creating actors and critics for function approximation, see Create Policies and Value Functions.

### Training Algorithm

TD3 agents use the following training algorithm, in which they update their actor and critic models at each time step. To configure the training algorithm, specify options using an `rlTD3AgentOptions` object. Here, *K* = 2 is the number of critics and *k* is the critic index.

- Initialize each critic *Q*_{k}(*S*,*A*;*ϕ*_{k}) with random parameter values *ϕ*_{k}, and initialize each target critic with the same random parameter values: $${\varphi}_{tk}={\varphi}_{k}$$.

- Initialize the actor *π*(*S*;*θ*) with random parameter values *θ*, and initialize the target actor with the same parameter values: $${\theta}_{t}=\theta$$.

- For each training time step:

  1. For the current observation *S*, select action *A* = *π*(*S*;*θ*) + *N*, where *N* is stochastic noise from the noise model. To configure the noise model, use the `ExplorationModel` option.

  2. Execute action *A*. Observe the reward *R* and next observation *S'*.

  3. Store the experience (*S*,*A*,*R*,*S'*) in the experience buffer. To specify the size of the experience buffer, use the `ExperienceBufferLength` option in the agent `rlTD3AgentOptions` object.

  4. Sample a random mini-batch of *M* experiences (*S*_{i},*A*_{i},*R*_{i},*S'*_{i}) from the experience buffer. To specify *M*, use the `MiniBatchSize` option.

  5. If *S'*_{i} is a terminal state, set the value function target *y*_{i} to *R*_{i}. Otherwise, set it to

     $${y}_{i}={R}_{i}+\gamma \,\underset{k}{\mathrm{min}}\left({Q}_{tk}\left({S}_{i}',\text{clip}\left({\pi}_{t}\left({S}_{i}';{\theta}_{t}\right)+\epsilon \right);{\varphi}_{tk}\right)\right)$$

     The value function target is the sum of the experience reward *R*_{i} and the minimum discounted future reward from the critics. To specify the discount factor *γ*, use the `DiscountFactor` option.

     To compute the cumulative reward, the agent first computes a next action by passing the next observation *S'*_{i} from the sampled experience to the target actor. Then, the agent adds noise *ε* to the computed action using the `TargetPolicySmoothModel`, and clips the action based on the upper and lower noise limits. The agent finds the cumulative rewards by passing the next action to the target critics.

     If you specify a value of `NumStepsToLookAhead` equal to *N*, then the *N*-step return (which adds the rewards of the following *N* steps and the discounted estimated value of the state that caused the *N*-th reward) is used to calculate the target *y*_{i}.

  6. At every training time step, update the parameters of each critic by minimizing the loss *L*_{k} across all sampled experiences.

     $${L}_{k}=\frac{1}{2M}\sum _{i=1}^{M}{\left({y}_{i}-{Q}_{k}\left({S}_{i},{A}_{i};{\varphi}_{k}\right)\right)}^{2}$$

  7. Every *D*_{1} steps, update the actor parameters using the following sampled policy gradient to maximize the expected discounted cumulative long-term reward. To set *D*_{1}, use the `PolicyUpdateFrequency` option.

     $$\begin{array}{l}{\nabla}_{\theta}J\approx \frac{1}{M}\sum _{i=1}^{M}{G}_{ai}{G}_{\pi i}\\ {G}_{ai}={\nabla}_{A}\underset{k}{\mathrm{min}}\left({Q}_{k}\left({S}_{i},A;{\varphi}_{k}\right)\right)\quad \text{where }A=\pi \left({S}_{i};\theta \right)\\ {G}_{\pi i}={\nabla}_{\theta}\pi \left({S}_{i};\theta \right)\end{array}$$

     Here, *G*_{ai} is the gradient of the minimum critic output with respect to the action computed by the actor network, and *G*_{πi} is the gradient of the actor output with respect to the actor parameters. Both gradients are evaluated for observation *S*_{i}.

  8. Every *D*_{2} steps, update the target actor and critics depending on the target update method. To specify *D*_{2}, use the `TargetUpdateFrequency` option. For more information, see Target Update Methods.
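The value function target and the critic loss from the algorithm above can be sketched in Python for a scalar action. This is a hedged illustration, not the toolbox implementation: the function names, the Gaussian smoothing noise, and the parameters `noise_std`, `noise_clip`, `act_low` and `act_high` are assumptions. The clipping follows the convention of the original TD3 paper (clip the noise, then clip the action to its bounds), which may differ in detail from how `TargetPolicySmoothModel` is configured.

```python
import random


def td3_target(reward, next_obs, done, target_actor, target_critics,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               act_low=-1.0, act_high=1.0, rng=random):
    """Value function target y_i for one experience with a scalar action."""
    if done:
        return reward  # terminal next state: no bootstrapped term
    # Target policy smoothing: add clipped Gaussian noise to the target action.
    eps = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    next_action = max(act_low, min(act_high, target_actor(next_obs) + eps))
    # Bootstrap from the smallest target-critic estimate.
    q_min = min(q(next_obs, next_action) for q in target_critics)
    return reward + gamma * q_min


def critic_loss(targets, predictions):
    """L_k = 1/(2M) * sum_i (y_i - Q_k(S_i, A_i))^2."""
    m = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / (2 * m)


# Toy target actor and target critics (hypothetical, for illustration only).
actor_t = lambda s: 0.5 * s
critics_t = [lambda s, a: s + a, lambda s, a: s + a + 1.0]

# With the noise switched off: y = 1.0 + 0.5 * min(3.0, 4.0) = 2.5
y = td3_target(1.0, 2.0, False, actor_t, critics_t, gamma=0.5, noise_std=0.0)
```

Each critic is trained against the same target *y*_{i}, so `critic_loss` would be evaluated once per critic with that critic's own predictions.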

For simplicity, the actor and critic updates in this algorithm show a gradient update using basic stochastic gradient descent. The actual gradient update method depends on the optimizer you specify in the `rlOptimizerOptions` object assigned to the `rlCriticOptimizerOptions` property.
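As a concrete instance of the sampled policy gradient with a plain gradient step, the following Python sketch uses a scalar linear actor *π*(*S*;*θ*) = *θS* and critics whose action derivatives are supplied analytically. All names here (`policy_gradient`, `sgd_ascent_step`, the toy critics) are hypothetical, and a real agent would obtain these gradients by automatic differentiation and apply the configured optimizer instead.

```python
def policy_gradient(theta, observations, critics):
    """Sampled policy gradient for the scalar linear actor pi(S; theta) = theta * S.

    `critics` is a list of (q, dq_da) pairs: a Q-value function and its
    derivative with respect to the action.
    """
    total = 0.0
    for s in observations:
        a = theta * s  # A = pi(S_i; theta)
        # Select the critic with the smallest output at (S_i, A).
        _, dq_da = min(critics, key=lambda c: c[0](s, a))
        g_a = dq_da(s, a)  # G_ai: gradient of the minimum critic w.r.t. the action
        g_pi = s           # G_pi_i: d(theta * s) / d(theta)
        total += g_a * g_pi
    return total / len(observations)


def sgd_ascent_step(theta, grad, learning_rate=0.1):
    """Plain gradient ascent step, since the actor maximizes J."""
    return theta + learning_rate * grad


# Toy critics that both peak at A = S; the second is uniformly lower.
q1 = (lambda s, a: -(a - s) ** 2, lambda s, a: -2.0 * (a - s))
q2 = (lambda s, a: -(a - s) ** 2 - 1.0, lambda s, a: -2.0 * (a - s))

theta = 0.0
for _ in range(100):
    grad = policy_gradient(theta, [1.0, 2.0], [q1, q2])
    theta = sgd_ascent_step(theta, grad)
# theta approaches 1.0, the actor that outputs A = S.
```

The minimum over critics is not differentiable where two critics tie. This sketch sidesteps that by differentiating whichever critic is currently smallest.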

### Target Update Methods

TD3 agents update their target actor and critic parameters using one of the following target update methods.

- **Smoothing**: Update the target parameters at every time step using smoothing factor *τ*. To specify the smoothing factor, use the `TargetSmoothFactor` option.

  $$\begin{array}{ll}{\varphi}_{tk}=\tau {\varphi}_{k}+\left(1-\tau \right){\varphi}_{tk} & \left(\text{critic parameters}\right)\\ {\theta}_{t}=\tau \theta +\left(1-\tau \right){\theta}_{t} & \left(\text{actor parameters}\right)\end{array}$$

- **Periodic**: Update the target parameters periodically without smoothing (`TargetSmoothFactor = 1`). To specify the update period, use the `TargetUpdateFrequency` parameter.

  $$\begin{array}{l}{\varphi}_{tk}={\varphi}_{k}\\ {\theta}_{t}=\theta \end{array}$$

- **Periodic Smoothing**: Update the target parameters periodically with smoothing.

To configure the target update method, create an `rlTD3AgentOptions` object, and set the `TargetUpdateFrequency` and `TargetSmoothFactor` parameters as shown in the following table.

Update Method | `TargetUpdateFrequency` | `TargetSmoothFactor` |
---|---|---|
Smoothing (default) | `1` | Less than `1` |
Periodic | Greater than `1` | `1` |
Periodic smoothing | Greater than `1` | Less than `1` |
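All three update methods reduce to one rule with two settings, which the following Python sketch makes explicit. The function `update_target` is hypothetical. Its arguments only mirror the `TargetUpdateFrequency` and `TargetSmoothFactor` options by name, and it assumes that steps are counted from 1.

```python
def update_target(target, source, step, target_update_frequency, target_smooth_factor):
    """Return the target parameters after training step `step` (1-based).

    target, source: lists of floats (target parameters and the latest
    actor or critic parameters).
    """
    if step % target_update_frequency != 0:
        return list(target)  # not an update step: targets stay unchanged
    tau = target_smooth_factor
    return [tau * p + (1.0 - tau) * p_t for p, p_t in zip(source, target)]


target, source = [0.0], [1.0]

# Smoothing: frequency 1, tau < 1, so the target moves a little every step.
smoothing = update_target(target, source, step=1,
                          target_update_frequency=1, target_smooth_factor=0.1)   # [0.1]

# Periodic: frequency > 1, tau = 1, so the target is overwritten every 4th step.
periodic_skip = update_target(target, source, step=3,
                              target_update_frequency=4, target_smooth_factor=1.0)  # [0.0]
periodic_copy = update_target(target, source, step=4,
                              target_update_frequency=4, target_smooth_factor=1.0)  # [1.0]

# Periodic smoothing: frequency > 1, tau < 1.
periodic_smooth = update_target(target, source, step=4,
                                target_update_frequency=4, target_smooth_factor=0.5)  # [0.5]
```

A smaller smoothing factor or a longer update period makes the targets change more slowly, which stabilizes the value function target at the cost of tracking the learned parameters less closely.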

## References

[1] Fujimoto, Scott, Herke van Hoof, and David Meger. "Addressing Function Approximation Error in Actor-Critic Methods." *ArXiv:1802.09477 [Cs, Stat]*, 22 October 2018. https://arxiv.org/abs/1802.09477.