Define Reward and Observation Signals in Custom Environments

To guide the learning process, reinforcement learning uses a scalar reward signal generated from the environment. This signal measures the performance of the agent with respect to the task goals. In other words, for a given observation (state), the reward measures the immediate effectiveness of taking a particular action. During training, an agent updates its policy based on the rewards received for different state-action combinations. For an introduction to different types of agents and how they use the reward signal during training, see Reinforcement Learning Agents.

In general, you provide a positive reward to encourage certain agent actions and a negative reward (penalty) to discourage other actions. A well-designed reward signal guides the agent to maximize the expectation of the (possibly discounted) cumulative long-term reward. What constitutes a well-designed reward depends on your application and the agent goals.

For example, when an agent must perform a task for as long as possible, a common strategy is to provide a small positive reward for each time step that the agent successfully performs the task and a large penalty when the agent fails. This approach encourages longer training episodes while heavily discouraging actions that lead to episodes in which the agent fails. For an example that uses this approach, see Train DQN Agent to Balance Cart-Pole System.
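The per-step reward and failure penalty strategy described above can be sketched in a few lines. This is a toolbox-independent Python illustration, not the reward used in the referenced example; the function names and the values `1.0` and `-5.0` are hypothetical choices made only for the demonstration.

```python
def survival_reward(failed, step_reward=1.0, failure_penalty=-5.0):
    """Reward for one time step of a 'keep going as long as possible' task.

    step_reward and failure_penalty are hypothetical values.
    """
    return failure_penalty if failed else step_reward


def episode_return(failure_flags):
    """Sum the (undiscounted) rewards until the first failure ends the episode."""
    total = 0.0
    for failed in failure_flags:
        total += survival_reward(failed)
        if failed:
            break  # the episode terminates when the agent fails
    return total


long_episode = [False] * 50 + [True]   # 50 successful steps, then failure
short_episode = [False] * 5 + [True]   # 5 successful steps, then failure

print(episode_return(long_episode))    # 45.0
print(episode_return(short_episode))   # 0.0
```

The longer episode accumulates a larger return, which is the incentive the strategy relies on.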

If your reward function incorporates multiple signals, such as position, velocity, and control effort, you must consider the relative sizes of the signals and scale their contributions to the reward signal accordingly.
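One possible way to balance signals of different magnitudes is to normalize each signal by a typical range before weighting it. The following Python sketch assumes hypothetical scales and weights; suitable values depend entirely on your application.

```python
def scaled_reward(position_error, velocity, control_effort,
                  scales=(2.0, 10.0, 50.0), weights=(1.0, 0.5, 0.1)):
    """Combine three signals into one reward after normalizing each signal.

    scales  : hypothetical typical magnitude of each signal
    weights : hypothetical relative importance of each normalized signal
    """
    signals = (position_error, velocity, control_effort)
    reward = 0.0
    for signal, scale, weight in zip(signals, scales, weights):
        normalized = signal / scale          # roughly within [-1, 1]
        reward -= weight * normalized ** 2   # quadratic penalty per signal
    return reward


# Each signal at its typical magnitude contributes exactly its weight.
print(scaled_reward(2.0, 10.0, 50.0))
```

Without the normalization step, the control effort (typical magnitude 50 in this sketch) would dominate the reward regardless of the weights chosen.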

You can specify either continuous or discrete reward signals. In either case, the reward signal must provide rich information when the action and observation signals change.

For control system applications in which cost functions and constraints are already available, you can also generate reward functions from such specifications.

Continuous Rewards

A continuous reward function varies continuously with changes in the environment observations and actions. In general, continuous reward signals improve convergence during training and can lead to simpler network structures.

An example of a continuous reward is the quadratic regulator (QR) cost function, where the cumulative long-term reward can be expressed as:

$J_{i} = - \left( s_{\tau}^{T} Q_{\tau} s_{\tau} + \sum_{j = i}^{\tau} \left( s_{j}^{T} Q_{j} s_{j} + a_{j}^{T} R_{j} a_{j} + 2 s_{j}^{T} N_{j} a_{j} \right) \right)$

Here, Q_τ, Q, R, and N are the weight matrices. Q_τ is the terminal weight matrix, applied only at the end of the episode. Also, s is the observation vector, a is the action vector, and τ is the terminal iteration of the episode. The (instantaneous) reward for this cost function is

$r_{i} = - \left( s_{i}^{T} Q_{i} s_{i} + a_{i}^{T} R_{i} a_{i} + 2 s_{i}^{T} N_{i} a_{i} \right)$

This QR reward structure encourages an agent to drive s to zero with minimal action effort. A QR-based reward structure is a good reward to choose for regulation or stationary point problems, such as pendulum swing-up or regulating the position of the double integrator. For training examples that use a QR reward, see Train DQN Agent to Swing Up and Balance Pendulum and Compare DDPG Agent to LQR Controller.
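As an illustration of the instantaneous QR reward, the following Python sketch evaluates the quadratic terms for a single time step using plain nested lists. The result is negated so that it is consistent with the cumulative reward J_i, which is the negative of the accumulated cost. The helper names and the numeric weights are hypothetical.

```python
def quadratic_form(x, M, y):
    """Return x^T M y for vectors x, y and a matrix M given as nested lists."""
    return sum(x[i] * M[i][j] * y[j]
               for i in range(len(x))
               for j in range(len(y)))


def qr_reward(s, a, Q, R, N):
    """Instantaneous QR reward: the negative of the quadratic cost."""
    cost = (quadratic_form(s, Q, s)
            + quadratic_form(a, R, a)
            + 2.0 * quadratic_form(s, N, a))
    return -cost


# Hypothetical two-state, one-action example.
Q = [[1.0, 0.0],
     [0.0, 0.1]]      # state weights
R = [[0.01]]          # action weight
N = [[0.0],
     [0.0]]           # no state-action cross term

print(qr_reward([1.0, 0.5], [2.0], Q, R, N))
print(qr_reward([0.0, 0.0], [0.0], Q, R, N))
```

The reward is largest (zero) when both the state and the action are zero, which is why this structure suits regulation problems.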

Smooth continuous rewards, such as the QR reward, are good for fine-tuning parameters and can provide policies similar to optimal controllers (LQR/MPC).

Discrete Rewards

A discrete reward function varies discontinuously with changes in the environment observations or actions. These types of reward signals can make convergence slower and can require more complex network structures. Discrete rewards are usually implemented as events that occur in the environment. For example, an agent might receive a positive reward when it exceeds some target value or a penalty when it violates some performance constraint.

While discrete rewards can slow down convergence, they can also guide the agent toward better reward regions in the state space of the environment. For example, a region-based reward, such as a fixed reward when the agent is near a target location, can emulate final-state constraints. Also, a region-based penalty can encourage an agent to avoid certain areas of the state space.
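The region-based reward and penalty described above might look like the following Python sketch. The target location, radius, limits, and reward magnitudes are hypothetical values chosen for illustration.

```python
import math


def region_reward(x, y, target=(0.0, 0.0), radius=0.5, bonus=10.0):
    """Fixed reward when the agent is within `radius` of the target location."""
    distance = math.hypot(x - target[0], y - target[1])
    return bonus if distance <= radius else 0.0


def region_penalty(x, y, x_limits=(-20.0, 20.0), y_limits=(-20.0, 20.0),
                   penalty=-100.0):
    """Fixed penalty when the agent leaves the allowed rectangular region."""
    inside = (x_limits[0] < x < x_limits[1]) and (y_limits[0] < y < y_limits[1])
    return 0.0 if inside else penalty


print(region_reward(0.1, 0.2))    # 10.0 (near the target)
print(region_reward(3.0, 4.0))    # 0.0 (far from the target)
print(region_penalty(25.0, 0.0))  # -100.0 (outside the allowed region)
```

Both functions jump between two values at a region boundary, which is what makes them discrete rewards.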

Mixed Rewards

In many cases, providing a mixed reward signal that has a combination of continuous and discrete reward components is beneficial. The discrete reward signal can be used to drive the system away from bad states, and the continuous reward signal can improve convergence by providing a smooth reward near target states. For example, in Train DDPG Agent to Control Sliding Robot, the reward function has three components: r₁, r₂, and r₃.

$\begin{array}{l} r_{1} = 10 \left( (x_{t}^{2} + y_{t}^{2} + \theta_{t}^{2}) < 0.5 \right) \\ r_{2} = - 100 \left( |x_{t}| \geq 20 \;\; || \;\; |y_{t}| \geq 20 \right) \\ r_{3} = - \left( 0.2 {(R_{t - 1} + L_{t - 1})}^{2} + 0.3 {(R_{t - 1} - L_{t - 1})}^{2} + 0.03 x_{t}^{2} + 0.03 y_{t}^{2} + 0.02 \theta_{t}^{2} \right) \\ r = r_{1} + r_{2} + r_{3} \end{array}$

Here:

r₁ is a region-based discrete reward (a fixed value of 10) that applies only near the target location of the robot.
r₂ is a discrete signal that provides a large penalty when the robot moves far from the target location.
r₃ is a continuous QR penalty that applies for all robot states.
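The three components can be evaluated directly from the formulas above. The following Python sketch is an illustration only, not code from the referenced example. It assumes that R and L denote the control efforts of two actuators at the previous time step, which the text does not state explicitly, and the parameter names are hypothetical.

```python
def sliding_robot_reward(x, y, theta, R_prev, L_prev):
    """Mixed reward r = r1 + r2 + r3 built from the formulas in the text.

    R_prev and L_prev are assumed to be the previous-step actuator efforts.
    """
    # r1: fixed bonus inside a small region around the target
    r1 = 10.0 * float(x**2 + y**2 + theta**2 < 0.5)
    # r2: large fixed penalty when the robot moves far from the target
    r2 = -100.0 * float(abs(x) >= 20.0 or abs(y) >= 20.0)
    # r3: continuous quadratic penalty on effort and on distance from the target
    r3 = -(0.2 * (R_prev + L_prev)**2
           + 0.3 * (R_prev - L_prev)**2
           + 0.03 * x**2
           + 0.03 * y**2
           + 0.02 * theta**2)
    return r1 + r2 + r3


print(sliding_robot_reward(0.0, 0.0, 0.0, 0.0, 0.0))   # at the target, no effort
print(sliding_robot_reward(25.0, 0.0, 0.0, 0.0, 0.0))  # far outside the limits
```

At the target with zero effort only the bonus applies. Far from the target, the fixed penalty and the quadratic penalty add up.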

Reward Generation from Control Specifications

For applications where a working control system already exists, specifications such as cost functions or constraints might already be available. In these cases, you can use generateRewardFunction to automatically generate a reward function, coded in MATLAB®, that can be used as a starting point for reward design. This function allows you to generate reward functions from:

Cost and constraint specifications defined in an mpc or nlmpc controller object (Model Predictive Control Toolbox). This feature requires Model Predictive Control Toolbox™ software.
Performance constraints defined in Simulink® Design Optimization™ model verification blocks.

In both cases, when constraints are violated, a negative reward is calculated using a penalty function such as exteriorPenalty (default), hyperbolicPenalty, or barrierPenalty.

Starting from the generated reward function, you can tune the cost and penalty weights, use a different penalty function, and then use the resulting reward function within an environment to train an agent.
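To illustrate the general idea of a penalty term with a tunable weight, the following Python sketch implements a generic quadratic exterior penalty. It is not the toolbox exteriorPenalty function and makes no claim about that function's signature or exact behavior; all names and values here are hypothetical.

```python
def quadratic_exterior_penalty(value, lower, upper):
    """Zero inside [lower, upper]; grows quadratically with the violation outside."""
    violation = max(lower - value, 0.0, value - upper)
    return violation ** 2


def constrained_reward(base_reward, value, lower, upper, weight=1.0):
    """Subtract a weighted constraint penalty from a base reward.

    `weight` is the tunable penalty weight.
    """
    return base_reward - weight * quadratic_exterior_penalty(value, lower, upper)


print(quadratic_exterior_penalty(0.5, -1.0, 1.0))            # 0.0 (constraint satisfied)
print(quadratic_exterior_penalty(3.0, -1.0, 1.0))            # 4.0 (violation of 2)
print(constrained_reward(1.0, 3.0, -1.0, 1.0, weight=0.5))   # -1.0
```

Increasing `weight` makes constraint violations more costly relative to the base reward, which is the kind of tuning described above.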

Observation Signals

Choosing the right observations for the agent is crucial for effective training and performance in reinforcement learning. The observations should provide sufficient information about the current environment state (independently of past environment states) for the agent to make informed decisions.

For example, for control system applications, while the observations depend on your application, the integrals (and sometimes derivatives) of error signals are often useful observations. Also, for reference-tracking applications, having a time-varying reference signal as an observation is helpful.
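A minimal Python sketch of such an observation vector is shown below. It accumulates the integral of the tracking error with a simple rectangular rule and includes the reference signal itself. The class name, the sample time, and the integration rule are assumptions made for illustration.

```python
class TrackingObservation:
    """Build [error, integral of error, reference] observations step by step."""

    def __init__(self, sample_time):
        self.sample_time = sample_time
        self.integral = 0.0

    def update(self, reference, measurement):
        error = reference - measurement
        # Rectangular (forward Euler) approximation of the error integral
        self.integral += error * self.sample_time
        return [error, self.integral, reference]


builder = TrackingObservation(sample_time=0.5)
print(builder.update(reference=1.0, measurement=0.0))  # [1.0, 0.5, 1.0]
print(builder.update(reference=1.0, measurement=0.5))  # [0.5, 0.75, 1.0]
```

The integral entry lets a static policy react to a persistent steady-state error, much like the integral term of a PI controller.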

One important consideration is that a reinforcement learning environment is normally assumed to be strictly causal from the current action to the current observation. That is, it is assumed that the current observation does not depend on the current action (while the next state generally does). In other words, there must be no direct feedthrough between the current action and the current observation.
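The following Python sketch shows one way to structure an environment so that this assumption holds, using a discretized double integrator. The observation is computed from the state only, and the action affects only the next state. The class, its Euler discretization, and the numeric values are hypothetical.

```python
class DoubleIntegratorEnv:
    """Toy double integrator with no direct feedthrough from action to observation."""

    def __init__(self, dt=0.1):
        self.dt = dt
        self.position = 0.0
        self.velocity = 0.0

    def observation(self):
        # Depends on the state only, never on the action being applied
        return [self.position, self.velocity]

    def step(self, force):
        # The action changes the NEXT state (forward Euler update)
        self.position += self.dt * self.velocity
        self.velocity += self.dt * force
        return self.observation()


env = DoubleIntegratorEnv(dt=0.5)
current_obs = env.observation()   # available before any action is chosen
next_obs = env.step(force=2.0)    # the effect of the action appears one step later
print(current_obs, next_obs)      # [0.0, 0.0] [0.0, 1.0]
```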

If the environment state is low-dimensional, and all the states are available for measurement, it is best practice to include all the available environment states in the observation vector. This ensures that the agent can capture all the necessary information about the environment. Failure to do so can lead to situations in which different environment states result in the same observation. For such states, the agent policy (assuming it is a static function of the observation) can only return the same action. Such a policy is typically unsuccessful, because a successful policy normally needs to react to different environment states by returning different actions.

For example, an image observation of a swinging pendulum has position information but does not have enough information, by itself, to determine the pendulum velocity. In this case, a static policy that cannot sense the velocity would not be able to stabilize the pendulum. But if the velocity can be measured or estimated, adding it as an additional entry in the observation vector will provide a static policy with enough information to stabilize the pendulum.

When not all states are available as observation signals (for example, because it would be unrealistic to measure them), a possible workaround is to use an estimator (as a part of the environment) that estimates the values of the unmeasured states and makes these estimates available to the agent as observations. Alternatively, you can use a recurrent network, such as an LSTM, in your policy. Doing so results in a policy that has states and that might therefore be able to use its state as an internal representation of the environment state. Such a policy can consequently return different actions (based on different values of its internal state) even when there is not enough information to reconstruct the correct environment state from the current observation.
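As a simple instance of the estimator workaround, the following Python sketch estimates an unmeasured velocity from successive angle measurements with a backward finite difference and appends the estimate to the observation. The class name is hypothetical, and the sketch assumes an unwrapped angle and noise-free measurements. A practical estimator would usually add filtering.

```python
class FiniteDifferenceVelocityEstimator:
    """Augment an angle measurement with a finite-difference velocity estimate."""

    def __init__(self, sample_time):
        self.sample_time = sample_time
        self.previous_angle = None

    def observe(self, angle):
        if self.previous_angle is None:
            velocity = 0.0  # no history yet, so assume the system starts at rest
        else:
            velocity = (angle - self.previous_angle) / self.sample_time
        self.previous_angle = angle
        return [angle, velocity]


estimator = FiniteDifferenceVelocityEstimator(sample_time=0.5)
print(estimator.observe(0.0))    # [0.0, 0.0]
print(estimator.observe(0.25))   # [0.25, 0.5]
```

Because the estimator runs inside the environment, the agent still sees an ordinary observation vector and can use a static policy.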

When the state or action space is large, it becomes challenging for the agent to explore and learn effectively. The curse of dimensionality can make it difficult to find an optimal policy within a reasonable time frame. Therefore, if the observation space is high-dimensional (for example, if it contains images), you might need to preprocess the observations. When images are used as observations, techniques such as resizing, cropping, or converting to grayscale can help reduce dimensionality and focus on relevant features.
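The following Python sketch illustrates two of these preprocessing steps on an image stored as nested lists: grayscale conversion and resizing by block averaging. The function names are hypothetical. The grayscale conversion uses a plain channel mean rather than a perceptual weighting, and the downsampling assumes that the image dimensions are divisible by the factor.

```python
def to_grayscale(image):
    """Convert rows of (r, g, b) tuples to rows of floats using the channel mean."""
    return [[(r + g + b) / 3.0 for (r, g, b) in row] for row in image]


def downsample(gray, factor):
    """Average non-overlapping factor-by-factor blocks of a grayscale image."""
    height, width = len(gray), len(gray[0])
    output = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [gray[i][j]
                     for i in range(top, top + factor)
                     for j in range(left, left + factor)]
            row.append(sum(block) / len(block))
        output.append(row)
    return output


# A 4-by-4 RGB image (48 values) becomes a 2-by-2 grayscale image (4 values).
image = [[(30, 60, 90)] * 4 for _ in range(4)]
small = downsample(to_grayscale(image), factor=2)
print(small)   # [[60.0, 60.0], [60.0, 60.0]]
```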

Depending on the complexity of the task, you may need to perform feature engineering to extract relevant information from the observations. This can involve transforming or combining the raw observations to create more informative features. In general, it is best practice to leverage any domain knowledge or insights you have about the task to guide the selection of observations. Understanding the underlying environment dynamics and relevant factors can help you identify the most informative observations for the agent.

The selection of observation (and action) signals has an important effect on agent training and convergence. For more information, see the last section of Train Reinforcement Learning Agents.