rlEpsilonGreedyPolicy

Policy object to generate discrete epsilon-greedy actions for custom training loops

Since R2022a

Description

This object implements an epsilon-greedy policy. Given an input observation, the policy returns the action that maximizes a discrete action-space Q-value function with probability 1-Epsilon, and a random action otherwise. You can create an rlEpsilonGreedyPolicy object from an rlQValueFunction or rlVectorQValueFunction object, or extract it from an rlQAgent, rlDQNAgent, or rlSARSAAgent object. You can then train the policy object using a custom training loop or deploy it for your application. If UseEpsilonGreedyAction is set to false (0), the policy always returns the greedy action, so it is deterministic and does not explore. This object is not compatible with generatePolicyBlock and generatePolicyFunction. For more information on policies and value functions, see Create Actors, Critics, and Policy Objects.
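
The selection rule described above can be illustrated with a few lines of plain MATLAB. This is only a sketch of the general epsilon-greedy idea, not the toolbox implementation. It assumes that the random action is drawn uniformly from all available actions (including the greedy one), and the variable names qValues, epsilon, and greedyFraction are hypothetical.

```matlab
% Illustrative sketch of epsilon-greedy selection (not the toolbox code).
rng(0);                          % fix the random seed for repeatability
qValues = [0.7737; -0.3351];     % example Q-values for two discrete actions
epsilon = 0.1;                   % assumed exploration probability

numSamples = 1e4;
actions = zeros(numSamples,1);
for k = 1:numSamples
    if rand < epsilon
        % Explore: pick an action index uniformly at random (assumption).
        actions(k) = randi(numel(qValues));
    else
        % Exploit: pick the index of the largest Q-value.
        [~,actions(k)] = max(qValues);
    end
end

% Fraction of samples in which the greedy action (index 1) was selected.
greedyFraction = mean(actions == 1);
```

Under these assumptions, the greedy action should be selected in roughly 1 - epsilon + epsilon/2 of the samples, because the random branch can also land on the greedy action.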

Creation

Syntax

policy = rlEpsilonGreedyPolicy(qValueFunction)

Description

policy = rlEpsilonGreedyPolicy(qValueFunction) creates the epsilon-greedy policy object policy from the discrete action-space Q-value function qValueFunction. It also sets the QValueFunction property of policy to the input argument qValueFunction.
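
Besides the constructor syntax, the Description notes that a policy of this type can be extracted from an agent. The following is a minimal sketch of that route, assuming that Reinforcement Learning Toolbox and Deep Learning Toolbox are installed, and that getExplorationPolicy returns an rlEpsilonGreedyPolicy object for a DQN agent. The specification values used here are arbitrary.

```matlab
% Requires Reinforcement Learning Toolbox and Deep Learning Toolbox.
obsInfo = rlNumericSpec([4 1]);        % four-dimensional continuous observation
actInfo = rlFiniteSetSpec([-1 1]);     % two possible scalar actions

% Create a DQN agent with a default critic network.
agent = rlDQNAgent(obsInfo,actInfo);

% Extract the exploration policy used by the agent.
policy = getExplorationPolicy(agent);

% Display the class of the extracted policy object.
disp(class(policy))
```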

Properties

`QValueFunction` — Discrete action-space Q-value function
`rlQValueFunction` object | `rlVectorQValueFunction` object

Discrete action-space Q-value function approximator, specified as an rlQValueFunction or rlVectorQValueFunction object.

`ExplorationOptions` — Exploration options
`EpsilonGreedyExploration` object

Exploration options, specified as an EpsilonGreedyExploration object. Changing the noise state or any exploration option of an rlEpsilonGreedyPolicy object deployed through code generation is not supported.

For more information, see the EpsilonGreedyExploration property in rlQAgentOptions.
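
As a rough illustration of how the exploration options interact, the sketch below simulates an epsilon decay schedule in plain MATLAB. It assumes the multiplicative rule Epsilon = Epsilon*(1-EpsilonDecay), applied only while Epsilon is greater than EpsilonMin. The numeric values are chosen for illustration, so check the EpsilonGreedyExploration description in rlQAgentOptions for the authoritative rule and defaults.

```matlab
% Illustrative epsilon decay schedule (assumed rule and values).
epsilon      = 1;       % initial exploration probability
epsilonMin   = 0.01;    % threshold below which decay stops
epsilonDecay = 0.005;   % decay rate applied at each step

numSteps = 1000;
history  = zeros(numSteps,1);
for k = 1:numSteps
    history(k) = epsilon;
    % Decay only while epsilon is still above its minimum (assumption).
    if epsilon > epsilonMin
        epsilon = epsilon*(1 - epsilonDecay);
    end
end

% Plot the schedule to see how quickly exploration is reduced.
plot(history)
xlabel("Step")
ylabel("Epsilon")
```

With these values, epsilon falls from 1 to about 0.01 in roughly 900 steps and then stays constant.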

`Normalization` — Normalization method
`"none"` (default) | string array

Normalization method, returned as an array in which each element (one for each input channel defined in the observationInfo and actionInfo properties, in that order) is one of the following values:

"none" — Do not normalize the input.
"rescale-zero-one" — Normalize the input by rescaling it to the interval between 0 and 1. The normalized input Y is (U–LowerLimit)./(UpperLimit–LowerLimit), where U is the nonnormalized input. Note that nonnormalized input values lower than LowerLimit result in normalized values lower than 0. Similarly, nonnormalized input values higher than UpperLimit result in normalized values higher than 1. Here, UpperLimit and LowerLimit are the corresponding properties defined in the specification object of the input channel.
"rescale-symmetric" — Normalize the input by rescaling it to the interval between –1 and 1. The normalized input Y is 2*(U–LowerLimit)./(UpperLimit–LowerLimit) – 1, where U is the nonnormalized input. Note that nonnormalized input values lower than LowerLimit result in normalized values lower than –1. Similarly, nonnormalized input values higher than UpperLimit result in normalized values higher than 1. Here, UpperLimit and LowerLimit are the corresponding properties defined in the specification object of the input channel.

Note

When you specify the Normalization property of rlAgentInitializationOptions, normalization is applied only to the approximator input channels corresponding to rlNumericSpec specification objects in which both the UpperLimit and LowerLimit properties are defined. After you create the agent, you can use the setNormalizer function to assign normalizers that use any normalization method. For more information on normalizer objects, see rlNormalizer.

Example: myActor.Normalization = "rescale-symmetric" sets to "rescale-symmetric" the Normalization property of the function approximator myActor.
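
The two rescaling methods listed for the Normalization property can be checked by hand with plain MATLAB arithmetic. The sketch below uses hypothetical limits (lowerLimit = -2, upperLimit = 6) and includes one input outside the limits to show that the result is not clipped.

```matlab
% Hypothetical channel limits and sample inputs (illustration only).
lowerLimit = -2;
upperLimit = 6;
u = [-2 0 2 6 8];    % the last value lies above upperLimit

% "rescale-zero-one": map [lowerLimit, upperLimit] onto [0, 1].
yZeroOne = (u - lowerLimit)./(upperLimit - lowerLimit);

% "rescale-symmetric": map [lowerLimit, upperLimit] onto [-1, 1].
ySymmetric = 2*(u - lowerLimit)./(upperLimit - lowerLimit) - 1;

disp(yZeroOne)
disp(ySymmetric)
```

For the input 8, which exceeds the upper limit, the normalized values are 1.25 and 1.5 respectively, which matches the statement that out-of-range inputs map outside the target interval.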

`UseEpsilonGreedyAction` — Option to enable epsilon-greedy actions
`true` (default) | `false`

Option to enable epsilon-greedy actions, specified as a logical value: true (default), which enables epsilon-greedy actions and therefore exploration, or false, which disables them. When epsilon-greedy actions are disabled, the policy is deterministic and does not explore.

Example: false
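
The following sketch shows one way to confirm the deterministic behavior after disabling epsilon-greedy actions. It assumes Reinforcement Learning Toolbox and Deep Learning Toolbox are available. The small network and the specification values are arbitrary choices made for this illustration.

```matlab
% Requires Reinforcement Learning Toolbox and Deep Learning Toolbox.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 1]);

% Minimal vector Q-value function: one output per discrete action.
net  = dlnetwork([featureInputLayer(4); fullyConnectedLayer(2)]);
qFcn = rlVectorQValueFunction(net,obsInfo,actInfo);

% Create the policy and disable exploration.
policy = rlEpsilonGreedyPolicy(qFcn);
policy.UseEpsilonGreedyAction = false;

% Query the policy twice with the same observation.
obs = {rand(4,1)};
a1 = getAction(policy,obs);
a2 = getAction(policy,obs);

% With exploration disabled, both calls should return the same action.
sameAction = isequal(a1,a2);
```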

`EnableEpsilonDecay` — Option to enable epsilon decay
`true` (default) | `false`

Option to enable epsilon decay, specified as a logical value: either true (default, enabling epsilon decay) or false (disabling epsilon decay).

Example: false

`ObservationInfo` — Observation specifications
`rlFiniteSetSpec` object | `rlNumericSpec` object | array

Observation specifications, returned as an rlFiniteSetSpec or rlNumericSpec object or an array containing a mix of such objects. Each element in the array defines the properties of an environment observation channel, such as its dimensions, data type, and name.

This policy property is read-only.

`ActionInfo` — Action specifications
`rlFiniteSetSpec` object

Action specifications, returned as an rlFiniteSetSpec object. This object defines the properties of the environment action channel, such as its dimensions, data type, and name.

This policy property is read-only.

Note

For this policy object, only one action channel is allowed.

`SampleTime` — Sample time of policy
`1` (default) | positive scalar | `-1`

Sample time of the policy, specified as a positive scalar or as -1.

Within a MATLAB® environment, the policy is executed every time you call it within your custom training loop, so SampleTime does not affect the timing of the policy execution.

Within a Simulink® environment, the Policy block that uses the policy object executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its input signals. Set SampleTime to -1 when the block is a child of an event-driven subsystem.

Note

Set SampleTime to a positive scalar when the block is not a child of an event-driven subsystem. Doing so ensures that the block executes at appropriate intervals when input signal sample times change due to model variations.

If SampleTime is a positive scalar, this value is also the time interval between consecutive elements in the output experience returned by sim, regardless of the type of environment.

If SampleTime is -1, for Simulink environments, the time interval between consecutive elements in the returned output experience reflects the timing of the events that trigger the Policy block execution, while for MATLAB environments, this time interval is considered equal to 1.

Example: mypolicy.SampleTime = -1 sets the sample time of the policy object mypolicy to -1.

Object Functions

`getAction`	Obtain action from agent, actor, or policy object given environment observations
`getLearnableParameters`	Obtain learnable parameter values from agent, function approximator, or policy object
`reset`	Reset environment, agent, experience buffer, or policy object
`setLearnableParameters`	Set learnable parameter values of agent, function approximator, or policy object
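
As a hedged usage sketch of the object functions listed above, the following code copies the learnable parameters from one policy to another with the same architecture, resets the result, and checks that both policies then select the same greedy action. It assumes Reinforcement Learning Toolbox and Deep Learning Toolbox are available, and that reset returns the reset policy as its output. The names policyA and policyB are hypothetical.

```matlab
% Requires Reinforcement Learning Toolbox and Deep Learning Toolbox.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 1]);
layers  = [featureInputLayer(4); fullyConnectedLayer(2)];

% Two policies with the same architecture but independently initialized weights.
policyA = rlEpsilonGreedyPolicy( ...
    rlVectorQValueFunction(dlnetwork(layers),obsInfo,actInfo));
policyB = rlEpsilonGreedyPolicy( ...
    rlVectorQValueFunction(dlnetwork(layers),obsInfo,actInfo));

% Copy the learnable parameters of policyA into policyB.
params  = getLearnableParameters(policyA);
policyB = setLearnableParameters(policyB,params);

% Reset the policy, for example before starting a new episode.
policyB = reset(policyB);

% Disable exploration so that both policies act greedily, then compare.
policyA.UseEpsilonGreedyAction = false;
policyB.UseEpsilonGreedyAction = false;
obs  = {rand(4,1)};
same = isequal(getAction(policyA,obs),getAction(policyB,obs));
```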

Examples

Create Epsilon-Greedy Policy Object from Vector Q-Value Function

Create observation and action specification objects. For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing four doubles, and the action space as a finite set consisting of two possible row vectors, [1 0] and [0 1].

obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec({[1 0],[0 1]});

Alternatively, use the getObservationInfo and getActionInfo functions to extract the specification objects from an environment.

Create a vector Q-value function approximator to use as critic. A vector Q-value function must accept an observation as input and return a single vector with as many elements as the number of possible discrete actions.

To model the parameterized vector Q-value function within the critic, use a neural network. Define a single path from the network input to its output as an array of layer objects.

layers = [ 
    featureInputLayer(prod(obsInfo.Dimension))
    fullyConnectedLayer(10)
    reluLayer
    fullyConnectedLayer(numel(actInfo.Elements)) 
    ];

Convert the network to a dlnetwork object and display a summary, which includes the number of learnable parameters.

model = dlnetwork(layers);
summary(model)

   Initialized: true

   Number of learnables: 72

   Inputs:
      1   'input'   4 features

Create a vector Q-value function using model, and the observation and action specifications.

qValueFcn = rlVectorQValueFunction(model,obsInfo,actInfo)

qValueFcn = 
  rlVectorQValueFunction with properties:

    ObservationInfo: [1×1 rl.util.rlNumericSpec]
         ActionInfo: [1×1 rl.util.rlFiniteSetSpec]
      Normalization: "none"
          UseDevice: "cpu"
         Learnables: {4×1 cell}
              State: {0×1 cell}

Check the critic with a batch of 10 random observation inputs.

robs = rand([obsInfo.Dimension 10]);
v = getValue(qValueFcn,{robs});

Display the seventh element in the batch.

v(:,7)

ans = 2×1 single column vector

    0.7737
   -0.3351

Create a policy object from qValueFcn.

policy = rlEpsilonGreedyPolicy(qValueFcn)

policy = 
  rlEpsilonGreedyPolicy with properties:

            QValueFunction: [1×1 rl.function.rlVectorQValueFunction]
        ExplorationOptions: [1×1 rl.option.EpsilonGreedyExploration]
             Normalization: "none"
    UseEpsilonGreedyAction: 1
        EnableEpsilonDecay: 1
           ObservationInfo: [1×1 rl.util.rlNumericSpec]
                ActionInfo: [1×1 rl.util.rlFiniteSetSpec]
                SampleTime: -1

Check the policy with a batch of 10 random observation inputs.

robs = rand([obsInfo.Dimension 10]);
act = getAction(policy,{robs});

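Display one entry of the returned action array. Because act{1}(7) uses linear indexing, it selects a single scalar entry rather than the whole action for the seventh observation.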

act{1}(7)

ans = 
0
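
The difference between linear indexing and indexing along the batch dimension can be seen with a stand-in array in plain MATLAB. This sketch assumes that a batch of 10 actions, each a 1-by-2 row vector, is stored as a 1-by-2-by-10 array with the batch along the last dimension. The array contents are made up for illustration and are not policy outputs.

```matlab
% Stand-in for a batched action array: 10 actions, each a 1-by-2 row vector.
batch = zeros(1,2,10);
for k = 1:10
    if mod(k,2) == 0
        batch(:,:,k) = [1 0];   % even batch entries
    else
        batch(:,:,k) = [0 1];   % odd batch entries
    end
end

% Linear index 7 addresses element (1,1,4): a single scalar.
scalarEntry = batch(7);

% Indexing the last dimension returns the full seventh action.
seventhAction = batch(:,:,7);
```

Here scalarEntry is the first component of the fourth action, whereas seventhAction is the complete 1-by-2 action for the seventh batch element.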

You can now train the policy with a custom training loop and then deploy it to your application.

Version History

Introduced in R2022a
