Create Custom Environment Using Step and Reset Functions
This example shows how to create a custom environment by writing your own MATLAB® step and reset functions.
Using the rlFunctionEnv function, you can create a MATLAB reinforcement learning environment from an observation specification, an action specification, and step and reset functions that you supply. You can then train a reinforcement learning agent in this environment. For this example, the necessary step and reset functions are already defined.
Creating an environment using custom functions is especially useful for simpler environments that do not need many helper functions and have no special visualization requirements. For more complex environments, you can create an environment object using a template class. For more information, see Create Custom Environment from Class Template.
For more information on creating reinforcement learning environments, see Reinforcement Learning Environments and Create Custom Simulink Environments.
Fix Random Number Stream for Reproducibility
The example code might involve computation of random numbers at various stages. Fixing the random number stream at the beginning of various sections in the example code preserves the random number sequence in the section every time you run it, and increases the likelihood of reproducing the results. For more information, see Results Reproducibility.
Fix the random number stream with seed 0 and random number algorithm Mersenne Twister. For more information on controlling the seed used for random number generation, see rng.
previousRngState = rng(0,"twister");
The output previousRngState is a structure that contains information about the previous state of the stream. You restore the state at the end of the example.
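As a quick illustration of why fixing the stream helps, the following minimal sketch draws the same three numbers twice by resetting the seed in between. The variable names are chosen here for illustration and are not part of the example files. The sketch saves and restores the generator settings so that it does not disturb the rest of the example.

```matlab
% Save the current generator settings so they can be restored afterwards.
StateBefore = rng;

% Draw three numbers, reset the stream to the same seed, and draw again.
rng(0,"twister");
FirstDraw = rand(1,3);
rng(0,"twister");
SecondDraw = rand(1,3);

% Both draws come from the same point in the same stream.
SameSequence = isequal(FirstDraw,SecondDraw);

% Leave the random number stream as it was found.
rng(StateBefore);
```

SameSequence is true because both calls to rand start from an identically seeded stream.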
Discrete Action Space Cart-Pole MATLAB Environment
The cart-pole environment is a pole attached to an unactuated joint on a cart, which moves along a frictionless track. The training goal is to make the pendulum stand upright.

For this environment:
- The balanced, upright pendulum position is zero radians and the downward hanging pendulum position is pi radians.
- The pendulum starts upright with an initial angle that is between –0.05 radians and 0.05 radians. 
- The force action signal from the agent to the environment is either –10 N or 10 N. 
- The observations from the environment are the cart position, cart velocity, pendulum angle, and pendulum angular velocity. 
- The episode terminates if the pole is more than 12 degrees from vertical or the cart moves more than 2.4 m from the original position. 
- A reward of +1 is provided for every time step that the pole remains upright. A penalty of –10 is applied when the pendulum falls. 
For more information on this model, see Load Predefined Control System Environments.
For this example, instead of loading the predefined cart-pole environment using rlPredefinedEnv, you implement a basic version of this environment by supplying your step and reset functions.
Observation and Action Specifications
The observations from the environment are the cart position, cart velocity, pendulum angle, and pendulum angle derivative.
ObsInfo = rlNumericSpec([4 1]);
ObsInfo.Name = "CartPole States";
ObsInfo.Description = 'x, dx, theta, dtheta';
The environment has a discrete action space where the agent can apply one of two possible force values to the cart: -10 or 10 N.
ActInfo = rlFiniteSetSpec([-10 10]);
ActInfo.Name = "CartPole Action";
For more information on specifying environment actions and observations, see rlNumericSpec and rlFiniteSetSpec.
Define Environment Reset and Step Functions
To define a custom environment, first specify the custom step and reset functions. These functions must be in your current working folder or on the MATLAB path.
The reset function sets the initial state of the environment. This function must have the following signature.
[InitialObservation,Info] = myResetFunction()
The first output argument is the initial observation. The second output argument can be any useful environment information that you want to pass from one step to the next, such as the environment state or a structure containing the state and parameters.
At the beginning of the training (or simulation) episode, train (or sim) calls your reset function and uses its second output argument to initialize the Info property of your custom environment. During a training (or simulation) step, train (or sim) supplies the current value of Info as the second input argument of StepFcn, and then uses the fourth output argument returned by StepFcn to update the value of Info.
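The way Info is threaded from the reset function through successive calls to the step function can be sketched with a hand-written loop. This is only a simplified illustration: train and sim also query an agent for each action and handle logging, which the sketch replaces with a fixed action of 1. The toy reset and step functions below are hypothetical and are not the cart-pole functions used in this example.

```matlab
% Hypothetical toy reset function: returns [InitialObservation,Info].
toyReset = @() deal(0,0);

% Hypothetical toy step function: returns
% [NextObservation,Reward,IsDone,UpdatedInfo].
toyStep = @(Action,Info) deal( ...
    Info + Action, ...            % NextObservation
    1, ...                        % Reward
    abs(Info + Action) >= 3, ...  % IsDone
    Info + Action);               % UpdatedInfo

% The reset function initializes Info at the start of the episode.
[Obs,Info] = toyReset();

TotalReward = 0;
NumSteps = 0;
IsDone = false;
while ~IsDone
    % The current Info is the second input of the step function, and the
    % fourth output of the step function becomes the new Info.
    [Obs,Reward,IsDone,Info] = toyStep(1,Info);
    TotalReward = TotalReward + Reward;
    NumSteps = NumSteps + 1;
end
```

In this toy system, the state grows by 1 at each step, so the episode ends after three steps.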
For this example, use the second argument to store the initial states of the cart-pole environment: the position and velocity of the cart, the pendulum angle (clockwise-positive), and the pendulum angle derivative. The reset function sets the pendulum angle to a random value each time the environment is reset.
For this example, use the custom reset function defined in myResetFunction.m.
type myResetFunction.m
function [InitialObservation, InitialState] = myResetFunction()
% Reset function to place custom cart-pole environment into a random
% initial state.
% Theta (randomize)
T0 = 2 * 0.05 * rand() - 0.05;
% Thetadot
Td0 = 0;
% X
X0 = 0;
% Xdot
Xd0 = 0;
% Return initial environment state variables as logged signals.
InitialState = [X0;Xd0;T0;Td0];
InitialObservation = InitialState;
end
The step function specifies how the environment advances to the next state based on a given action. This function must have the following signature.
[NextObservation,Reward,IsDone,UpdatedInfo] = myStepFunction(Action,Info)
To calculate the new state, the step function applies the dynamic equation to the current state stored in Info. The function then returns the updated state in UpdatedInfo. At the next training (or simulation) step, train (or sim) takes the fourth output argument obtained during the previous step, UpdatedInfo, and supplies it to the step function as the second input argument, Info. Note that Action affects only NextObservation and the associated Reward, but does not affect the current observation. In other words, there is no direct feedthrough between action and observation.
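For reference, the dynamic equation that the step function in this example applies can be written as follows. The equations are transcribed from the code in myStepFunction.m. The symbols are chosen here for illustration: F is the applied force, m_c and m_p are the cart and pole masses, l is half the pole length, g is the acceleration due to gravity, T_s is the sample time, and s is the state vector.

```latex
\begin{aligned}
a &= \frac{F + m_p\, l\, \dot{\theta}^{2} \sin\theta}{m_c + m_p}, \\
\ddot{\theta} &= \frac{g \sin\theta - a \cos\theta}
  {l \left( \tfrac{4}{3} - \dfrac{m_p \cos^{2}\theta}{m_c + m_p} \right)}, \\
\ddot{x} &= a - \frac{m_p\, l\, \ddot{\theta} \cos\theta}{m_c + m_p}, \\
s_{k+1} &= s_k + T_s
  \begin{bmatrix} \dot{x} & \ddot{x} & \dot{\theta} & \ddot{\theta} \end{bmatrix}^{\top},
\qquad
s = \begin{bmatrix} x & \dot{x} & \theta & \dot{\theta} \end{bmatrix}^{\top}.
\end{aligned}
```

The last line is the forward Euler update, and the intermediate quantity a corresponds to the variable temp in the code.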
For this example, use the custom step function defined in myStepFunction.m. For implementation simplicity, this function redefines physical constants, such as the cart mass, every time it executes. An alternative is to define the physical constants in the reset function and make Info a structure that contains both the environment state and the parameters. This alternative implementation lets you change parameters during simulation or training if needed.
type myStepFunction.m
function [NextObs,Reward,IsDone,NextState] = myStepFunction(Action,State)
% Custom step function to construct cart-pole environment for the function
% name case.
%
% This function applies the given action to the environment and evaluates
% the system dynamics for one simulation step.
% Define the environment constants.
% Acceleration due to gravity in m/s^2
Gravity = 9.8;
% Mass of the cart
CartMass = 1.0;
% Mass of the pole
PoleMass = 0.1;
% Half the length of the pole
HalfPoleLength = 0.5;
% Max force the input can apply
MaxForce = 10;
% Sample time
Ts = 0.02;
% Pole angle at which to fail the episode
AngleThreshold = 12 * pi/180;
% Cart distance at which to fail the episode
DisplacementThreshold = 2.4;
% Reward each time step the cart-pole is balanced
RewardForNotFalling = 1;
% Penalty when the cart-pole fails to balance
PenaltyForFalling = -10;
% Check if the given action is valid.
if ~ismember(Action,[-MaxForce MaxForce])
    error('Action must be %g for going left and %g for going right.',...
        -MaxForce,MaxForce);
end
Force = Action;
% Unpack the state vector from the logged signals.
XDot = State(2);
Theta = State(3);
ThetaDot = State(4);
% Cache to avoid recomputation.
CosTheta = cos(Theta);
SinTheta = sin(Theta);
SystemMass = CartMass + PoleMass;
temp = (Force + PoleMass*HalfPoleLength*ThetaDot*ThetaDot*SinTheta)/SystemMass;
% Apply motion equations.
ThetaDotDot = (Gravity*SinTheta - CosTheta*temp) / ...
    (HalfPoleLength*(4.0/3.0 - PoleMass*CosTheta*CosTheta/SystemMass));
XDotDot  = temp - PoleMass*HalfPoleLength*ThetaDotDot*CosTheta/SystemMass;
% Perform Euler integration to calculate next state.
NextState = State + Ts.*[XDot;XDotDot;ThetaDot;ThetaDotDot];
% Copy next state to next observation.
NextObs = NextState;
% Check terminal condition.
X = NextObs(1);
Theta = NextObs(3);
IsDone = abs(X) > DisplacementThreshold || abs(Theta) > AngleThreshold;
% Calculate reward.
if ~IsDone
    Reward = RewardForNotFalling;
else
    Reward = PenaltyForFalling;
end
end
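The alternative mentioned earlier, in which Info is a structure holding both the state and the physical constants, might look like the following sketch. The function names resetWithPars and stepWithPars and the field names State and Pars are hypothetical and are not part of the example files. The sketch is written as a script with local functions, which requires R2016b or later.

```matlab
% Sketch: keep the state and the physical constants together in Info.
rng(0,"twister");
[Obs,Info] = resetWithPars();
[NextObs,Reward,IsDone,Info] = stepWithPars(10,Info);

function [InitialObservation,Info] = resetWithPars()
% Hypothetical reset function: define the constants once, at reset.
Info.Pars.Gravity = 9.8;
Info.Pars.MassCart = 1.0;
Info.Pars.MassPole = 0.1;
Info.Pars.HalfLength = 0.5;
Info.Pars.MaxForce = 10;
Info.Pars.Ts = 0.02;
Info.Pars.ThetaThresholdRadians = 12*pi/180;
Info.Pars.XThreshold = 2.4;
Info.Pars.RewardForNotFalling = 1;
Info.Pars.PenaltyForFalling = -10;
% State is [x; dx; theta; dtheta], with a random initial angle.
Info.State = [0; 0; 2*0.05*rand() - 0.05; 0];
InitialObservation = Info.State;
end

function [NextObs,Reward,IsDone,Info] = stepWithPars(Action,Info)
% Hypothetical step function: read constants and state from Info.
P = Info.Pars;
if ~ismember(Action,[-P.MaxForce P.MaxForce])
    error('Action must be %g or %g.',-P.MaxForce,P.MaxForce);
end
XDot = Info.State(2);
Theta = Info.State(3);
ThetaDot = Info.State(4);
CosTheta = cos(Theta);
SinTheta = sin(Theta);
SystemMass = P.MassCart + P.MassPole;
temp = (Action + P.MassPole*P.HalfLength*ThetaDot^2*SinTheta)/SystemMass;
% Same motion equations as in myStepFunction.m.
ThetaDotDot = (P.Gravity*SinTheta - CosTheta*temp) / ...
    (P.HalfLength*(4/3 - P.MassPole*CosTheta^2/SystemMass));
XDotDot = temp - P.MassPole*P.HalfLength*ThetaDotDot*CosTheta/SystemMass;
% Euler integration; the updated state is stored back into Info.
Info.State = Info.State + P.Ts.*[XDot;XDotDot;ThetaDot;ThetaDotDot];
NextObs = Info.State;
IsDone = abs(NextObs(1)) > P.XThreshold || ...
    abs(NextObs(3)) > P.ThetaThresholdRadians;
if ~IsDone
    Reward = P.RewardForNotFalling;
else
    Reward = P.PenaltyForFalling;
end
end
```

To use functions like these with rlFunctionEnv, you would save each one in its own file in the current working folder or on the MATLAB path, as required for any custom step and reset function.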
Use rlFunctionEnv to create a custom environment object using the observation and action specification, and the names of your step and reset functions.
env = rlFunctionEnv(ObsInfo,ActInfo,"myStepFunction","myResetFunction");
To verify the operation of your environment, rlFunctionEnv automatically calls validateEnvironment after creating the environment.
Pass Additional Arguments Using Anonymous Functions
The custom reset and step functions that you pass to rlFunctionEnv must have exactly zero and two input arguments, respectively. You can work around this limitation by using anonymous functions. Specifically, you define the reset and step functions passed to rlFunctionEnv as anonymous functions (with zero and two arguments, respectively). In turn, these anonymous functions call your custom functions, which have additional arguments.
For example, to pass the additional arguments arg1 and arg2 to both the step and reset function, you can write the following functions.
[InitialObservation,Info] = myResetFunction(arg1,arg2)
[Observation,Reward,IsDone,Info] = myStepFunction(Action,Info,arg1,arg2)
Then, with arg1 and arg2 in the MATLAB workspace, define the following handles to anonymous reset and step functions with zero and two arguments, respectively.
ResetHandle = @() myResetFunction(arg1,arg2);
StepHandle = @(Action,Info) myStepFunction(Action,Info,arg1,arg2);
If arg1 and arg2 are available at the time that ResetHandle and StepHandle are created, the workspaces of both anonymous functions include those values. The values persist within the function workspaces even if you clear the variables from the MATLAB workspace. When ResetHandle and StepHandle are evaluated, they invoke myResetFunction and myStepFunction and pass them a copy of arg1 and arg2. For more information, see Anonymous Functions.
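The capture behavior described above can be checked with a small sketch that does not involve any environment. The names Gain and ScaleHandle are hypothetical and stand in for an additional argument and for a handle such as StepHandle.

```matlab
% Hypothetical extra argument, available when the handle is created.
Gain = 2;

% The anonymous function stores the current value of Gain in its workspace.
ScaleHandle = @(x) Gain*x;

% Later changes to the workspace variable are not seen by the handle.
Gain = 100;

% Clearing the variable does not affect the stored copy either.
clear Gain

% The handle still uses the value captured at creation time.
Result = ScaleHandle(3);
```

Result is 6, because the handle uses the value of Gain that existed when the handle was created. If you later change a parameter in the workspace, you must re-create the handle for the change to take effect.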
Using additional input arguments can create a more efficient environment implementation. For example, myStepFunction2.m is a custom step function that takes the environment constants as its third input argument (EnvPars). By doing so, this function avoids redefining the environment constants at each step.
type myStepFunction2.m
function [NextObs,Reward,IsDone,NextState] = myStepFunction2(Action,State,EnvPars)
% Custom step function to construct cart-pole environment for the function
% handle case.
%
% This function applies the given action to the environment and evaluates
% the system dynamics for one simulation step.
% Check if the given action is valid.
if ~ismember(Action,[-EnvPars.MaxForce EnvPars.MaxForce])
    error('Action must be %g for going left and %g for going right.',...
        -EnvPars.MaxForce,EnvPars.MaxForce);
end
Force = Action;
% Unpack the state vector from the logged signals.
XDot = State(2);
Theta = State(3);
ThetaDot = State(4);
% Cache to avoid recomputation.
CosTheta = cos(Theta);
SinTheta = sin(Theta);
MassPole = EnvPars.MassPole;
HalfLen = EnvPars.HalfLength;
SystemMass = EnvPars.MassCart + MassPole;
temp = (Force + MassPole*HalfLen*ThetaDot*ThetaDot*SinTheta)/SystemMass;
% Apply motion equations.
ThetaDotDot = (EnvPars.Gravity*SinTheta - CosTheta*temp)...
    / (HalfLen*(4.0/3.0 - MassPole*CosTheta*CosTheta/SystemMass));
XDotDot  = temp - MassPole*HalfLen*ThetaDotDot*CosTheta/SystemMass;
% Perform Euler integration.
NextState = State + EnvPars.Ts.*[XDot;XDotDot;ThetaDot;ThetaDotDot];
% Copy next state to next observation.
NextObs = NextState;
% Check terminal condition.
X = NextObs(1);
Theta = NextObs(3);
IsDone = abs(X) > EnvPars.XThreshold || ...
         abs(Theta) > EnvPars.ThetaThresholdRadians;
% Calculate reward.
if ~IsDone
    Reward = EnvPars.RewardForNotFalling;
else
    Reward = EnvPars.PenaltyForFalling;
end
end
Create the structure that contains the environment parameters.
% Acceleration due to gravity in m/s^2
envPars.Gravity = 9.8;
% Mass of the cart
envPars.MassCart = 1.0;
% Mass of the pole
envPars.MassPole = 0.1;
% Half the length of the pole
envPars.HalfLength = 0.5;
% Max force the input can apply
envPars.MaxForce = 10;
% Sample time
envPars.Ts = 0.02;
% Angle at which to fail the episode
envPars.ThetaThresholdRadians = 12 * pi/180;
% Distance at which to fail the episode
envPars.XThreshold = 2.4;
% Reward each time step the cart-pole is balanced
envPars.RewardForNotFalling = 1;
% Penalty when the cart-pole fails to balance
envPars.PenaltyForFalling = -10;
Create an anonymous function that calls your custom step function, passing envPars as an additional input argument. 
StepHandle = @(Action,Info) myStepFunction2(Action,Info,envPars);
Because envPars is available at the time that StepHandle is created, the anonymous function workspace includes a copy of envPars. When StepHandle is evaluated, it calls myStepFunction2 passing its copy of envPars.
Use the same reset function, specifying it as a function handle rather than by using its name.
ResetHandle = @() myResetFunction;
Create the environment using the handles to the anonymous functions.
env2 = rlFunctionEnv(ObsInfo,ActInfo,StepHandle,ResetHandle);
Visually Inspect Outputs of Custom Functions
While rlFunctionEnv automatically calls validateEnvironment after creating the environment, it might be useful to visually inspect the output of your functions to further confirm that their behavior conforms to your expectations. To do so, initialize your environment using the reset function and run one simulation step using the step function. For reproducibility, set the random generator seed before validation.
Validate the environment created using function names. Fix the random number stream so that the reset function always returns the same initial observation.
rng(0,"twister");
InitialObs = reset(env)
InitialObs = 4×1
         0
         0
    0.0315
         0
[NextObs,Reward,IsDone,Info] = step(env,10); NextObs
NextObs = 4×1
         0
    0.1947
    0.0315
   -0.2826
Validate the environment created using function handles. Fix the random number stream so that the reset function always returns the same initial observation.
rng(0,"twister");
InitialObs2 = reset(env2)
InitialObs2 = 4×1
         0
         0
    0.0315
         0
[NextObs2,Reward2,IsDone2,Info2] = step(env2,10); NextObs2
NextObs2 = 4×1
         0
    0.1947
    0.0315
   -0.2826
Both environments initialize and simulate successfully, producing the same state values in NextObs.
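Beyond a single step, you can also run a short episode by hand to see when the termination condition triggers. The following minimal sketch assumes that env and ActInfo from this example already exist in the workspace. It picks one of the two allowed forces at random at each step, which stands in for an agent. The cap MaxSteps is an arbitrary value chosen for illustration.

```matlab
% Assumes env and ActInfo from this example exist in the workspace.
rng(0,"twister");

% Arbitrary cap on the episode length, chosen for illustration.
MaxSteps = 200;

Obs = reset(env);
TotalReward = 0;
NumSteps = 0;
IsDone = false;
while ~IsDone && NumSteps < MaxSteps
    % Pick one of the allowed force values at random.
    Action = ActInfo.Elements(randi(numel(ActInfo.Elements)));
    [Obs,Reward,IsDone] = step(env,Action);
    TotalReward = TotalReward + Reward;
    NumSteps = NumSteps + 1;
end
fprintf("Episode ended after %d steps with total reward %g.\n", ...
    NumSteps,TotalReward);
```

With random actions, the pole typically falls well before the cap is reached, so the loop usually ends because IsDone becomes true.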
Restore the random number stream using the information stored in previousRngState.
rng(previousRngState);