
createMDP

Create Markov decision process object

Description

A Markov decision process (MDP) is a discrete-time stochastic control process in which the states and actions belong to finite sets and state transitions are governed by stochastic rules. MDPs are useful for studying optimization problems solved using reinforcement learning. Use the createMDP function to create a GenericMDP object with specified states and actions. You can then modify the object properties, such as the state transition and reward matrices, and pass the object to rlMDPEnv to create an environment with which agents can interact.

MDP = createMDP(states,actions) creates a Markov decision process object with the specified states and actions.
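For instance, you can specify the states and actions either by count or by name. The following is a brief sketch; the explicit state and action names shown are arbitrary placeholders.

% Three states and two actions with default names ("s1","s2","s3" and "a1","a2")
MDP1 = createMDP(3,2);

% A model of the same size with explicit (placeholder) state and action names
MDP2 = createMDP(["Start";"Middle";"Goal"],["left";"right"]);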


Examples


Create a GenericMDP object with eight states and two possible actions.

MDP = createMDP(8,["up";"down"])
MDP = 
  GenericMDP with properties:

            CurrentState: "s1"
                  States: [8×1 string]
                 Actions: [2×1 string]
                       T: [8×8×2 double]
                       R: [8×8×2 double]
          TerminalStates: [0×1 string]
    ProbabilityTolerance: 8.8818e-16

Specify the state transitions and their associated rewards.

% State 1 transition and reward
MDP.T(1,2,1) = 1;
MDP.R(1,2,1) = 3;
MDP.T(1,3,2) = 1;
MDP.R(1,3,2) = 1;

% State 2 transition and reward
MDP.T(2,4,1) = 1;
MDP.R(2,4,1) = 2;
MDP.T(2,5,2) = 1;
MDP.R(2,5,2) = 1;

% State 3 transition and reward
MDP.T(3,5,1) = 1;
MDP.R(3,5,1) = 2;
MDP.T(3,6,2) = 1;
MDP.R(3,6,2) = 4;

% State 4 transition and reward
MDP.T(4,7,1) = 1;
MDP.R(4,7,1) = 3;
MDP.T(4,8,2) = 1;
MDP.R(4,8,2) = 2;

% State 5 transition and reward
MDP.T(5,7,1) = 1;
MDP.R(5,7,1) = 1;
MDP.T(5,8,2) = 1;
MDP.R(5,8,2) = 9;

% State 6 transition and reward
MDP.T(6,7,1) = 1;
MDP.R(6,7,1) = 5;
MDP.T(6,8,2) = 1;
MDP.R(6,8,2) = 1;

% State 7 transition and reward
MDP.T(7,7,1) = 1;
MDP.R(7,7,1) = 0;
MDP.T(7,7,2) = 1;
MDP.R(7,7,2) = 0;

% State 8 transition and reward
MDP.T(8,8,1) = 1;
MDP.R(8,8,1) = 0;
MDP.T(8,8,2) = 1;
MDP.R(8,8,2) = 0;

Specify the terminal states of the model.

MDP.TerminalStates = ["s7";"s8"];

You can now pass MDP to rlMDPEnv to create an environment in which you can train and simulate your agents.
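For example, this brief sketch creates the environment from the MDP object defined above.

% Create a reinforcement learning environment from the MDP model
env = rlMDPEnv(MDP);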

Input Arguments


Model states, specified as one of the following:

  • Positive integer — Specify the number of model states. In this case, each state has a default name, such as "s1" for the first state.

  • String vector — Specify the state names. In this case, the total number of states is equal to the length of the vector.

Model actions, specified as one of the following:

  • Positive integer — Specify the number of model actions. In this case, each action has a default name, such as "a1" for the first action.

  • String vector — Specify the action names. In this case, the total number of actions is equal to the length of the vector.

Output Arguments


MDP model, returned as a GenericMDP object with these properties.

Name of the current state, specified as a string.

Example: MDP.CurrentState = "s2";

State names, specified as a string vector with length equal to the number of states.

Example: MDP.States = ["America";"Europe";"China"];

Action names, specified as a string vector with length equal to the number of actions.

Example: MDP.Actions = ["GoWest";"GoEast"];

State transition matrix, specified as a 3-D array, which determines the possible movements of the agent in the environment. The state transition matrix T is a probability matrix that indicates the probability of the agent moving from the current state s to any possible next state s' by performing action a. T is an S-by-S-by-A array, where S is the number of states and A is the number of actions. It is given by:

T(s,s',a) = probability(s'|s,a)

For a given action, the transition probabilities out of a nonterminal state s must sum to either one or zero. Therefore, specify all stochastic transitions out of a given state at the same time.

For example, to indicate that in state 1 following action 4 there is an equal probability of moving to states 2 or 3, use this command:

MDP.T(1,[2 3],4) = [0.5 0.5];

You can also specify that, following an action, there is some probability of remaining in the same state.

MDP.T(1,[1 2 3 4],1) = [0.25 0.25 0.25 0.25];

Example: MDP.T(1,[1 2 3],1) = [0.25 0.5 0.25]
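As a quick check that T satisfies this requirement, you can sum the outgoing probabilities for every state-action pair. This is an illustrative sketch, not part of the createMDP interface.

% For each state and action, sum the probabilities over all next states.
% The result is an S-by-A matrix whose entries must be 0 or (within tolerance) 1.
outgoing = squeeze(sum(MDP.T,2));
assert(all(outgoing(:) == 0 | abs(outgoing(:) - 1) <= MDP.ProbabilityTolerance))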

Reward transition matrix, specified as a 3-D array, which determines how much reward the agent receives after performing an action in the environment. R has the same size as the state transition matrix T. The reward for moving from state s to state s' by performing action a is given by:

r = R(s,s',a).

Example: MDP.R(1,[1 2 3],1) = [-1 0.5 2]

Terminal state names, specified as a string vector of state names.

Example: MDP.TerminalStates = "s3"

Version History

Introduced in R2019a