MATLAB Answers

Updating the policy with the sim function and a custom training loop

shoki kobayashi on 30 Nov 2020
Answered: Anh Tran on 8 Dec 2020
I'm trying to train a PPO agent using the sim function inside a custom training loop. However, when I use sim, the actor and critic networks do not update properly, and the agent keeps repeating the same behavior. How can I get the network updates to take effect? Is it a bad idea to use sim instead of step in the first place? As far as I can tell, sim is the only way to run a custom loop with a Simulink environment, since my impression is that step is meant for storing a history of actions, states, and rewards in a buffer. I want to do this without using the train function.
% PPO without using the train function
clear all
rng(0)
%Construct Environment
mdl = 'cartpole';
open_system(mdl)
env = rlPredefinedEnv('CartPoleSimscapeModel-Continuous')
obsInfo = getObservationInfo(env);
numObs = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
numAct = actInfo.Dimension(1);
Ts = 0.02;
Tf = 25;
% PPOAgent
criticLayerSizes = [128 200];
actorLayerSizes = [128 200];
createNetworkWeights;
criticNetwork = [imageInputLayer([numObs 1 1],'Normalization','none','Name','observations')
fullyConnectedLayer(criticLayerSizes(1),'Name','CriticFC1')
reluLayer('Name','CriticRelu1')
fullyConnectedLayer(criticLayerSizes(2),'Name','CriticFC2')
reluLayer('Name','CriticRelu2')
fullyConnectedLayer(1,'Name','CriticOutput')
];
criticOpts = rlRepresentationOptions('LearnRate',1e-3);
critic = rlValueRepresentation(criticNetwork,env.getObservationInfo, ...
'Observation',{'observations'},criticOpts);
%ActorNetwork
inPath = [ imageInputLayer([numObs 1 1], 'Normalization','none','Name','observations')
fullyConnectedLayer(numAct,'Name','infc') ]; % 2 by 1 output
% path layers for mean value (2 by 1 input and 2 by 1 output)
% using scalingLayer to scale the range
meanPath = [ tanhLayer('Name','tanh'); % output range: (-1,1)
scalingLayer('Name','scale','Scale',actInfo.UpperLimit) ]; % output range: (-10,10)
% path layers for standard deviations (2 by 1 input and output)
% using softplus layer to make it non negative
sdevPath = softplusLayer('Name', 'splus');
outLayer = concatenationLayer(3,2,'Name','mean&sdev');
% add layers to network object
net = layerGraph(inPath);
net = addLayers(net,meanPath);
net = addLayers(net,sdevPath);
net = addLayers(net,outLayer);
% connect layers: the mean value path output MUST be connected to the FIRST input of the concatenationLayer
net = connectLayers(net,'infc','tanh/in'); % connect output of inPath to meanPath input
net = connectLayers(net,'infc','splus/in'); % connect output of inPath to sdevPath input
net = connectLayers(net,'scale','mean&sdev/in1'); % connect output of meanPath to gaussPars input #1
net = connectLayers(net,'splus','mean&sdev/in2'); % connect output of sdevPath to gaussPars input #2
actorOptions = rlRepresentationOptions('LearnRate',1e-3);
Actor = rlStochasticActorRepresentation(net,obsInfo,actInfo,...
'Observation',{'observations'}, actorOptions);
opt = rlPPOAgentOptions('ExperienceHorizon',512,...
'ClipFactor',0.2,...
'EntropyLossWeight',0.02,...
'MiniBatchSize',64,...
'NumEpoch',5,...
'AdvantageEstimateMethod','gae',...
'GAEFactor',0.95,...
'SampleTime',Ts,...
'DiscountFactor',0.9995);
Actor = setLoss(Actor, @actorLossFunction);
agent = rlPPOAgent(Actor,critic,opt);
%prepare train
numEpisodes = 20000;
maxStepsPerEpisode = ceil(Tf/Ts);
discountFactor = 0.995;
aveWindowSize = 100;
trainingTerminationValue = 400;
episodeCumulativeRewardVector = [];
[trainingPlot,lineReward,lineAveReward] = hBuildFigure;
%Start learn
% Enable the training visualization plot.
set(trainingPlot,'Visible','on');
% Train the policy for the maximum number of episodes or until the average
% reward indicates that the policy is sufficiently trained.
for episodeCt = 1:numEpisodes
%sim
simout = sim(agent, env);
% Create training data. Training is performed using batch data; the
% batch size equals the length of the episode.
%batchSize = min(maxStepsPerEpisode,maxStepsPerEpisode);
batchsize = size(simout.Observation.observations.Data,3);
nextobservationBatch = simout.Observation.observations.Data(:,:,2:batchsize);
actionBatch = simout.Action.Action.Data;
rewardBatch = simout.Reward.Data';
isdonebatch = simout.IsDone.Data';
observationBatch = simout.Observation.observations.Data(:,:,1:batchsize-1);
episoderewardBatch = simout.Reward.Data;
% Compute the discounted future reward.
discountedReturn = zeros(1,batchsize-1);
for t = 1:batchsize-1
G = 0;
for k = t:batchsize-1
G = G + discountFactor ^ (k-t) * rewardBatch(k);
end
discountedReturn(t) = G;
end
%Gather the information needed to learn PPO
Observation{1} = observationBatch; %cellarray
nextobservation{1} = nextobservationBatch;%cellarray
[Advantages, CriticTargets] = computeGeneralizedAdvantage(critic, opt.DiscountFactor, opt.GAEFactor,Observation, nextobservation, rewardBatch,isdonebatch);
Action = actionBatch;
obsDimension{1} = obsInfo.Dimension;
ObsDimsToSlice = cellfun(@(x) numel(x) + 1, obsDimension','UniformOutput',false);
BufferLength = numel(CriticTargets);
%--------------------------------------------------------------------------------------------
OldActionProb = evaluate(Actor, Observation);
OldActionProb = OldActionProb{1};
OldActionProb = evaluate(Actor.SamplingStrategy, OldActionProb, Action);
LossVariable.ClipFactor = opt.ClipFactor;
LossVariable.EntropyLossWeight = opt.EntropyLossWeight;
LossVariable.SamplingStrategy = Actor.SamplingStrategy;
LossVariable.Action = Action;
LossVariable.OldPolicy = OldActionProb;
LossVariable.Advantage = Advantages;
MiniBatchIdx = rl.internal.dataTransformation.getMiniBatchIdx(BufferLength, opt.MiniBatchSize, 1);
for epoch = 1:opt.NumEpoch
for ct = 1:numel(MiniBatchIdx)
% Slice mini batch data
SingleBatchIdx = MiniBatchIdx{ct};
MiniBatchObs = rl.internal.dataTransformation.generalSubref(Observation, SingleBatchIdx, ObsDimsToSlice);
MiniBatchCriticTargets = rl.internal.dataTransformation.generalSubref(CriticTargets, SingleBatchIdx, ndims(CriticTargets));
% REVISIT: support single action channel
LossVariable.Action = rl.internal.dataTransformation.generalSubref(Action, SingleBatchIdx, ndims(Action));
LossVariable.OldPolicy = rl.internal.dataTransformation.generalSubref(OldActionProb, SingleBatchIdx, ndims(OldActionProb));
LossVariable.Advantage = rl.internal.dataTransformation.generalSubref(Advantages, SingleBatchIdx, ndims(Advantages));
% Scale the gradient based on ratio of current minibatch size over specified minibatch size
GradScale = single(numel(SingleBatchIdx)/opt.MiniBatchSize);
GradVal = gradient(critic, 'loss-parameters', MiniBatchObs, MiniBatchCriticTargets);
GradVal = rl.internal.dataTransformation.scaleLearnables(GradVal, GradScale);
critic = optimize(critic, GradVal);
% Update Actor
GradVal = gradient(Actor,'loss-parameters',MiniBatchObs,LossVariable);
GradVal = rl.internal.dataTransformation.scaleLearnables(GradVal, GradScale);
Actor = optimize(Actor, GradVal);
end
end
episodeCumulativeReward = sum(episoderewardBatch);
episodeCumulativeRewardVector = cat(2,...
episodeCumulativeRewardVector,episodeCumulativeReward);
movingAveReward = movmean(episodeCumulativeRewardVector,...
aveWindowSize,2);
addpoints(lineReward,episodeCt,episodeCumulativeReward);
addpoints(lineAveReward,episodeCt,movingAveReward(end));
drawnow;
if max(movingAveReward) > trainingTerminationValue
break
end
end
%plot env
obs = reset(env);
plot(env)
for stepCt = 1:maxStepsPerEpisode % separate counter so the step limit is not overwritten
% Select action according to trained policy
action = getAction(Actor,{obs});
% Step the environment
[nextObs,reward,isdone] = step(env,action{1});
% Check for terminal condition
if isdone
break
end
obs = nextObs;
end
%create Function
function [Advantage, TDTarget] = computeGeneralizedAdvantage(StateValueEstimator, DiscountFactor, GAEFactor, Observation, nextObservation, rewardBatch,isdonebatch)
% Vectorized generalized advantage estimator (GAE)
% REVISIT: current implementation supports single episode
%BatchExperience = getBatchExperience(obj,hasState(StateValueEstimator));
% Unpack experience
% Observation = BatchExperience{1};
% Reward = BatchExperience{3};
% NextObservation = BatchExperience{4};
% IsDone = BatchExperience{5};
SequenceLength = numel(rewardBatch);
% Estimate current and next state values
CurrentStateValue = getValue(StateValueEstimator, Observation);
NextStateValue = getValue(StateValueEstimator, nextObservation);
NextStateValue(isdonebatch == 1) = 0; % early termination
% Vectorized GAE Advantages
% TDError = [TDError(1) TDError(2) ... TDError(4)]
TDError = rewardBatch + ...
reshape(DiscountFactor * NextStateValue - CurrentStateValue, size(rewardBatch));
if GAEFactor == 0
% If GAEFactor == 0, similar to 1 step look ahead (or TD0)
Advantage = TDError;
else
% Adv(1) = TDError(1) + A*TDError(2) + A^2*TDError(3) + A^3*TDError(4)
% Adv(2) = TDError(2) + A^1*TDError(3) + A^2*TDError(4)
% Adv(3) = TDError(3) + A^1*TDError(4)
% Adv(4) = TDError(4)
% ...
% Adv = [TDError(1) TDError(2) ... TDError(4)] * [ 1 0 0 0
% A^1 1 0 0
% A^2 A^1 1 0
% A^3 A^2 A^1 1]
% Adv = TDError * DiscountWeights
WeightsMatrix = repmat((0:SequenceLength-1)',1,SequenceLength) - (0:SequenceLength-1);
%WeightsMatrix =
% [0 -1 -2 -3 -4
% 1 0 -1 -2 -3
% 2 1 0 -1 -2
% 3 2 1 0 -1
% 4 3 2 1 0]
DiscountWeights = tril((DiscountFactor*GAEFactor) .^ WeightsMatrix);
% With A = DiscountFactor*GAEFactor, DiscountWeights =
% [ 1 0 0 0
% A^1 1 0 0
% A^2 A^1 1 0
% A^3 A^2 A^1 1]
Advantage = TDError(:)' * DiscountWeights;
end
% Temporal difference target = Advantage[s] + V[s]
Advantage = reshape(Advantage, size(CurrentStateValue));
TDTarget = Advantage + CurrentStateValue;
end
%Loss Function
function Loss = actorLossFunction(MeanAndStd, LossVariable)
% Clipped PPO with entropy loss function for continuous action space
% MeanAndStd: dlarray of current policy action probabilities (model output)
% LossVariable: struct contains
% - SamplingStrategy
% - Action: previous action
% - OldPolicy: old action policy piOld(at|st)
% - Advantage
% - ClipFactor: scalar > 0
% - EntropyLossWeight: scalar where 0 <= EntropyLossWeight <= 1
% Copyright 2019 The MathWorks Inc.
% Extract information from input
Advantage = LossVariable.Advantage;
OldPolicy = LossVariable.OldPolicy;
NumExperience = numel(Advantage);
% compute pi(at|st)
Policy = evaluate(LossVariable.SamplingStrategy, MeanAndStd, LossVariable.Action);
% rt = pi(at|st)/piOld(at|st), avoid division by zero
Ratio = Policy ./ rl.internal.dataTransformation.boundAwayFromZero(OldPolicy);
% obj = rt * At
Advantage = reshape(Advantage, 1, NumExperience);
Objective = Ratio .* Advantage;
ObjectiveClip = max(min(Ratio, 1 + LossVariable.ClipFactor), 1 - LossVariable.ClipFactor) .* Advantage;
% clipped surrogate loss
SurrogateLoss = -sum(min(Objective, ObjectiveClip),'all')/NumExperience;
% entropy loss
EntropyLoss = rl.loss.policyEntropyContinuous(MeanAndStd, ...
LossVariable.EntropyLossWeight,NumExperience);
% total loss
Loss = SurrogateLoss + EntropyLoss;
end
function [trainingPlot, lineReward, lineAveReward] = hBuildFigure()
plotRatio = 16/9;
trainingPlot = figure(...
'Visible','off',...
'HandleVisibility','off', ...
'NumberTitle','off',...
'Name','Cart Pole Custom Training');
trainingPlot.Position(3) = plotRatio * trainingPlot.Position(4);
ax = gca(trainingPlot);
lineReward = animatedline(ax);
lineAveReward = animatedline(ax,'Color','r','LineWidth',3);
xlabel(ax,'Episode');
ylabel(ax,'Reward');
legend(ax,'Cumulative Reward','Average Reward','Location','northwest')
title(ax,'Training Progress');
end
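The matrix form used in computeGeneralizedAdvantage above can be cross-checked against the usual backward recursion for generalized advantage estimation. The following is only a minimal sketch in plain MATLAB: the rewards and value estimates are made-up numbers, and the variable names (rewards, values, nextValues, gamma, lambda) are assumptions for illustration, not part of the script above.

```matlab
% Made-up inputs for illustration only (not taken from the script above).
rewards    = [1 1 1 1];          % r(t)
values     = [0.5 0.4 0.3 0.2];  % V(s_t) from a hypothetical critic
nextValues = [0.4 0.3 0.2 0];    % V(s_{t+1}); 0 at the terminal step
gamma  = 0.99;                   % discount factor
lambda = 0.95;                   % GAE factor

% One-step TD errors: delta(t) = r(t) + gamma*V(s_{t+1}) - V(s_t)
tdError = rewards + gamma*nextValues - values;

% Backward recursion: adv(t) = delta(t) + gamma*lambda*adv(t+1)
n = numel(tdError);
advRecursive = zeros(1,n);
runningAdv = 0;
for t = n:-1:1
    runningAdv = tdError(t) + gamma*lambda*runningAdv;
    advRecursive(t) = runningAdv;
end

% Matrix form, following the construction in computeGeneralizedAdvantage
exponents = (0:n-1)' - (0:n-1);               % exponents(i,j) = i - j
weights   = tril((gamma*lambda).^exponents);  % lower-triangular discount weights
advMatrix = tdError * weights;

% Critic targets: advantage plus the current value estimate
criticTargets = advRecursive + values;

% Largest disagreement between the two forms (expected to be round-off only)
maxDifference = max(abs(advRecursive - advMatrix));
```

The recursion needs O(N) work per episode, whereas the matrix form builds an N-by-N weight matrix, so the recursive version may be preferable for long episodes.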

Answers (1)

Anh Tran on 8 Dec 2020
The approach looks OK, but there is one issue: you must update the agent's actor and critic after each learning iteration. Before calling sim, set the latest actor and critic on the agent:
for episodeCt = 1:numEpisodes
    % update the agent with the latest actor and critic
    agent = setActor(agent,Actor);
    agent = setCritic(agent,critic); % lowercase 'critic', matching the variable name in your script
    % sim
    simout = sim(agent, env);
    ...
end
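Why the write-back is needed can be illustrated with plain MATLAB value semantics. This is only a toy sketch: it assumes the agent keeps its own copy of the actor it was constructed with, and it uses ordinary structs (the hypothetical actor and agent variables below) as stand-ins for the toolbox objects.

```matlab
% Hypothetical stand-ins: plain structs instead of toolbox objects.
actor.weights = 0;          % local "actor" variable in the training script
agent.actor   = actor;      % the "agent" stores a copy when it is created

% A learning step updates only the local variable...
actor.weights = actor.weights + 1;
staleWeights  = agent.actor.weights;   % the copy inside the agent is unchanged

% ...until the updated copy is written back, which is the role that
% setActor and setCritic play in the loop above.
agent.actor   = actor;
syncedWeights = agent.actor.weights;
```

In this sketch staleWeights stays at its initial value while syncedWeights reflects the update, which mirrors the symptom of sim repeating the same behavior.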
Instead of a custom training loop, you can write a custom agent (subclass) to work with a Simulink environment. In this example, we show how to convert a custom training loop into a custom agent. The benefits:
  • The Simulink environment is not recompiled on every run (which is what happens with your current approach)
  • You can use the train() function and get reward reporting by default
  • You can also set breakpoints inside your custom agent and debug during training
  • With your approach, you can only update after one or more episodes have finished. A custom agent has no such restriction: it can learn() at any point
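
The last point can be sketched in plain MATLAB as update timing that is decoupled from episode boundaries. This is a hypothetical illustration, not the custom agent API: the episode lengths are made up, and the counter numUpdates stands in for whatever learning step the agent would perform.

```matlab
% Hypothetical sketch: trigger a learning update every `horizon` steps,
% independent of where episodes start or end.
horizon        = 4;        % number of experiences collected per update
episodeLengths = [3 6 2];  % made-up episode lengths
bufferCount    = 0;        % experiences collected since the last update
numUpdates     = 0;        % stand-in for calls to the learning step

for ep = 1:numel(episodeLengths)
    for t = 1:episodeLengths(ep)
        bufferCount = bufferCount + 1;      % store one experience
        if bufferCount == horizon
            numUpdates  = numUpdates + 1;   % the learning step would run here
            bufferCount = 0;                % start collecting the next batch
        end
    end
end
```

With these numbers, updates happen in the middle of the second episode and near its end, rather than only when an episode finishes.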
