On updating the policy with sim functions and Custom Loop

Question

shoki kobayashi on 30 Nov 2020

Direct link to this question

https://ch.mathworks.com/matlabcentral/answers/669538-on-updating-the-policy-with-sim-functions-and-custom-loop

Commented: jiayi on 25 Apr 2023

I'm trying to train a PPO agent using the sim function and a custom training loop. However, when I use sim, the actor and critic networks don't update properly and the agent keeps repeating the same behavior. How can I get the network updates to work? Is it a bad idea to use sim instead of step in the first place? I think sim is the only way to run a custom loop in a Simulink environment, since my impression is that step stores a history of actions, states, and rewards in a buffer. I want to do this without using the train function.

% PPO without using the train function
clear all
rng(0)
%Construct Environment
mdl = 'cartpole';
open_system(mdl)
env = rlPredefinedEnv('CartPoleSimscapeModel-Continuous')
obsInfo = getObservationInfo(env);
numObs = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
numAct = actInfo.Dimension(1);
Ts = 0.02;
Tf = 25;
% PPOAgent
criticLayerSizes = [128 200];
actorLayerSizes = [128 200];
createNetworkWeights;
criticNetwork = [imageInputLayer([numObs 1 1],'Normalization','none','Name','observations')
    fullyConnectedLayer(criticLayerSizes(1),'Name','CriticFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(criticLayerSizes(2),'Name','CriticFC2')
    reluLayer('Name','CriticRelu2')
    fullyConnectedLayer(1,'Name','CriticOutput')
                          ];
                      
criticOpts = rlRepresentationOptions('LearnRate',1e-3);
critic = rlValueRepresentation(criticNetwork,env.getObservationInfo, ...
                          'Observation',{'observations'},criticOpts);
%ActorNetwork
     
inPath = [ imageInputLayer([numObs 1 1], 'Normalization','none','Name','observations') 
           fullyConnectedLayer(numAct,'Name','infc') ]; % 2 by 1 output
% path layers for mean value (2 by 1 input and 2 by 1 output)
% using scalingLayer to scale the range
meanPath = [ tanhLayer('Name','tanh'); % output range: (-1,1)
             scalingLayer('Name','scale','Scale',actInfo.UpperLimit) ]; % output range: (-10,10)
% path layers for standard deviations (2 by 1 input and output)
% using softplus layer to make it non negative
sdevPath =  softplusLayer('Name', 'splus');
outLayer = concatenationLayer(3,2,'Name','mean&sdev');
% add layers to network object
net = layerGraph(inPath);
net = addLayers(net,meanPath);
net = addLayers(net,sdevPath);
net = addLayers(net,outLayer);
% connect layers: the mean value path output MUST be connected to the FIRST input of the concatenationLayer
net = connectLayers(net,'infc','tanh/in');              % connect output of inPath to meanPath input
net = connectLayers(net,'infc','splus/in');             % connect output of inPath to sdevPath input
net = connectLayers(net,'scale','mean&sdev/in1');       % connect output of meanPath to gaussPars input #1
net = connectLayers(net,'splus','mean&sdev/in2');       % connect output of sdevPath to gaussPars input #2
actorOptions = rlRepresentationOptions('LearnRate',1e-3);
Actor = rlStochasticActorRepresentation(net,obsInfo,actInfo,... 
                         'Observation',{'observations'}, actorOptions);
                     
opt = rlPPOAgentOptions('ExperienceHorizon',512,...
                        'ClipFactor',0.2,...
                        'EntropyLossWeight',0.02,...
                        'MiniBatchSize',64,...
                        'NumEpoch',5,...
                        'AdvantageEstimateMethod','gae',...
                        'GAEFactor',0.95,...
                        'SampleTime',Ts,...
                        'DiscountFactor',0.9995);
 Actor = setLoss(Actor, @actorLossFunction);
 agent = rlPPOAgent(Actor,critic,opt); 
%prepare train
numEpisodes = 20000;
maxStepsPerEpisode = ceil(Tf/Ts);
discountFactor = 0.995;
aveWindowSize = 100;
trainingTerminationValue = 400;
episodeCumulativeRewardVector = [];
[trainingPlot,lineReward,lineAveReward] = hBuildFigure;
%Start learn
% Enable the training visualization plot.
set(trainingPlot,'Visible','on');
% Train the policy for the maximum number of episodes or until the average
% reward indicates that the policy is sufficiently trained.
for episodeCt = 1:numEpisodes
    %sim
    simout = sim(agent, env);
    % 4. Create training data. Training is performed using batch data. The
    % batch size equal to the length of the episode.
   %batchSize = min(maxStepsPerEpisode,maxStepsPerEpisode);
    batchsize = size(simout.Observation.observations.Data,3);
    nextobservationBatch = simout.Observation.observations.Data(:,:,2:batchsize);
    actionBatch = simout.Action.Action.Data;
    rewardBatch = simout.Reward.Data';
    isdonebatch = simout.IsDone.Data';
    observationBatch = simout.Observation.observations.Data(:,:,1:batchsize-1);
    episoderewardBatch = simout.Reward.Data;
    % Compute the discounted future reward.
    discountedReturn = zeros(1,batchsize-1);
    for t = 1:batchsize-1
        G = 0;
        for k = t:batchsize-1
            G = G + discountFactor ^ (k-t) * rewardBatch(k);
        end
        discountedReturn(t) = G;
    end
%Gather the information needed to learn PPO
    Observation{1} =  observationBatch; %cellarray
    nextobservation{1} = nextobservationBatch;%cellarray    
    [Advantages, CriticTargets] = computeGeneralizedAdvantage(critic, opt.DiscountFactor, opt.GAEFactor,Observation, nextobservation, rewardBatch,isdonebatch);
    Action = actionBatch;
    obsDimension{1} = obsInfo.Dimension;
    ObsDimsToSlice = cellfun(@(x) numel(x) + 1, obsDimension','UniformOutput',false);
    BufferLength = numel(CriticTargets);
%-------------------------------------------------------------------------------------------- 
    OldActionProb = evaluate(Actor, Observation);
    OldActionProb = OldActionProb{1};
    OldActionProb = evaluate(Actor.SamplingStrategy, OldActionProb, Action);
    
    LossVariable.ClipFactor = opt.ClipFactor;
    LossVariable.EntropyLossWeight = opt.EntropyLossWeight;
    LossVariable.SamplingStrategy = Actor.SamplingStrategy;
    LossVariable.Action = Action;
    LossVariable.OldPolicy = OldActionProb;
    LossVariable.Advantage = Advantages;
    
    MiniBatchIdx = rl.internal.dataTransformation.getMiniBatchIdx(BufferLength, opt.MiniBatchSize, 1);
for epoch = 1:opt.NumEpoch
     for ct = 1:numel(MiniBatchIdx)
        % Slice mini batch data
        SingleBatchIdx = MiniBatchIdx{ct};
        MiniBatchObs           = rl.internal.dataTransformation.generalSubref(Observation, SingleBatchIdx, ObsDimsToSlice);
        MiniBatchCriticTargets = rl.internal.dataTransformation.generalSubref(CriticTargets, SingleBatchIdx, ndims(CriticTargets));
        % REVISIT: support single action channel
        LossVariable.Action    = rl.internal.dataTransformation.generalSubref(Action, SingleBatchIdx, ndims(Action));
        LossVariable.OldPolicy = rl.internal.dataTransformation.generalSubref(OldActionProb, SingleBatchIdx, ndims(OldActionProb));
        LossVariable.Advantage = rl.internal.dataTransformation.generalSubref(Advantages, SingleBatchIdx, ndims(Advantages));
        
        % Scale the gradient based on ratio of current minibatch size over specified minibatch size
        GradScale = single(numel(SingleBatchIdx)/opt.MiniBatchSize);
        
        GradVal = gradient(critic, 'loss-parameters', MiniBatchObs, MiniBatchCriticTargets);
        GradVal = rl.internal.dataTransformation.scaleLearnables(GradVal, GradScale);
        critic = optimize(critic, GradVal);
  
        
        % Update Actor
        GradVal = gradient(Actor,'loss-parameters',MiniBatchObs,LossVariable);
        GradVal = rl.internal.dataTransformation.scaleLearnables(GradVal, GradScale);
        Actor = optimize(Actor, GradVal);
     end
end
    episodeCumulativeReward = sum(episoderewardBatch);
    episodeCumulativeRewardVector = cat(2,...
        episodeCumulativeRewardVector,episodeCumulativeReward);
    movingAveReward = movmean(episodeCumulativeRewardVector,...
        aveWindowSize,2);
    addpoints(lineReward,episodeCt,episodeCumulativeReward);
    addpoints(lineAveReward,episodeCt,movingAveReward(end));
    drawnow;
    if max(movingAveReward) > trainingTerminationValue
        break
    end
    
end
%plot env
obs = reset(env);
plot(env)
for stepCt = 1:maxStepsPerEpisode
    
    % Select action according to trained policy
    action = getAction(Actor,{obs});
        
    % Step the environment
    [nextObs,reward,isdone] = step(env,action{1});
    
    % Check for terminal condition
    if isdone
        break
    end
    
    obs = nextObs;
    
end
%create Function
function [Advantage, TDTarget] = computeGeneralizedAdvantage(StateValueEstimator, DiscountFactor, GAEFactor, Observation, nextObservation, rewardBatch,isdonebatch)
    % Vectorized generalized advantage estimator (GAE)
    % REVISIT: current implementation supports single episode
    
    %BatchExperience = getBatchExperience(obj,hasState(StateValueEstimator));
    
    % Unpack experience
%     Observation       = BatchExperience{1};
%     Reward            = BatchExperience{3};
%     NextObservation   = BatchExperience{4};
%     IsDone            = BatchExperience{5};
    SequenceLength = numel(rewardBatch);
    
    % Estimate current and next state values
    CurrentStateValue = getValue(StateValueEstimator, Observation);
    NextStateValue = getValue(StateValueEstimator, nextObservation);
    NextStateValue(isdonebatch == 1) = 0; % early termination
    
    % Vectorized GAE Advantages
    % TDError = [TDError(1) TDError(2) ... TDError(4)]
    TDError = rewardBatch + ...
        reshape(DiscountFactor * NextStateValue - CurrentStateValue, size(rewardBatch));
    if GAEFactor == 0
        % If GAEFactor == 0, similar to 1 step look ahead (or TD0)
        Advantage = TDError;
    else
        % Adv(1) = TDError(1) + A*TDError(2) + A^2*TDError(3) + A^3*TDError(4)
        % Adv(2) =                TDError(2) + A^1*TDError(3) + A^2*TDError(4)
        % Adv(3) =                                 TDError(3) + A^1*TDError(4)
        % Adv(4) =                                                  TDError(4)
        % ...
        % Adv    = [TDError(1) TDError(2) ... TDError(4)] * [  1     0     0    0
        %                                                    A^1     1     0    0
        %                                                    A^2   A^1     1    0
        %                                                    A^3   A^2   A^1    1]
        % Adv    =                   TDError              *  DiscountWeights
        
        WeightsMatrix = repmat((0:SequenceLength-1)',1,SequenceLength) - (0:SequenceLength-1);
        %WeightsMatrix =
        %   [0   -1   -2   -3   -4
        %    1    0   -1   -2   -3
        %    2    1    0   -1   -2
        %    3    2    1    0   -1
        %    4    3    2    1    0]
        DiscountWeights = tril((DiscountFactor*GAEFactor) .^ WeightsMatrix);
        % With A = DiscountFactor*GAELambda, DiscountWeights =
        %  [  1     0     0    0
        %   A^1     1     0    0
        %   A^2   A^1     1    0
        %   A^3   A^2   A^1    1]
        Advantage = TDError(:)' * DiscountWeights;
    end
    % Temporal different target = Advantage[s] + V[s]
    Advantage = reshape(Advantage, size(CurrentStateValue));
    TDTarget = Advantage + CurrentStateValue;
end
%Loss Function
function Loss = actorLossFunction(MeanAndStd, LossVariable)
% Clipped PPO with entropy loss function function for continuous action space
%   MeanAndStd: dlarray of current policy action probabilities (model output)
%   LossVariable: struct contains
%       - SamplingStrategy
%       - Action: previous action
%       - OldPolicy: old action policy piOld(at|st)
%       - Advantage
%       - ClipFactor: scalar > 0
%       - EntropyLossWeight: scalar where 0 <= EntropyLossWeight <= 1
% Copyright 2019 The MathWorks Inc.
% Extract information from input
Advantage = LossVariable.Advantage;
OldPolicy = LossVariable.OldPolicy;
NumExperience = numel(Advantage);
% compute pi(at|st)
Policy = evaluate(LossVariable.SamplingStrategy, MeanAndStd, LossVariable.Action);
% rt = pi(at|st)/piOld(at|st), avoid division by zero
Ratio = Policy ./ rl.internal.dataTransformation.boundAwayFromZero(OldPolicy);
% obj = rt * At
Advantage = reshape(Advantage, 1, NumExperience);
Objective = Ratio .* Advantage;
ObjectiveClip = max(min(Ratio, 1 + LossVariable.ClipFactor), 1 - LossVariable.ClipFactor) .* Advantage;
% clipped surrogate loss
SurrogateLoss = -sum(min(Objective, ObjectiveClip),'all')/NumExperience;
% entropy loss
EntropyLoss = rl.loss.policyEntropyContinuous(MeanAndStd, ...
    LossVariable.EntropyLossWeight,NumExperience);
% total loss
Loss = SurrogateLoss + EntropyLoss;
end
function [trainingPlot, lineReward, lineAveReward] = hBuildFigure()
    plotRatio = 16/9;
    trainingPlot = figure(...
                'Visible','off',...
                'HandleVisibility','off', ...
                'NumberTitle','off',...
                'Name','Cart Pole Custom Training');
    trainingPlot.Position(3) = plotRatio * trainingPlot.Position(4);
    
    ax = gca(trainingPlot);
    
    lineReward = animatedline(ax);
    lineAveReward = animatedline(ax,'Color','r','LineWidth',3);
    xlabel(ax,'Episode');
    ylabel(ax,'Reward');
    legend(ax,'Cumulative Reward','Average Reward','Location','northwest')
    title(ax,'Training Progress');
end
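
For readers checking the advantage computation in computeGeneralizedAdvantage above, here is a minimal sketch in plain MATLAB (no Reinforcement Learning Toolbox needed). It compares the vectorized matrix form with the usual backward recursion Adv(t) = TDError(t) + A*Adv(t+1). The TD error values and the names tdError, advMatrix, and advRecursive are made up for illustration and do not come from the post.

```matlab
% Hypothetical inputs: TD errors for a 4-step episode (made-up values)
tdError = [1.0 -0.5 0.25 2.0];
discountFactor = 0.9995;   % same value as opt.DiscountFactor in the post
gaeFactor = 0.95;          % same value as opt.GAEFactor in the post
A = discountFactor * gaeFactor;
n = numel(tdError);

% Vectorized form, mirroring computeGeneralizedAdvantage
weightsMatrix = repmat((0:n-1)', 1, n) - (0:n-1);   % entry (i,j) = i - j
discountWeights = tril(A .^ weightsMatrix);         % keep only i >= j
advMatrix = tdError(:)' * discountWeights;

% Reference form: backward recursion Adv(t) = TDError(t) + A*Adv(t+1)
advRecursive = zeros(1, n);
runningAdv = 0;
for t = n:-1:1
    runningAdv = tdError(t) + A * runningAdv;
    advRecursive(t) = runningAdv;
end

% Largest difference between the two forms (should be at rounding level)
maxDiff = max(abs(advMatrix - advRecursive));
disp(maxDiff)
```

If the two forms agree, the advantage code is probably not the cause of the missing updates, and the problem is more likely in how the updated networks reach the agent.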

1 Comment

jiayi on 25 Apr 2023

What is Actor.SamplingStrategy, and how was it obtained?

Answer 1

Anh Tran on 8 Dec 2020

Direct link to this answer

https://ch.mathworks.com/matlabcentral/answers/669538-on-updating-the-policy-with-sim-functions-and-custom-loop#answer_569093

The approach looks OK, but there is one issue: you must update the agent's actor and critic after each learning iteration. So, before calling sim, set the updated actor and critic on the agent:

for episodeCt = 1:numEpisodes
    % update actor, critic
    agent = setActor(agent,Actor);
    agent = setCritic(agent,critic);   % the question's variable is named "critic"
    
    % sim
    simout = sim(agent, env);
    ...
end
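
Why the set calls matter: the agent was built from the actor and critic once, and the loop then reassigns the local variables with the results of optimize. Assuming the representations behave like value objects (which the Actor = optimize(Actor, GradVal) pattern suggests), the agent keeps the copies it was built with. The sketch below shows the same effect with plain structs. The names toyActor and toyAgent and the field weights are hypothetical stand-ins, not toolbox objects.

```matlab
% Hypothetical stand-ins: a struct "actor" and a struct "agent" holding a copy
toyActor.weights = [0.1 0.2];
toyAgent.actor = toyActor;            % the agent stores a copy at construction

% "Training" step: the local variable changes, the agent's copy does not
toyActor.weights = toyActor.weights - 0.05;
staleBefore = ~isequal(toyAgent.actor.weights, toyActor.weights);

% Analogue of agent = setActor(agent,Actor): copy the updated actor back
toyAgent.actor = toyActor;
inSyncAfter = isequal(toyAgent.actor.weights, toyActor.weights);

% First flag: the agent was stale; second flag: it is in sync after the copy
disp([staleBefore inSyncAfter])
```

In the real loop, sim(agent, env) would use the agent's stale copy unless the updated networks are written back first, which would match the "same behavior every episode" symptom described in the question.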

Instead of a custom training loop, you can write a custom agent (subclass) to work with a Simulink environment. In this example, we show how to convert a custom training loop into a custom agent. The benefits:

You don't recompile the Simulink environment on every run (which is your current behavior).
You can use the train() function and get reward reporting by default.
You can also use the debugger inside your custom agent during training.
With your approach, you only update after an episode (or several) has finished. That is not the case for a custom agent: you can call learn() whenever you want.