PPO training stopped learning

Lloyd on 21 Aug 2024
Answered: Kaustab Pal on 22 Aug 2024
I am trying to train the rotary inverted pendulum environment using a PPO agent. It's working... but it's reaching a limit and not learning past it, and I am not sure why. Newbie to RL here, so go easy on me :). I think it's something to do with the yellow line, Q0. It could also be stuck in a local optimum, but I don't think that's the problem. I think the problem is that Q0 is not getting past 100 and the agent isn't able to extract more useful information. Hopefully, someone with a little more experience has something to say!
mdl = "rlQubeServoModel";
open_system(mdl)
theta_limit = 5*pi/8;
dtheta_limit = 30;
volt_limit = 12;
Ts = 0.005;
rng(22)
obsInfo = rlNumericSpec([7 1]);
actInfo = rlNumericSpec([1 1],UpperLimit=1,LowerLimit=-1);
agentBlk = mdl + "/RL Agent";
simEnv = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);
numObs = prod(obsInfo.Dimension);
criticLayerSizes = [400 300];
actorLayerSizes = [400 300];
% critic:
criticNetwork = [
    featureInputLayer(numObs)
    fullyConnectedLayer(criticLayerSizes(1), ...
        Weights=sqrt(2/numObs)* ...
            (rand(criticLayerSizes(1),numObs)-0.5), ...
        Bias=1e-3*ones(criticLayerSizes(1),1))
    reluLayer
    fullyConnectedLayer(criticLayerSizes(2), ...
        Weights=sqrt(2/criticLayerSizes(1))* ...
            (rand(criticLayerSizes(2),criticLayerSizes(1))-0.5), ...
        Bias=1e-3*ones(criticLayerSizes(2),1))
    reluLayer
    fullyConnectedLayer(1, ...
        Weights=sqrt(2/criticLayerSizes(2))* ...
            (rand(1,criticLayerSizes(2))-0.5), ...
        Bias=1e-3)
    ];
criticNetwork = dlnetwork(criticNetwork);
summary(criticNetwork)
critic = rlValueFunction(criticNetwork,obsInfo);
% actor:
% Input path layers
inPath = [
    featureInputLayer( ...
        prod(obsInfo.Dimension), ...
        Name="netOin")
    fullyConnectedLayer( ...
        prod(actInfo.Dimension), ...
        Name="infc")
    ];
% Path layers for mean value
meanPath = [
    tanhLayer(Name="tanhMean")
    fullyConnectedLayer(prod(actInfo.Dimension))
    scalingLayer(Name="scale", ...
        Scale=actInfo.UpperLimit)
    ];
% Path layers for standard deviations
% Using softplus layer to make them non-negative
sdevPath = [
    tanhLayer(Name="tanhStdv")
    fullyConnectedLayer(prod(actInfo.Dimension))
    softplusLayer(Name="splus")
    ];
net = dlnetwork();
net = addLayers(net,inPath);
net = addLayers(net,meanPath);
net = addLayers(net,sdevPath);
net = connectLayers(net,"infc","tanhMean/in");
net = connectLayers(net,"infc","tanhStdv/in");
plot(net)
net = initialize(net);
summary(net)
actor = rlContinuousGaussianActor(net, obsInfo, actInfo, ...
    ActionMeanOutputNames="scale",...
    ActionStandardDeviationOutputNames="splus",...
    ObservationInputNames="netOin");
actorOpts = rlOptimizerOptions(LearnRate=1e-4);
criticOpts = rlOptimizerOptions(LearnRate=1e-4);
agentOpts = rlPPOAgentOptions(...
    ExperienceHorizon=600,...
    ClipFactor=0.02,...
    EntropyLossWeight=0.01,...
    ActorOptimizerOptions=actorOpts,...
    CriticOptimizerOptions=criticOpts,...
    NumEpoch=3,...
    AdvantageEstimateMethod="gae",...
    GAEFactor=0.95,...
    SampleTime=0.1,...
    DiscountFactor=0.997);
agent = rlPPOAgent(actor,critic,agentOpts);
trainOpts = rlTrainingOptions(...
    MaxEpisodes=20000,...
    MaxStepsPerEpisode=600,...
    Plots="training-progress",...
    StopTrainingCriteria="AverageReward",...
    StopTrainingValue=430,...
    ScoreAveragingWindowLength=100);
trainingStats = train(agent, simEnv, trainOpts);
Thanks in advance!

Answers (2)

arushi on 22 Aug 2024
Edited: arushi on 22 Aug 2024
Hi Lloyd,
Here are some potential reasons why your training might be hitting a plateau and not improving further:
Q0 and Learning Plateau:
  • In the training progress plot, Q0 is the critic's estimate of the discounted long-term reward at the start of each episode. If it stops progressing past a certain point, that may be due to insufficient exploration or suboptimal hyperparameters.
Exploration vs. Exploitation:
  • Ensure your agent is exploring adequately. The entropy loss weight (EntropyLossWeight) in PPO helps encourage exploration by adding randomness to the policy. You might try increasing this value slightly to see if it helps the agent explore more diverse actions.
Learning Rates:
  • The learning rates for both the actor and critic (LearnRate=1e-4) might be too low or too high. Experiment with different learning rates, such as 1e-3 or 5e-5, to see if the agent's performance improves.
Clip Factor:
  • The clip factor (ClipFactor=0.02) controls how much the policy is allowed to change at each update. If it's too restrictive, the agent might not learn effectively. Try increasing it to 0.1 or 0.2; a hedged sketch combining these hyperparameter tweaks follows this list.
Reward Function:
  • Ensure your reward function is well-designed and provides sufficient feedback for the agent to learn effectively. If the reward is sparse or doesn't align well with the task objectives, the agent may struggle to learn.
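As a rough illustration only (not a tested configuration), here is a minimal sketch of how those tweaks map onto the options from your script; the specific values are just starting points to experiment with:
% Hedged sketch: higher learning rates, a larger clip factor, and a larger
% entropy loss weight than the original script, to allow bigger policy updates
% and more exploration. The values below are assumptions, not recommendations.
actorOpts  = rlOptimizerOptions(LearnRate=1e-3);   % was 1e-4; also try 5e-5
criticOpts = rlOptimizerOptions(LearnRate=1e-3);   % was 1e-4
agentOpts = rlPPOAgentOptions(...
    ExperienceHorizon=600,...
    ClipFactor=0.1,...             % was 0.02; allows less restrictive updates
    EntropyLossWeight=0.02,...     % was 0.01; encourages more exploration
    ActorOptimizerOptions=actorOpts,...
    CriticOptimizerOptions=criticOpts,...
    NumEpoch=3,...
    AdvantageEstimateMethod="gae",...
    GAEFactor=0.95,...
    SampleTime=0.1,...
    DiscountFactor=0.997);
agent = rlPPOAgent(actor,critic,agentOpts);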
Hope this helps.

Kaustab Pal on 22 Aug 2024
Hi @Lloyd,
The yellow line, Q0, in the plot represents the estimate of the discounted long-term reward at the start of each episode, based on the initial observation of the environment. Ideally, as training progresses and if the critic is well-designed and learning effectively, the average Q0 should converge towards the actual discounted long-term reward (depicted by the dark-blue line).
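If you want to inspect this estimate yourself, here is a minimal sketch (assuming you extract the critic from the trained agent and use a zero vector as a stand-in for your model's actual initial observation):
trainedCritic = getCritic(agent);              % critic learned during training
obs0 = zeros(7,1);                             % hypothetical initial observation (7 elements, matching obsInfo)
q0Estimate = getValue(trainedCritic,{obs0})    % critic's estimate of the discounted return, i.e. Q0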
In your case, it seems that around episode 2000, Q0 ceases to improve, indicating that the critic may have stopped learning. This is a common challenge in reinforcement learning. Here are a few suggestions to address this:
  1. Reward function: Ensure that your reward function effectively guides the agent towards the desired behavior. Consider normalizing the rewards before training your agent.
  2. Hyperparameter tuning: Experiment with different values for hyperparameters such as the learning rate, clip factor, and entropy loss weight.
  3. Critic capacity: You might want to add more layers to your critic network to enhance its capacity to learn complex information, as sketched below. However, be cautious of overfitting when adding too many layers.
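As an illustration only (the layer sizes are arbitrary and default weight initialization is used instead of the manual initialization in your original script), a deeper critic could look like:
% Hedged sketch of a deeper critic network; the sizes are assumptions to tune.
criticLayerSizes = [400 300 200];
criticNetwork = [
    featureInputLayer(numObs)
    fullyConnectedLayer(criticLayerSizes(1))
    reluLayer
    fullyConnectedLayer(criticLayerSizes(2))
    reluLayer
    fullyConnectedLayer(criticLayerSizes(3))
    reluLayer
    fullyConnectedLayer(1)
    ];
criticNetwork = dlnetwork(criticNetwork);
critic = rlValueFunction(criticNetwork,obsInfo);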
For more information, you can refer to the Reinforcement Learning Toolbox documentation.
Hope this is helpful.
