DDPG Agent (used to set a temperature) 41% faster training time per Episode with Warm-up than without. Why?
Hi,
So I noticed something while training my DDPG Agent.
I use a DDPG Agent to set a temperature for a heating system depending on the weather forecast and other temperatures such as the outside temperature.
First I trained an agent without any warm-up, and then I trained a new agent with a warm-up of 700 episodes. The warm-up did what I had hoped: the agent converged faster and found a much better strategy than without it. I also noticed that training was much faster overall; I calculated that training one episode takes 41% less time than training one episode without a warm-up.
Don't get me wrong, I really appreciate this, but I am trying to understand why.
I have not changed any of the agent options, just the warm-up.
If the agent were supposed to win a game as quickly as possible, I would understand it: thanks to the experience gathered during the warm-up, the agent would find a better strategy sooner, win the game faster, and therefore need less time per episode. But in my case the agent just has to set a temperature, and there is no faster way to set a temperature.
Am I missing an important point?
I mean, in every training step and every episode the process is more or less the same: take an action, get a reward, update the networks, update the policy, and so on. Where in those steps could the 41% time improvement come from?
Just to be clear, I understand why it converges faster; I just don't understand why the training time per episode is so much lower. Without a warm-up, the average training time per episode was 28.1 seconds; with a warm-up it was 16.5 seconds.
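For what it's worth, the 41% comes straight from those two averages, and I could try to narrow down where the time goes by profiling the training call. A minimal sketch, where 'env' and 'trainOpts' stand in for my environment and training options:
% Sanity check of the reported speed-up
tNoWarmup = 28.1;                              % average seconds per episode without warm-up
tWarmup   = 16.5;                              % average seconds per episode with warm-up
speedup   = (tNoWarmup - tWarmup)/tNoWarmup    % about 0.41, i.e. 41% less time per episode
% Profile one training run to see which steps dominate
% ('env' and 'trainOpts' are placeholders for my actual setup)
profile on
trainStats = train(agent, env, trainOpts);
profile viewer                                 % compare time spent in gradient updates vs. environment steps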
These are my agent options, which I used for both agents:
agent.AgentOptions.TargetSmoothFactor = 1e-3;
agent.AgentOptions.DiscountFactor = 1.0;
agent.AgentOptions.MiniBatchSize = 128;
agent.AgentOptions.ExperienceBufferLength = 1e6;
agent.AgentOptions.NoiseOptions.Variance = 0.5;
agent.AgentOptions.NoiseOptions.VarianceDecayRate = 1e-6;
agentOptions.ResetExperienceBufferBeforeTraining = false;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-03;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-04;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
I also use the Reinforcement Learning Toolbox and normalised all my variables in both cases.
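By "normalised" I just mean scaling each signal into a comparable range before it reaches the agent; a minimal sketch with made-up temperature limits:
% Min-max normalisation of one observation signal (the limits are made up)
Tmin = -20;                                      % lowest expected outside temperature in degC
Tmax =  40;                                      % highest expected outside temperature in degC
normalise = @(T) 2*(T - Tmin)/(Tmax - Tmin) - 1; % maps [Tmin, Tmax] onto [-1, 1]
obs = normalise(12.5);                           % e.g. 12.5 degC becomes about 0.08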
In general, everything works fine, but it drives me crazy that I can't understand why it's so much faster.
Maybe someone has an idea.
Accepted Answer
Venu on 13 Jan 2024
Based on the information you have provided, I can infer the following points:
- With warm-up experiences, the agent may explore the state and action space more efficiently.
- The learning rates for your critic and actor networks are set for small updates. With a good initial experience buffer, the updates may be more stable and require fewer adjustments, leading to faster convergence and less time spent on each gradient update step.
- You mentioned that 'agentOptions.ResetExperienceBufferBeforeTraining' is set to 'false'. If the buffer is not reset, the agent with warm-up starts with a full buffer of experiences, which can lead to more efficient sampling and less time waiting for the buffer to fill up (see the sketch below).
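To make the third point concrete, here is a minimal sketch of the two-stage workflow I mean, assuming an existing environment 'env' and DDPG agent 'agent'; the episode counts and step limits are placeholders, not your values:
% Stage 1: warm-up run whose main job is to fill the experience buffer.
% 'env' and 'agent' are assumed to exist already.
agent.AgentOptions.ResetExperienceBufferBeforeTraining = false;  % keep the buffer between train calls
warmupOpts = rlTrainingOptions('MaxEpisodes', 700, 'MaxStepsPerEpisode', 500);  % placeholder step limit
train(agent, env, warmupOpts);
% Stage 2: main training run. Because the buffer was not reset, it starts
% already filled, so mini-batch sampling can draw diverse experiences from
% episode 1 instead of first waiting for the buffer to fill.
mainOpts = rlTrainingOptions('MaxEpisodes', 3000, 'MaxStepsPerEpisode', 500);   % placeholder values
trainStats = train(agent, env, mainOpts);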