Reinforcement Learning Toolbox: DDPG Agent, Q0 diverging to very high values during training

I made a DDPG reinforcement learning agent to control a Simulink environment. It is similar to the water tank level example problem: the agent adjusts the process speed and receives rewards whenever an output parameter is inside a specified range, and a large negative reward if that parameter goes over a specified threshold.
I started with simple network architectures (around 15-20 neurons and one to three layers), then went all the way up to 100 neurons per layer and four to five layers in both the critic and actor networks. I also tried reducing the learning rate to 1e-4.
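For reference, the learning rate is passed through the representation options; roughly, the agent is assembled along these lines (the gradient threshold and the layer/variable names below are generic placeholders, not the exact ones from my model):

% Rough sketch of how the 1e-4 learning rate is applied to both
% representations. criticNet/actorNet are the network layer graphs and
% obsInfo/actInfo come from the Simulink environment (e.g. rlSimulinkEnv).
criticOpts = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1);
actorOpts  = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNet,obsInfo,actInfo, ...
    'Observation',{'observation'},'Action',{'action'},criticOpts);
actor  = rlDeterministicActorRepresentation(actorNet,obsInfo,actInfo, ...
    'Observation',{'observation'},'Action',{'actorOutput'},actorOpts);
agent = rlDDPGAgent(actor,critic);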
The process takes around 150-300 timesteps, and the reward is 1 point for each timestep that the output parameter is inside the specified range, so the maximum reward possible should be around 150-300, depending on the process speed.
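Conceptually, the reward signal coming out of the Simulink model amounts to something like the function below (the -100 penalty and the zero reward between the range and the threshold are placeholder values, not the exact ones I use):

function r = computeReward(y, yLow, yHigh, yMax)
% Sketch of the reward described above: +1 for every time step the output
% stays inside [yLow, yHigh], a large negative reward once it exceeds the
% hard threshold yMax. Penalty magnitude and the 0 case are placeholders.
if y > yMax
    r = -100;
elseif y >= yLow && y <= yHigh
    r = 1;
else
    r = 0;
end
end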
However, regardless of the chosen network architecture, Q0 diverges to very high values (around 10e8) in every training session and then flattens out, while the episode reward bounces around between -1000 and 150 (see attached figure). This pattern persists even after 80,000+ episodes (three days of training). I have read that Q0 and the episode reward should converge if everything is set up correctly, so something is definitely wrong.
The optimal process speed should follow some sort of S-shape to collect the most reward. However, whenever I stop the training, the agent predicts either a constant action value at every time step or no action at all, resulting in a linearly increasing or constant process speed that does poorly in terms of reward.
Any idea what I am doing wrong?
Thank you for your time!

Answers (1)

Emmanouil Tzorakoleftherakis
Hi Johan,
It makes sense that stopping the training leads to bad actions since the blown-up critic values probably don't lead to any significant learning. Could you share a repro example? It is hard to guess what's wrong here otherwise.
Also, have a look at this answer for some additional suggestions. My guess is that you are using too many layers/neurons for the critic.
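As a rough illustration only (not a verified fix), a critic on the smaller end could look something like the sketch below, assuming a single observation of dimension numObs and a single action of dimension numAct:

% Small critic sketch: two modest hidden layers on the state path, one on
% the action path, merged with an addition layer. Sizes are a starting
% point only; numObs and numAct are the observation/action dimensions.
statePath = [
    featureInputLayer(numObs,'Normalization','none','Name','state')
    fullyConnectedLayer(32,'Name','fcState1')
    reluLayer('Name','reluState')
    fullyConnectedLayer(32,'Name','fcState2')];
actionPath = [
    featureInputLayer(numAct,'Normalization','none','Name','action')
    fullyConnectedLayer(32,'Name','fcAction')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','reluCommon')
    fullyConnectedLayer(1,'Name','qValue')];

criticNet = layerGraph(statePath);
criticNet = addLayers(criticNet,actionPath);
criticNet = addLayers(criticNet,commonPath);
criticNet = connectLayers(criticNet,'fcState2','add/in1');
criticNet = connectLayers(criticNet,'fcAction','add/in2');
% then wrap criticNet in rlQValueRepresentation as you do currently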
