DDPG control for a non-linear plant: Q0 does not converge even after 5,000 episodes

Dear MATLAB,
Firstly, I must say that having RL in the MATLAB platform, with the capability to integrate with Simulink, is just so exciting for an ML engineer otherwise used to Python. I believe this will evolve and cause tremendous excitement in engineering organizations worldwide. I am so happy to be an early adopter.
I am comparing a PID control versus RL control for a non-linear valve model.
The water-tank control DDPG example MATLAB provides is a good starting point, and I used a similar strategy of moving the reference signal randomly (between 2 and 10) and the initial state of the flow randomly (again between 2 and 10); a sketch of such a reset function is shown below.
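For context, this is a minimal sketch of the randomized reset, modeled on the water-tank DDPG example; the block paths 'rlValveControl/Reference' and 'rlValveControl/Plant/Flow' are placeholders for the actual model:

env.ResetFcn = @(in) localResetFcn(in);

function in = localResetFcn(in)
    % Draw the reference flow uniformly in [2, 10]
    ref = 2 + 8*rand;
    in = setBlockParameter(in, 'rlValveControl/Reference', ...
        'Value', num2str(ref));

    % Draw the initial flow state uniformly in [2, 10]
    f0 = 2 + 8*rand;
    in = setBlockParameter(in, 'rlValveControl/Plant/Flow', ...
        'InitialCondition', num2str(f0));
end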
I expected that the Episode Manager plots would look similar, but Q0 does not converge. I've tried training for 5,000 episodes, and once up to 10,000 as well.
The critic and actor designs are similar to those in the water-tank example. The attached images show the PID model, the RL model, the plant, and the Episode Manager plots.
Do you have any suggestions, please?
I have gone through some similar posts here, including this one and Enrico Anderlini's suggestions.
I did try modifying the exploration parameters a bit (variance at 0.5 and decay at 1e-4), but that didn't seem to help much; a sketch of those settings is below.
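This is roughly how those exploration settings were applied (assuming the default Ornstein-Uhlenbeck noise of rlDDPGAgent; Ts, actor, and critic are defined elsewhere):

agentOpts = rlDDPGAgentOptions('SampleTime', Ts);
agentOpts.NoiseOptions.Variance          = 0.5;   % initial exploration variance
agentOpts.NoiseOptions.VarianceDecayRate = 1e-4;  % per-step decay toward zero
agent = rlDDPGAgent(actor, critic, agentOpts);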
Images attached:
1. PID Control
2. RL Control
2.b. Non-linear plant model
3. RL at 400 episodes
4. RL at 700 episodes

Answers (1)

Emmanouil Tzorakoleftherakis
Hi Rajesh,
It looks to me like this problem has converged. Ideally, the Q0 curve should eventually overlap with the average episode reward curve, but there is no specific timeframe for that. You could very well learn a decent policy before the critic/Q0 is fully trained.
In the plots you attached, it seems that the critic is sensitive to the initial conditions of the problem (since you are randomizing the ICs, you see this "noisy" Q0 curve). To improve the behavior of the critic, you could consider scaling the observations to small numbers (e.g. -1 to 1) so that the neural network weights don't blow up, or, for example, using tanh activations in the network instead.
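To illustrate both suggestions, here is a minimal sketch of the observation path of a critic network, assuming three observations in the 2-10 range as in the water-tank example (layer sizes and names are placeholders, and featureInputLayer requires R2020b or later):

obsPath = [
    featureInputLayer(3, 'Normalization', 'none', 'Name', 'observation')
    scalingLayer('Name', 'obsScale', 'Scale', 0.25, 'Bias', -1.5)  % map [2,10] -> [-1,1]
    fullyConnectedLayer(50, 'Name', 'fc1')
    tanhLayer('Name', 'tanh1')                                     % bounded activation
    fullyConnectedLayer(25, 'Name', 'fc2')];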
Hope this helps!
  1 Comment
Zonghao Zou on 19 Oct 2020
A question on convergence: is there a way to set the stopping condition so that Q0 roughly equals the average episode reward? In my case, I do not have any prior knowledge of what the stopping value should be.
Thanks
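For reference, a Q0-based stopping criterion is not among the built-in options of rlTrainingOptions; a common stand-in is to stop once the average episode reward reaches a threshold (the threshold of 800 and window length of 20 below are placeholder values):

trainOpts = rlTrainingOptions( ...
    'MaxEpisodes', 5000, ...
    'ScoreAveragingWindowLength', 20, ...     % episodes averaged for the stop check
    'StopTrainingCriteria', 'AverageReward', ...
    'StopTrainingValue', 800);                % stop when average reward exceeds this
trainingStats = train(agent, env, trainOpts);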

