DDPG control for a non-linear plant: Q0 does not converge even after 5,000 episodes

Dear MATLAB,
Firstly, I must say that having RL in the MATLAB platform, with the capability to integrate with Simulink, is just so exciting for an ML engineer otherwise used to Python. I believe this will evolve and cause tremendous excitement in engineering organizations worldwide. I am so happy to be an early adopter.
I am comparing a PID control versus RL control for a non-linear valve model.
The water-tank control DDPG example MATLAB provides is a good starting point, and I used a similar strategy of moving the reference signal randomly (between 2 and 10) and the initial state of the flow randomly (again between 2 and 10); a sketch of such a reset function is shown below.
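For context, this is a minimal sketch of the randomized reset, modeled on the water-tank DDPG example; the block paths 'rlValveControl/Reference' and 'rlValveControl/Plant/Flow' are placeholders for the actual model:

env.ResetFcn = @(in) localResetFcn(in);

function in = localResetFcn(in)
    % Draw the reference flow uniformly in [2, 10]
    ref = 2 + 8*rand;
    in = setBlockParameter(in, 'rlValveControl/Reference', ...
        'Value', num2str(ref));

    % Draw the initial flow state uniformly in [2, 10]
    f0 = 2 + 8*rand;
    in = setBlockParameter(in, 'rlValveControl/Plant/Flow', ...
        'InitialCondition', num2str(f0));
end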
I expected that the Episode Manager plots would look similar, but Q0 does not converge. I've tried training for 5,000 episodes, and once up to 10,000 as well.
The critic and actor designs are similar to those in the water-tank example. The attached images show the PID model, the RL model, the plant, and the Episode Manager plots.
Do you have any suggestions, please?
I have gone through some similar posts here, including this one and Enrico Anderlini's suggestions.
I did try modifying the exploration parameters a bit (variance at 0.5 and decay at 1e-4), but that didn't seem to help much; a sketch of those settings is below.
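This is roughly how those exploration settings were applied (assuming the default Ornstein-Uhlenbeck noise of rlDDPGAgent; Ts, actor, and critic are defined elsewhere):

agentOpts = rlDDPGAgentOptions('SampleTime', Ts);
agentOpts.NoiseOptions.Variance          = 0.5;   % initial exploration variance
agentOpts.NoiseOptions.VarianceDecayRate = 1e-4;  % per-step decay toward zero
agent = rlDDPGAgent(actor, critic, agentOpts);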
Images attached:
1. PID Control
2. RL Control
2.b. Non-linear plant model
3. RL at 400 episodes
4. RL at 700 episodes

Answers (1)

Emmanouil Tzorakoleftherakis
Hi Rajesh,
It looks to me like this problem has converged. Ideally, the Q0 curve should eventually overlap with the average episode reward curve, but there is no specific timeframe for that. You could very well learn a decent policy before the critic/Q0 is fully trained.
In the plots you attached, it seems that the critic is sensitive to the initial conditions of the problem (since you are randomizing the ICs, you see this "noisy" Q0 curve). To improve the behavior of the critic, you could consider scaling the observations to small numbers (e.g. -1 to 1) so that the neural network weights don't blow up, or, for example, using tanh activations in the network instead.
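To illustrate both suggestions, here is a minimal sketch of the observation path of a critic network, assuming three observations in the 2-10 range as in the water-tank example (layer sizes and names are placeholders, and featureInputLayer requires R2020b or later):

obsPath = [
    featureInputLayer(3, 'Normalization', 'none', 'Name', 'observation')
    scalingLayer('Name', 'obsScale', 'Scale', 0.25, 'Bias', -1.5)  % map [2,10] -> [-1,1]
    fullyConnectedLayer(50, 'Name', 'fc1')
    tanhLayer('Name', 'tanh1')                                     % bounded activation
    fullyConnectedLayer(25, 'Name', 'fc2')];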
Hope this helps!
  1 Comment
Zonghao Zou on 19 Oct 2020
A question on convergence: is there a way to set the stopping condition so that Q0 roughly equals the average episode reward? In my case, I do not have any prior knowledge of what the stopping value should be.
Thanks
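For reference, a Q0-based stopping criterion is not among the built-in options of rlTrainingOptions; a common stand-in is to stop once the average episode reward reaches a threshold (the threshold of 800 and window length of 20 below are placeholder values):

trainOpts = rlTrainingOptions( ...
    'MaxEpisodes', 5000, ...
    'ScoreAveragingWindowLength', 20, ...     % episodes averaged for the stop check
    'StopTrainingCriteria', 'AverageReward', ...
    'StopTrainingValue', 800);                % stop when average reward exceeds this
trainingStats = train(agent, env, trainOpts);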

