Why DQN training always fails to converge to the optimal value

The following situation keeps occurring:
I am running a DQN agent in a Simulink environment: a model with three observation values and two action outputs. The reward is the average power generated by the system over 10 seconds, so the sample time is 10 seconds. I wrote an S-function that computes this average power every 10 seconds and feeds it into the reward.
This is the main part of the S-function that calculates the reward:
function Update(block)
P    = block.InputPort(1).Data;   % instantaneous power
Psum = block.InputPort(2).Data;   % accumulated power so far
time = block.InputPort(3).Data;   % simulation time
flag = block.InputPort(4).Data;   % index of the current sampling window

if time == 0
    Psum = 0;                     % reset the accumulator at the start of the run
else
    Psum = Psum + P;              % instantaneous power accumulation
end

Ts = 10;                          % reward sampling period in seconds
if time < Ts
    flag = 1;                     % still inside the first window
end

if time >= Ts*flag
    Pavg = Psum / Ts;             % average over the completed 10 s window
    Psum = 0;                     % start accumulating the next window
    flag = flag + 1;              % mark that the next window has begun
else
    Pavg = Psum / Ts;             % partial average while the window is still filling
end

D = [Pavg; Psum; flag];           % Pavg feeds the reward; Psum and flag are carried state
block.Dwork(1).Data = D;
Why does the RL training ultimately converge to a non-optimal solution?
Where should I start in order to fix this?

Answers (2)

Shubh on 28 Dec 2023
Hi,
I understand that you are having some issues with the convergence of your RL algorithm.
Here are a few steps and considerations that could help you identify and solve the issue:
  1. Reward Function: Ensure that the reward function accurately represents the objective you wish to achieve. It should provide positive reinforcement for desirable behavior and negative reinforcement for undesirable behavior. Since you're using the average power generated as a reward, make sure that it aligns with the system's performance you want to optimize.
  2. Exploration vs. Exploitation: DQNs typically use an epsilon-greedy policy to balance exploration and exploitation. If your model is converging to a non-optimal solution, it might not be exploring enough. Consider adjusting the parameters that control exploration, such as the epsilon decay rate or the initial value of epsilon (see the option sketch after this list).
  3. Learning Rate: The learning rate is crucial in DQN. If it's too high, the model may oscillate or diverge; if it's too low, it may converge too slowly or get stuck in a local minimum. You mentioned using a learning rate of 0.0003, which may or may not be suitable depending on your specific environment and model architecture.
  4. Neural Network Architecture: The architecture of the neural network in DQN can greatly affect performance. Make sure the network is complex enough to learn the policy but not so complex that it overfits or fails to generalize.
  5. Discount Factor: The discount factor determines the importance of future rewards. A value too close to 0 will make the agent short-sighted, while a value too high can cause the agent to overestimate the importance of distant rewards.
  6. Training Duration and Stopping Criteria: It’s possible that the training hasn't converged yet, and you might need to train for more episodes. Also, the criteria you use to stop training can influence the solution to which your DQN converges.
  7. Stability and Variance: DQN can have high variance in performance. Techniques like experience replay and target networks are designed to stabilize training. Make sure these are implemented correctly.
  8. Sampling Time: Since your reward is calculated over a period of 10 seconds, ensure that this window is appropriate for capturing the dynamics of the environment and the actions taken by the agent.
  9. Hyperparameter Tuning: Consider running a systematic hyperparameter optimization to find the best set of parameters for your specific problem.
  10. Debugging the S-Function: Ensure that the S-function for reward calculation is working as intended. You could log the values of Pavg, Psum, and flag at each step to confirm they are being calculated and updated correctly (a minimal logging sketch also follows this list).
  11. Inspect Training Progress: Based on the images you've provided, it seems there is a lot of variance in the episode reward. This could indicate that the agent has not learned a stable policy. Inspect the training progress and consider smoothing the reward signal if necessary.
  12. Algorithm Suitability: Finally, consider whether DQN is the right choice for your problem. In some cases, other reinforcement learning algorithms might be more suitable.
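To make points 2, 3, 5 and 6 concrete, here is a minimal sketch of where those knobs live in the Reinforcement Learning Toolbox API (option names as of roughly R2020b; the numeric values are illustrative placeholders, not tuned recommendations):

% Critic representation options: this is where the learning rate is set
criticOpts = rlRepresentationOptions( ...
    'LearnRate', 3e-4, ...              % the 3e-4 mentioned above; consider sweeping 1e-4 to 1e-2
    'GradientThreshold', 1);            % gradient clipping can help stability

% Agent options: discount factor, experience replay, target network
agentOpts = rlDQNAgentOptions( ...
    'SampleTime', 10, ...               % matches the 10 s reward window
    'DiscountFactor', 0.99, ...         % closer to 1 = more far-sighted agent
    'ExperienceBufferLength', 1e5, ...  % experience replay memory
    'MiniBatchSize', 64, ...
    'TargetSmoothFactor', 1e-3, ...     % soft target-network update
    'TargetUpdateFrequency', 1);

% Epsilon-greedy exploration schedule
agentOpts.EpsilonGreedyExploration.Epsilon      = 1.0;    % start fully exploratory
agentOpts.EpsilonGreedyExploration.EpsilonMin   = 0.05;   % keep some exploration forever
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-4;   % smaller value = slower decay, more exploration

% agent = rlDQNAgent(critic, agentOpts);   % critic built from your network with criticOpts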
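For point 10, a low-effort check (sketched below, assuming the Update method shown in the question) is to print the intermediate values so you can verify that the 10 s windows line up with what the reward port actually receives:

% Diagnostic line that can be added at the end of the Update method
fprintf('t = %6.1f   Pavg = %8.3f   Psum = %8.3f   flag = %d\n', ...
        time, Pavg, Psum, flag);

% Alternatively, route Pavg to an extra outport and attach a Scope or
% "To Workspace" block so the reward seen by the agent can be plotted
% against the raw power signal.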
To start solving the problem, you might want to first ensure your reward function and S-function are implemented correctly and are providing the agent with the right incentives. Then proceed with a careful hyperparameter tuning and ensure that the neural network architecture is appropriate for the task at hand. If the problem persists, consider using techniques like reward shaping, or switch to a different reinforcement learning algorithm that might be better suited to your environment.
Hope this helps!

Emmanouil Tzorakoleftherakis
What I am seeing here is that the average reward tends to converge to the Q0 profile, which is the expected behavior of a converging DQN agent. If the trained agent does not lead to the desired behavior, the first thing I would do is modify the reward signal. For example, I am a little sceptical when I see reward values that blow up like the ones above. Maybe try to normalize the observations and the reward as well to facilitate training (a small sketch follows).
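A minimal sketch of the kind of scaling this suggests, assuming a hypothetical nominal power Pnom that bounds the expected average power (Pnom and the observation ranges are placeholders, not values from the original model):

Pnom = 5e3;                    % hypothetical nominal power of the system, in W
reward = Pavg / Pnom;          % reward now on the order of 0..1 instead of raw watts

% Observations can be scaled the same way using known physical ranges:
% obsNorm = (obs - obsMin) ./ (obsMax - obsMin);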

Release

R2020b
