Why does my Custom TD3 not learn like the built-in TD3 agent?

Vincent on 18 Aug 2025 at 15:16
Edited: Vincent on 23 Aug 2025 at 13:26
I have tried to code my custom TD3 agent to behave as much like the built-in TD3 agent as possible in the same Simulink environment. The only difference between the two setups is that, for the custom agent, I had to use a Rate Transition block to apply a zero-order hold to the states, rewards, and done signal feeding the custom agent. I used the Rate Transition block's "Specify" option for the output port sample time to set the custom agent's sample time.
My code for the custom TD3 agent is below. I tried to make it as close to the built-in TD3 as possible; the ep_counter and num_of_ep properties are unused.
classdef test_TD3Agent_V2 < rl.agent.CustomAgent
properties
%neural networks
actor
critic1
critic2
%target networks
target_actor
target_critic1
target_critic2
%dimensions
statesize
actionsize
%optimizers
actor_optimizer
critic1_optimizer
critic2_optimizer
%buffer
statebuffer
nextstatebuffer
actionbuffer
rewardbuffer
donebuffer
counter %keeps count of number experiences encountered
index %keeps track of current available index in buffer
buffersize
batchsize
%episodes
num_of_ep
ep_counter
%keep count of critic number of updates
num_critic_update
end
methods
%constructor
function obj = test_TD3Agent_V2(actor,critic1,critic2,target_actor,target_critic1,target_critic2,actor_opt,critic1_opt,critic2_opt,statesize,actionsize,buffer_size,batchsize,num_of_ep)
%(required) call abstract class constructor
obj = obj@rl.agent.CustomAgent();
%define observation + action space
obj.ObservationInfo = rlNumericSpec([statesize 1]);
obj.ActionInfo = rlNumericSpec([actionsize 1],LowerLimit = -1,UpperLimit = 1);
obj.SampleTime = -1; %determined by rate transition block
%define networks
obj.actor = actor;
obj.critic1 = critic1;
obj.critic2 = critic2;
%define target networks
obj.target_actor = target_actor;
obj.target_critic1 = target_critic1;
obj.target_critic2 = target_critic2;
%define optimizer
obj.actor_optimizer = actor_opt;
obj.critic1_optimizer = critic1_opt;
obj.critic2_optimizer = critic2_opt;
%record dimensions
obj.statesize = statesize;
obj.actionsize = actionsize;
%initialize buffer
obj.statebuffer = dlarray(zeros(statesize,1,buffer_size));
obj.nextstatebuffer = dlarray(zeros(statesize,1,buffer_size));
obj.actionbuffer = dlarray(zeros(actionsize,1,buffer_size));
obj.rewardbuffer = dlarray(zeros(1,buffer_size));
obj.donebuffer = zeros(1,buffer_size);
obj.buffersize = buffer_size;
obj.batchsize = batchsize;
obj.counter = 0;
obj.index = 1;
%episodes (unused)
obj.num_of_ep = num_of_ep;
obj.ep_counter = 1;
%used for delay actor update and target network soft transfer
obj.num_critic_update = 0;
end
end
methods (Access = protected)
%Action method
function action = getActionImpl(obj,Observation)
% Given the current state of the system, return an action
action = getAction(obj.actor,Observation);
end
%Action with noise method
function action = getActionWithExplorationImpl(obj,Observation)
% Given the current observation, select an action
action = getAction(obj.actor,Observation);
% Add random noise to action
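% NOTE: no noise is actually added here, so this method returns the
% deterministic action with no exploration (see the answer below; the
% missing exploration turned out to be why the agent was not learning)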
end
%Learn method
function action = learnImpl(obj,Experience)
%parse experience
state = Experience{1};
action_ = Experience{2};
reward = Experience{3};
next_state = Experience{4};
isdone = Experience{5};
%buffer operations
%check if index wraps around
if (obj.index > obj.buffersize)
obj.index = 1;
end
%record experience in buffer
obj.statebuffer(:,:,obj.index) = state{1};
obj.actionbuffer(:,:,obj.index) = action_{1};
obj.rewardbuffer(:,obj.index) = reward;
obj.nextstatebuffer(:,:,obj.index) = next_state{1};
obj.donebuffer(:,obj.index) = isdone;
%increment index and counter
obj.counter = obj.counter + 1;
obj.index = obj.index + 1;
%if non terminal state
if (isdone == false)
action = getAction(obj.actor,next_state); %select next action
noise = randn([6,1]).*0.1; %gaussian noise with standard dev of 0.1
action{1} = action{1} + noise; %add noise
action{1} = clip(action{1},-1,1); %clip action noise
else
%learning at the end of episode
if (obj.counter >= obj.batchsize)
max_index = min([obj.counter obj.buffersize]); %range of index 1 to max_index for buffer
%sample experience randomly from buffer
sample_index_vector = randsample(max_index,obj.batchsize); %vector of index experience to sample
%create buffer mini batch dlarrays
state_batch = dlarray(zeros(obj.statesize,1,obj.batchsize));
nextstate_batch = dlarray(zeros(obj.statesize,1,obj.batchsize));
action_batch = dlarray(zeros(obj.actionsize,1,obj.batchsize));
reward_batch = dlarray(zeros(1,obj.batchsize));
done_batch = zeros(1,obj.batchsize);
for i = 1:obj.batchsize %iterate through buffer and transfer experience over to mini batch
state_batch(:,:,i) = obj.statebuffer(:,:,sample_index_vector(i));
nextstate_batch(:,:,i) = obj.nextstatebuffer(:,:,sample_index_vector(i));
action_batch(:,:,i) = obj.actionbuffer(:,:,sample_index_vector(i));
reward_batch(:,i) = obj.rewardbuffer(:,sample_index_vector(i));
done_batch(:,i) = obj.donebuffer(:,sample_index_vector(i));
end
%update critic networks
criticgrad1 = dlfeval(@critic_gradient,obj.critic1,obj.target_actor,obj.target_critic1,obj.target_critic2,{state_batch},{nextstate_batch},{action_batch},reward_batch,done_batch,obj.batchsize);
[obj.critic1,obj.critic1_optimizer] = update(obj.critic1_optimizer,obj.critic1,criticgrad1);
criticgrad2 = dlfeval(@critic_gradient,obj.critic2,obj.target_actor,obj.target_critic1,obj.target_critic2,{state_batch},{nextstate_batch},{action_batch},reward_batch,done_batch,obj.batchsize);
[obj.critic2,obj.critic2_optimizer] = update(obj.critic2_optimizer,obj.critic2,criticgrad2);
%update num of critic updates
obj.num_critic_update = obj.num_critic_update + 1;
%delayed actor update + target network transfer
if (mod(obj.num_critic_update,2) == 0)
actorgrad = dlfeval(@actor_gradient,obj.actor,obj.critic1,obj.critic2,{state_batch});
[obj.actor,obj.actor_optimizer] = update(obj.actor_optimizer,obj.actor,actorgrad);
target_soft_transfer(obj);
end
end
end
end
%function used to soft transfer over to target networks
function target_soft_transfer(obj)
smooth_factor = 0.005;
for i = 1:6
obj.target_actor.Learnables{i} = smooth_factor*obj.actor.Learnables{i} + (1 - smooth_factor)*obj.target_actor.Learnables{i};
obj.target_critic1.Learnables{i} = smooth_factor*obj.critic1.Learnables{i} + (1 - smooth_factor)*obj.target_critic1.Learnables{i};
obj.target_critic2.Learnables{i} = smooth_factor*obj.critic2.Learnables{i} + (1 - smooth_factor)*obj.target_critic2.Learnables{i};
end
end
end
end
%obtain gradient of Q value wrt actor
function actorgradient = actor_gradient(actorNet,critic1,critic2,states,batchsize)
actoraction = getAction(actorNet,states); %obtain actor action
%obtain Q values
Q1 = getValue(critic1,states,actoraction);
Q2 = getValue(critic2,states,actoraction);
%obtain min Q values + reverse sign for gradient ascent
Qmin = min(Q1,Q2);
Q = -1*mean(Qmin);
gradient = dlgradient(Q,actorNet.Learnables); %calculate gradient of Q value wrt NN learnables
actorgradient = gradient;
end
%obtain gradient of critic NN
function criticgradient = critic_gradient(critic,target_actor,target_critic_1,target_critic_2,states,nextstates,actions,rewards,dones,batchsize)
%obtain target action
target_actions = getAction(target_actor,nextstates);
%target policy smoothing
for i = 1:batchsize
target_noise = randn([6,1]).*sqrt(0.2);
target_noise = clip(target_noise,-0.5,0.5);
target_actions{1}(:,:,i) = target_actions{1}(:,:,i) + target_noise; %add noise to action for smoothing
end
target_actions{1}(:,:,:) = clip(target_actions{1}(:,:,:),-1,1); %clip btw -1 and 1
%obtain Q values
Qtarget1 = getValue(target_critic_1,nextstates,target_actions);
Qtarget2 = getValue(target_critic_2,nextstates,target_actions);
Qmin = min(Qtarget1,Qtarget2);
Qoptimal = rewards + 0.99*Qmin.*(1 - dones);
Qpred = getValue(critic,states,actions);
%obtain critic loss
criticLoss = 0.5*mean((Qoptimal - Qpred).^2);
criticgradient = dlgradient(criticLoss,critic.Learnables);
end
And here is my code for training the built-in TD3 agent:
clc
%define times
dt = 0.1; %time steps
Tf = 7; %simulation time
%create stateInfo and actionInfo objects
statesize = 38;
actionsize = 6;
stateInfo = rlNumericSpec([statesize 1]);
actionInfo = rlNumericSpec([actionsize 1],LowerLimit = -1,UpperLimit = 1);
mdl = 'KUKA_EE_Controller_v18_disturbed';
blk = 'KUKA_EE_Controller_v18_disturbed/RL Agent';
%create environment object
env = rlSimulinkEnv(mdl,blk,stateInfo,actionInfo);
%assign reset function
env.ResetFcn = @ResetFunction;
% %create actor network
actorlayers = [
featureInputLayer(statesize)
fullyConnectedLayer(800)
reluLayer
fullyConnectedLayer(600)
reluLayer
fullyConnectedLayer(actionsize)
tanhLayer
];
actorNet = dlnetwork;
actorNet = addLayers(actorNet, actorlayers);
actorNet = initialize(actorNet);
actor = rlContinuousDeterministicActor(actorNet, stateInfo, actionInfo);
%create critic networks
statelayers = [
featureInputLayer(statesize, Name='states')
concatenationLayer(1, 2, Name='concat')
fullyConnectedLayer(400)
reluLayer
fullyConnectedLayer(400)
reluLayer
fullyConnectedLayer(1, Name='Qvalue')
];
actionlayers = featureInputLayer(actionsize, Name='actions');
criticNet = dlnetwork;
criticNet = addLayers(criticNet, statelayers);
criticNet = addLayers(criticNet, actionlayers);
criticNet = connectLayers(criticNet, 'actions', 'concat/in2');
criticNet = initialize(criticNet);
critic1 = rlQValueFunction(criticNet,stateInfo,actionInfo,ObservationInputNames='states',ActionInputNames='actions');
criticNet2 = dlnetwork;
criticNet2 = addLayers(criticNet2, statelayers);
criticNet2 = addLayers(criticNet2, actionlayers);
criticNet2 = connectLayers(criticNet2, 'actions', 'concat/in2');
criticNet2 = initialize(criticNet2);
critic2 = rlQValueFunction(criticNet2,stateInfo,actionInfo,ObservationInputNames='states',ActionInputNames='actions');
%create options object for actor and critic
actoroptions = rlOptimizerOptions(Optimizer='adam',LearnRate=0.001);
criticoptions = rlOptimizerOptions(Optimizer='adam',LearnRate=0.003);
agentoptions = rlTD3AgentOptions;
agentoptions.SampleTime = dt;
agentoptions.ActorOptimizerOptions = actoroptions;
agentoptions.CriticOptimizerOptions = criticoptions;
agentoptions.DiscountFactor = 0.99;
agentoptions.TargetSmoothFactor = 0.005;
agentoptions.ExperienceBufferLength = 1000000;
agentoptions.MiniBatchSize = 250;
agentoptions.ExplorationModel.StandardDeviation = 0.1;
agentoptions.ExplorationModel.StandardDeviationDecayRate = 1e-4;
agent = rlTD3Agent(actor, [critic1 critic2], agentoptions);
%create training options object
trainOpts = rlTrainingOptions(MaxEpisodes=20,MaxStepsPerEpisode=floor(Tf/dt),StopTrainingCriteria='none',SimulationStorageType='none');
%train agent
trainresults = train(agent,env,trainOpts);
I built my custom TD3 agent with the same actor and critic structures, the same hyperparameters, and the same agent options, but it doesn't seem to learn and I don't know why. I don't know if the Rate Transition block is having a negative impact on the training. One difference between my custom TD3 and the built-in TD3 is the actor gradient: the MATLAB documentation on TD3 agents says the gradient is calculated for every sample in the mini-batch, then accumulated and averaged.
In my actor_gradient function above, however, I average the Q values over the mini-batch first and then perform a single gradient operation, so maybe that is one possible reason my custom TD3 agent isn't learning. Here are my training reward plots:
[Training reward plot: built-in TD3]
[Training reward plot: custom TD3, stopped early because it wasn't learning]
I would appreciate any help because I have been stuck for months.
  3 Comments
Vincent on 21 Aug 2025 at 0:14
@Emmanouil Tzorakoleftherakis Sorry to spam, but I would really appreciate an expert's advice. Regards
Vincent on 21 Aug 2025 at 0:17
%obtain gradient of Q value wrt actor
function actorgradient = actor_gradient(actorNet,critic1,critic2,states,batchsize)
actoraction = getAction(actorNet,states); %obtain actor action
%obtain Q values
Q1 = getValue(critic1,states,actoraction);
Q2 = getValue(critic2,states,actoraction);
%obtain min Q values + reverse sign for gradient ascent
Qmin = min(Q1,Q2);
Q = -1*Qmin;
%accumulate gradient over minibatch
gradient = dlgradient(Q(1),actorNet.Learnables);
for i = 2:batchsize
grad = dlgradient(Q(i),actorNet.Learnables);
for j = 1:6
gradient{j} = gradient{j} + grad{j};
end
end
%average out gradient
for i = 1:6
gradient{i} = (1/batchsize)*gradient{i};
end
actorgradient = gradient;
end
As an update, I tried changing my actor gradient to accumulate the gradient over each sample in the mini-batch, and my agent still didn't learn.

Answers (1)

Emmanouil Tzorakoleftherakis on 21 Aug 2025 at 13:00
Edited: Emmanouil Tzorakoleftherakis on 21 Aug 2025 at 13:03

A few things:

  1. The target action policy smoothing can be vectorized in the critic loss fcn
  2. Sampling to mini-batch data can be vectorized
  3. The critic loss function doesn't account for truncated episodes (e.g. isdone == 2)
  4. I think this line will always return false "if (isdone == false)". I don't think it particularly matters in this case however
  5. You are only sampling 1 mini-batch at the end of episode. We sample up to MaxMiniBatchPerEpoch samples per episode
  6. Your getActionWithExplorationImpl does not add noise (no exploration)
  7. You don't implement action noise decay as you did in the built-in agent.

There may still be small details here and there. I would focus more on getting your custom agent to start learning, rather than trying to replicate the built-in one, at least initially.
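To make points 1, 2, and 6 concrete, here is a rough sketch of what the vectorized and exploring versions could look like. It reuses the variable names and the getAction/clip/randsample calls already in the question's code and the 0.1 exploration standard deviation from the built-in agent options, so treat it as a sketch under those assumptions rather than the built-in implementation:
% (1) target policy smoothing without the per-sample loop
target_actions = getAction(target_actor,nextstates);
target_noise = clip(sqrt(0.2)*randn(size(target_actions{1})),-0.5,0.5); % one noise draw per action and sample
target_actions{1} = clip(target_actions{1} + target_noise,-1,1);
% (2) mini-batch sampling without the copy loop (dlarrays support vector indexing)
sample_index_vector = randsample(max_index,obj.batchsize);
state_batch = obj.statebuffer(:,:,sample_index_vector);
nextstate_batch = obj.nextstatebuffer(:,:,sample_index_vector);
action_batch = obj.actionbuffer(:,:,sample_index_vector);
reward_batch = obj.rewardbuffer(:,sample_index_vector);
done_batch = obj.donebuffer(:,sample_index_vector);
% (6) exploration: actually add clipped Gaussian noise in getActionWithExplorationImpl
function action = getActionWithExplorationImpl(obj,Observation)
action = getAction(obj.actor,Observation);
noise = 0.1*randn(obj.actionsize,1); % same 0.1 standard deviation used in the built-in agent options
action{1} = clip(action{1} + noise,-1,1); % keep the action within [-1, 1]
end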

  5 Comments
Drew Davis on 22 Aug 2025 at 13:51
Your original actor gradients look generally correct. While your modified actor gradient looks conceptually correct, it will be horribly inefficient, not taking advantage of vectorized gradient operations.
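To spell out why those two versions should agree: the gradient operator is linear, so differentiating the mini-batch mean of the minimum Q value gives the same result as averaging the per-sample gradients,
\nabla_\theta \left[ \frac{1}{N}\sum_{i=1}^{N} \min\big(Q_1(s_i,\pi_\theta(s_i)),\,Q_2(s_i,\pi_\theta(s_i))\big) \right] = \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \min\big(Q_1(s_i,\pi_\theta(s_i)),\,Q_2(s_i,\pi_\theta(s_i))\big),
so the difference between the two actor-gradient implementations is efficiency, not the value of the gradient.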
Also, your critic operations are not consistent with https://www.mathworks.com/help/reinforcement-learning/ug/td3-agents.html. Specifically, you re-compute targets for each critic loss. In the toolbox implementation, targets are computed once and then used as the target for each critic loss. This may not make a huge difference, but it is something to point out as a difference from the aforementioned doc page.
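As a sketch of what that shared-target computation could look like: the lines below reuse the getValue calls and variable names from the question's critic_gradient, assume target_actions has already been smoothed as before, and assume both critics (here called critic1 and critic2) are passed into the same loss/gradient function, so they are illustrative rather than the toolbox implementation.
% compute the TD3 target once from the target networks
Qtarget = rewards + 0.99*min(getValue(target_critic_1,nextstates,target_actions), ...
getValue(target_critic_2,nextstates,target_actions)).*(1 - dones);
% reuse the same target in both critic losses
criticLoss1 = 0.5*mean((Qtarget - getValue(critic1,states,actions)).^2);
criticLoss2 = 0.5*mean((Qtarget - getValue(critic2,states,actions)).^2);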
You can sample mini-batches at every step, but that would not be consistent with the official toolbox implementation (LearningFrequency defaults to -1, indicating that learning occurs at the end of each episode). If you want to change your implementation to learn from sampled mini-batches at every step, consider setting the following on the toolbox agent options to have an apples-to-apples comparison.
agentOptions.LearningFrequency = 1
If you are not convinced about computing gradients, I suggest creating some simple test cases using simple actor and critic networks where you can hand derive the exact analytical gradients. That way you can compare your loss function gradients to the expected gradients and debug as needed.
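For example, a gradient check on a toy network could look roughly like the sketch below. It uses a single linear fully connected layer so the gradient of the mean-squared-error loss can be derived by hand; the names here (lossAndGrad, dW_analytic, db_analytic) are made up for illustration.
% toy check: one linear fully connected layer, hand-derivable MSE gradient
rng(0)
layers = [featureInputLayer(3); fullyConnectedLayer(1)];
net = dlnetwork(layers);
X = dlarray(randn(3,5),"CB"); % 3 features, 5 samples
T = dlarray(randn(1,5),"CB"); % targets
% gradient from automatic differentiation
[loss,grads] = dlfeval(@lossAndGrad,net,X,T);
% analytical gradient of L = 0.5*mean((W*X + b - T).^2)
W = extractdata(net.Learnables.Value{1}); % fc weights
b = extractdata(net.Learnables.Value{2}); % fc bias
E = W*extractdata(X) + b - extractdata(T); % per-sample errors
dW_analytic = (1/5)*E*extractdata(X)'; % compare with extractdata(grads.Value{1})
db_analytic = mean(E,2); % compare with extractdata(grads.Value{2})
function [loss,grads] = lossAndGrad(net,X,T)
loss = 0.5*mean((forward(net,X) - T).^2);
grads = dlgradient(loss,net.Learnables);
end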
You can also use this example as inspiration for writing your loss functions, since the DDPG and TD3 loss functions are generally the same (TD3 adds one more critic and target action noise).
Good luck with your thesis
Drew
Vincent on 23 Aug 2025 at 13:20
Edited: Vincent on 23 Aug 2025 at 13:26
Thanks for the reply. I think @Emmanouil Tzorakoleftherakis was correct that I wasn't doing any exploration. Once I fixed that, it finally started to learn. Also, I'm not familiar with vectorization. Do you mind giving a sample code example?
I appreciate all the help from everyone. Once again, sorry for the spam; I was just under a lot of stress.
