DDPG reinforcement learning agent doesn't learn

The project is about controlling a robot so it can solve a maze with a DDPG agent, simulated in CoppeliaSim. The problem is that the agent doesn't seem to learn: the action outputs stay almost always the same. The observations are the image captured by the robot's camera and the distance to the exit. The outputs are the base wheel speed and the differential speed between the left and right wheels so the robot can turn.
My custom environment is the following:
classdef CustomEnvironment < rl.env.MATLABEnvironment
%CUSTOMENVIRONMENT: Template for defining custom environment in MATLAB.
%% Properties (set properties' attributes accordingly)
properties
workingDistance
image
clientID
robotHandle
initialPosition
initialOrientation
vrep
wheelBase
leftMotor
rightMotor
camera
parametros
wheelRadius
numSteps
numEpisodes
goal_position
end
%% Necessary Methods
methods
% Constructor method creates an instance of the environment
function this = CustomEnvironment(obsInfo, actInfo, initialWd, initialPosition, initialOrientation, clientID, robotHandle, left_Motor, right_Motor, camera, vrep, wheelBase, wheelRadius, parametros, initialImage)
% Initialize the environment and its properties here
this = this@rl.env.MATLABEnvironment(obsInfo, actInfo);
this.workingDistance = initialWd; % probably redundant
this.image = initialImage;
this.clientID = clientID;
this.robotHandle = robotHandle;
this.leftMotor = left_Motor;
this.rightMotor = right_Motor;
this.camera = camera;
this.initialPosition = initialPosition;
this.initialOrientation = initialOrientation;
this.vrep = vrep;
this.wheelBase = wheelBase;
this.wheelRadius = wheelRadius;
this.parametros = parametros;
this.numSteps = 0;
this.numEpisodes = -1; % Initialized to -1 because there is an episode 0
end
function [Observation, Reward, IsDone, LoggedSignals] = step(this,Action)
% Apply the action in CoppeliaSim (adjust according to your control logic)
applyActioninCoppeliaSim(this, Action);
% Read the camera image (gets the image and updates the
% environment property)
this.setImage();
% Set the distance (computes the distance and updates the
% environment property)
this.setDistance();
% Collect new observations from the environment and compute the reward
Observation = this.getObservation();
Reward = this.computeReward(Action);
% Advance the CoppeliaSim simulation by one time step
this.vrep.simxSynchronousTrigger(this.clientID);
% Define the episode termination conditions (collision/distance)
IsDone = this.checkEndConditions();
% Return observations, reward, and additional signals if needed
LoggedSignals = [];
this.numSteps = this.numSteps + 1;
display(['Step ', num2str(this.numSteps),' of episode ', num2str(this.numEpisodes)])
end
% Reset environment to initial state and output initial observation
function Observation = reset(this)
this.numSteps = 0;
this.numEpisodes = this.numEpisodes + 1;
initialleftWheelSpeed = 0;
initialrightWheelSpeed = 0;
this.vrep.simxSetObjectPosition(this.clientID, this.robotHandle,-1, this.initialPosition, this.vrep.simx_opmode_blocking);
this.vrep.simxSetObjectOrientation(this.clientID, this.robotHandle, -1, this.initialOrientation, this.vrep.simx_opmode_blocking);
this.vrep.simxPauseSimulation(this.clientID, this.vrep.simx_opmode_blocking);
this.vrep.simxStartSimulation(this.clientID, this.vrep.simx_opmode_blocking);
this.vrep.simxSetJointTargetVelocity(this.clientID,this.leftMotor,initialleftWheelSpeed,this.vrep.simx_opmode_blocking);
this.vrep.simxSetJointTargetVelocity(this.clientID,this.rightMotor,initialrightWheelSpeed,this.vrep.simx_opmode_blocking);
% Logic to obtain the new observation after the reset
Observation = this.getObservation();
end
function applyActioninCoppeliaSim(this, Action)
velocity = Action(1); % rad/s
diferentialVelocity = Action(2); % rad/s
leftWheelSpeed = (velocity - diferentialVelocity); % must be in rad/s
rightWheelSpeed = (velocity + diferentialVelocity);
% Apply the wheel speeds to the robot in CoppeliaSim
this.vrep.simxSetJointTargetVelocity(this.clientID,this.leftMotor,leftWheelSpeed,this.vrep.simx_opmode_blocking);
this.vrep.simxSetJointTargetVelocity(this.clientID,this.rightMotor,rightWheelSpeed,this.vrep.simx_opmode_blocking);
end
function reward = computeReward(this, Action)
goal_distance = this.workingDistance;
reward = 10 / (goal_distance);
if goal_distance <= 1
reward = reward + 80;
end
end
function distance = calculateDistance(this)
[returnCode, position] = this.vrep.simxGetObjectPosition(this.clientID, this.robotHandle,-1, this.vrep.simx_opmode_blocking);
% this.goal_position = [1.4, 1.725];
this.goal_position = [12.285, 0.55];
distance = sqrt((position(1) - this.goal_position(1)) ^ 2 + (position(2) - this.goal_position(2)) ^ 2);
end
function img = getImage(this)
% Gets the image img from the simulator
res = 1;
while (res ~= 0)
[res,~,img]=this.vrep.simxGetVisionSensorImage2(this.clientID,this.camera,1,this.vrep.simx_opmode_streaming);
end
end
function IsDone = checkEndConditions(this)
% Logic to check whether the termination conditions are met
IsDone = false; % Initialize to false --> episode does not end
distanceMargin = 0.9;
if this.workingDistance < distanceMargin % the episode ends when the robot reaches the end of the maze
IsDone = true;
else
IsDone = false;
end
end
end
%% Optional Methods (set methods' attributes accordingly)
methods
% Helper methods to create the environment
function setImage(this)
% Function to update the image in the environment
img = this.getImage();
this.image = img;
end
function setDistance(this)
% Function to update the working distance in the environment
wD = this.calculateDistance();
this.workingDistance = wD;
end
function observation = getObservation(this)
% Logic to obtain the observation from the environment
% observation = this.workingDistance; % use the working distance as the observation
observation = {this.image, this.workingDistance}; % trial: also adding the distance
end
% (optional) Visualization method
function plot(this)
% Initiate the visualization
% Update the visualization
envUpdatedCallback(this)
end
end
methods (Access = protected)
% (optional) update visualization every time the environment is updated
% (notifyEnvUpdated is called)
function envUpdatedCallback(this)
end
end
end
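In case it helps, this is roughly how I create and sanity-check the environment before training. It is only a sketch: the CoppeliaSim handles (clientID, robotHandle, left_Motor, right_Motor, camera) and the parametros struct are assumed to come from the usual simxStart / simxGetObjectHandle calls, and obsInfo/actInfo are the specs defined below.
% Sketch: instantiate the custom environment (handle values are placeholders)
env = CustomEnvironment(obsInfo, actInfo, initialWd, initialPosition, ...
    initialOrientation, clientID, robotHandle, left_Motor, right_Motor, ...
    camera, vrep, wheelBase, wheelRadius, parametros, initialImage);
% validateEnvironment checks that reset/step outputs match obsInfo and actInfo
validateEnvironment(env);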
The critic and the actor are the following:
numObservations = parametros.numPixX * parametros.numPixY; % The observation is the non-binarized camera image flattened into a vector
obsInfo = [rlNumericSpec([parametros.numPixX, parametros.numPixY, 1],...
'LowerLimit', zeros(parametros.numPixX, parametros.numPixY),...
'UpperLimit', 255*ones(parametros.numPixX, parametros.numPixY))...
rlNumericSpec([1, 1],...
'LowerLimit', 0,...
'UpperLimit', 200)];
obsInfo(1).Name = 'imageobs';
obsInfo(2).Name = 'distanceobs';
numActions = 2; % The actions are the rotation speeds of the wheel motors
actInfo = rlNumericSpec([numActions 1],...
"LowerLimit",-10,...
"UpperLimit",10);
actInfo(1).Name = 'action';
hiddenLayerSize1 = 400;
hiddenLayerSize2 = 300;
statePath = [
imageInputLayer(obsInfo(1).Dimension, "Normalization","none",Name=obsInfo(1).Name)
convolution2dLayer(10,2,Stride=5,Padding=0)
reluLayer
fullyConnectedLayer(2)
concatenationLayer(3,2,Name="cat1")
fullyConnectedLayer(hiddenLayerSize1)
reluLayer
fullyConnectedLayer(hiddenLayerSize2)
additionLayer(2,Name="add")
reluLayer
fullyConnectedLayer(1,Name="fc4")];
distancePath = [
imageInputLayer(obsInfo(2).Dimension, "Normalization", "none", Name=obsInfo(2).Name)
fullyConnectedLayer(1,Name="fc5", ...
BiasLearnRateFactor=0, ...
Bias=0)];
actionPath = [
featureInputLayer(numActions,"Normalization","none",Name=actInfo(1).Name)
fullyConnectedLayer(hiddenLayerSize2, ...
Name="fc6", ...
BiasLearnRateFactor=0, ...
Bias=zeros(hiddenLayerSize2,1))];
criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork,distancePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = connectLayers(criticNetwork,"fc5","cat1/in2");
criticNetwork = connectLayers(criticNetwork,"fc6","add/in2");
figure
plot(criticNetwork);
% Options for the critic optimizer
criticOptions = rlOptimizerOptions("LearnRate",1e-3,"L2RegularizationFactor",1e-4,"GradientThreshold",1);
critic = rlQValueFunction(criticNetwork,obsInfo,actInfo,"ObservationInputNames",{"imageobs","distanceobs"},"ActionInputNames","action");
statePath = [
imageInputLayer(obsInfo(1).Dimension, "Normalization", "none", Name=obsInfo(1).Name) % Input layer for the image
%convolution2dLayer(10,2,Stride=5,Padding=0)
fullyConnectedLayer(2,Name="fc0")
reluLayer
fullyConnectedLayer(2,Name="fc1")
concatenationLayer(3,2,Name="cat1")
fullyConnectedLayer(hiddenLayerSize1,Name="fc2")
reluLayer
fullyConnectedLayer(hiddenLayerSize2,Name="fc3")
reluLayer
fullyConnectedLayer(numActions,Name=actInfo(1).Name)
tanhLayer % Hyperbolic tangent layer to bound the output between -1 and 1
scalingLayer(Name="scale1", ...
Scale=max(actInfo.UpperLimit)) % This layer is useful for scaling and shifting the outputs of nonlinear layers, such as tanhLayer and sigmoid.
];
distancePath = [
imageInputLayer(obsInfo(2).Dimension, "Normalization", "none",Name=obsInfo(2).Name)
fullyConnectedLayer(1, ...
Name="fc5", ...
BiasLearnRateFactor=0, ...
Bias=0)];
actorNetwork = layerGraph(statePath);
actorNetwork = addLayers(actorNetwork,distancePath);
actorNetwork = connectLayers(actorNetwork,"fc5","cat1/in2");
figure
plot(actorNetwork);
% Options for the actor optimizer
actorOptions = rlOptimizerOptions("LearnRate",1e-3,"L2RegularizationFactor",1e-4,"GradientThreshold",1);
actor = rlContinuousDeterministicActor(actorNetwork,obsInfo,actInfo,"ObservationInputNames",{"imageobs","distanceobs"});
agentOpts = rlDDPGAgentOptions(...
"SampleTime",sampleTime,...
"ActorOptimizerOptions",actorOptions,...
"CriticOptimizerOptions",criticOptions,...
"DiscountFactor",0.995, ... % antes 0.995
"MiniBatchSize",64 , ...
"ExperienceBufferLength",1e7); % antes 1e8
%"NoiseOptions", noiseOptions);
agentOpts.NoiseOptions.Variance = 0.6; % esto es para el otro tipo de
% ruido el de ohstein...
agentOpts.NoiseOptions.VarianceDecayRate = 1e-6;
% Creación del objeto del agente DDPG
obstacleAvoidanceAgent = rlDDPGAgent(actor,critic,agentOpts);
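For completeness, training is launched along these lines. The episode and step counts here are illustrative placeholders, not the exact values I used:
% Sketch of the training call; MaxEpisodes/MaxStepsPerEpisode are illustrative
trainOpts = rlTrainingOptions( ...
    "MaxEpisodes",500, ...
    "MaxStepsPerEpisode",200, ...
    "ScoreAveragingWindowLength",10, ...
    "StopTrainingCriteria","AverageReward", ...
    "StopTrainingValue",1000);
trainingStats = train(obstacleAvoidanceAgent, env, trainOpts);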
Does anyone see something that isn't right at first sight?
  3 Comments
Andrea Fernandez Fernandez
Thank you for your reply. I have made some of the changes you suggested:
  • I have added a convolutional layer to the actor
  • I have used a featureInputLayer for the distance observation
  • I have removed the learn-rate factors that were set to 0
  • I have increased the variance to 0.8 and reduced the decay rate to 1e-7 (see the sketch below)
  • I have increased the mini-batch size to 128
As for the connections with CoppeliaSim, they seem right, and the actions should be scaled into that range.
I have noticed that the actions change during the first 128 steps, but from step 128 onward they stay almost always the same. I don't know if that tells you anything.
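For reference, the updated settings from the list above correspond to something like this (option names as in rlDDPGAgentOptions; the values are the ones I mentioned):
% Sketch of the updated noise and batch settings described above
agentOpts.MiniBatchSize = 128;
agentOpts.NoiseOptions.Variance = 0.8;           % increased from 0.6
agentOpts.NoiseOptions.VarianceDecayRate = 1e-7; % reduced from 1e-6
obstacleAvoidanceAgent = rlDDPGAgent(actor,critic,agentOpts);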


Answers (0)

Release

R2023b
