DDPG reinforcement learning agent doesn't learn
Andrea Fernandez Fernandez
on 18 Feb 2024
The project is about controlling a robot with a DDPG agent so that it can solve a maze. It is simulated in the CoppeliaSim simulator. The problem is that the agent doesn't seem to learn: the action outputs always stay almost the same. The observations are the image captured by the robot's on-board camera and the distance to the exit. The actions are the base wheel speed and the differential speed between the left and right wheels, so the robot can turn.
My custom environment is the following:
classdef CustomEnvironment < rl.env.MATLABEnvironment
    %CUSTOMENVIRONMENT: Template for defining custom environment in MATLAB.
    %% Properties (set properties' attributes accordingly)
    properties
        workingDistance
        image
        clientID
        robotHandle
        initialPosition
        initialOrientation
        vrep
        wheelBase
        leftMotor
        rightMotor
        camera
        parametros
        wheelRadius
        numSteps
        numEpisodes
        goal_position
    end
    %% Necessary Methods
    methods
        % Constructor method creates an instance of the environment
        function this = CustomEnvironment(obsInfo, actInfo, initialWd, initialPosition, initialOrientation, clientID, robotHandle, left_Motor, right_Motor, camera, vrep, wheelBase, wheelRadius, parametros, initialImage)
            % Initialize the environment and its properties here
            this = this@rl.env.MATLABEnvironment(obsInfo, actInfo);
            this.workingDistance = initialWd; % NOTE: probably redundant
            this.image = initialImage;
            this.clientID = clientID;
            this.robotHandle = robotHandle;
            this.leftMotor = left_Motor;
            this.rightMotor = right_Motor;
            this.camera = camera;
            this.initialPosition = initialPosition;
            this.initialOrientation = initialOrientation;
            this.vrep = vrep;
            this.wheelBase = wheelBase;
            this.wheelRadius = wheelRadius;
            this.parametros = parametros;
            this.numSteps = 0;
            this.numEpisodes = -1; % initialized to -1 because there is an episode 0
        end
        function [Observation, Reward, IsDone, LoggedSignals] = step(this, Action)
            % Apply the action in CoppeliaSim (adjust to your control logic)
            applyActioninCoppeliaSim(this, Action);
            % Read the camera image (fetches the image and updates the
            % environment property)
            this.setImage();
            % Update the distance (computes the distance and updates the
            % environment property)
            this.setDistance();
            % Gather the new observations from the environment and compute the reward
            Observation = this.getObservation();
            Reward = this.computeReward(Action);
            % Advance the CoppeliaSim simulation by one time step
            this.vrep.simxSynchronousTrigger(this.clientID);
            % Check the episode termination conditions (collision/distance)
            IsDone = this.checkEndConditions();
            % Return observations, reward and any additional signals if needed
            LoggedSignals = [];
            this.numSteps = this.numSteps + 1;
            disp(['Step ', num2str(this.numSteps), ' of episode ', num2str(this.numEpisodes)])
        end
        % Reset environment to initial state and output initial observation
        function Observation = reset(this)
            this.numSteps = 0;
            this.numEpisodes = this.numEpisodes + 1;
            initialLeftWheelSpeed = 0;
            initialRightWheelSpeed = 0;
            this.vrep.simxSetObjectPosition(this.clientID, this.robotHandle, -1, this.initialPosition, this.vrep.simx_opmode_blocking);
            this.vrep.simxSetObjectOrientation(this.clientID, this.robotHandle, -1, this.initialOrientation, this.vrep.simx_opmode_blocking);
            this.vrep.simxPauseSimulation(this.clientID, this.vrep.simx_opmode_blocking);
            this.vrep.simxStartSimulation(this.clientID, this.vrep.simx_opmode_blocking);
            this.vrep.simxSetJointTargetVelocity(this.clientID, this.leftMotor, initialLeftWheelSpeed, this.vrep.simx_opmode_blocking);
            this.vrep.simxSetJointTargetVelocity(this.clientID, this.rightMotor, initialRightWheelSpeed, this.vrep.simx_opmode_blocking);
            % Get the new observation after the reset
            Observation = this.getObservation();
        end
        function applyActioninCoppeliaSim(this, Action)
            velocity = Action(1);             % base speed, rad/s
            differentialVelocity = Action(2); % differential speed, rad/s
            leftWheelSpeed  = velocity - differentialVelocity; % must be in rad/s
            rightWheelSpeed = velocity + differentialVelocity;
            % Apply the wheel speeds to the robot in CoppeliaSim
            this.vrep.simxSetJointTargetVelocity(this.clientID, this.leftMotor, leftWheelSpeed, this.vrep.simx_opmode_blocking);
            this.vrep.simxSetJointTargetVelocity(this.clientID, this.rightMotor, rightWheelSpeed, this.vrep.simx_opmode_blocking);
        end
        function reward = computeReward(this, Action)
            % Reward is inversely proportional to the distance to the goal,
            % with a bonus when the robot is within 1 m of the exit
            goal_distance = this.workingDistance;
            reward = 10 / goal_distance;
            if goal_distance <= 1
                reward = reward + 80;
            end
        end
        function distance = calculateDistance(this)
            [returnCode, position] = this.vrep.simxGetObjectPosition(this.clientID, this.robotHandle, -1, this.vrep.simx_opmode_blocking);
            % this.goal_position = [1.4, 1.725];
            this.goal_position = [12.285, 0.55];
            distance = sqrt((position(1) - this.goal_position(1))^2 + (position(2) - this.goal_position(2))^2);
        end
        function img = getImage(this)
            % Get the image img from the simulator (retry until a valid frame is returned)
            res = 1;
            while (res ~= 0)
                [res, ~, img] = this.vrep.simxGetVisionSensorImage2(this.clientID, this.camera, 1, this.vrep.simx_opmode_streaming);
            end
        end
        function IsDone = checkEndConditions(this)
            % Check whether the episode termination conditions are met
            IsDone = false; % initialize to false --> episode does not end
            distanceMargin = 0.9;
            if this.workingDistance < distanceMargin % episode ends when the robot reaches the maze exit
                IsDone = true;
            else
                IsDone = false;
            end
        end
    end
    %% Optional Methods (set methods' attributes accordingly)
    methods
        % Helper methods to create the environment
        function setImage(this)
            % Update the image stored in the environment
            img = this.getImage();
            this.image = img;
        end
        function setDistance(this)
            % Update the working distance stored in the environment
            wD = this.calculateDistance();
            this.workingDistance = wD;
        end
        function observation = getObservation(this)
            % Build the observation returned by the environment
            % observation = this.workingDistance; % use only the working distance as the observation
            observation = {this.image, this.workingDistance}; % trial: image plus distance
        end
        % (optional) Visualization method
        function plot(this)
            % Initiate the visualization
            % Update the visualization
            envUpdatedCallback(this)
        end
    end
    methods (Access = protected)
        % (optional) update visualization every time the environment is updated
        % (notifyEnvUpdated is called)
        function envUpdatedCallback(this)
        end
    end
end
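The environment object is then created from the CoppeliaSim handles and the specs defined below, roughly like this (a simplified sketch, not my full connection script; validateEnvironment is only shown as a quick consistency check of the specs against reset/step):
% Simplified sketch: clientID, robotHandle, the motor/camera handles, etc. come
% from the CoppeliaSim remote API connection; obsInfo/actInfo are defined below.
env = CustomEnvironment(obsInfo, actInfo, initialWd, initialPosition, ...
    initialOrientation, clientID, robotHandle, left_Motor, right_Motor, ...
    camera, vrep, wheelBase, wheelRadius, parametros, initialImage);
validateEnvironment(env) % resets and steps the environment to check it against the specs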
The critic and the actor are the following:
numObservations = parametros.numPixX * parametros.numPixY; % the observation is the non-binarized camera image flattened into a vector
obsInfo = [rlNumericSpec([parametros.numPixX, parametros.numPixY, 1], ...
        'LowerLimit', zeros(parametros.numPixX, parametros.numPixY), ...
        'UpperLimit', 255*ones(parametros.numPixX, parametros.numPixY)) ...
    rlNumericSpec([1, 1], ...
        'LowerLimit', 0, ...
        'UpperLimit', 200)];
obsInfo(1).Name = 'imageobs';
obsInfo(2).Name = 'distanceobs';
numActions = 2; % the actions are the rotational speeds of the wheel motors
actInfo = rlNumericSpec([numActions 1], ...
    "LowerLimit", -10, ...
    "UpperLimit", 10);
actInfo(1).Name = 'action';
hiddenLayerSize1 = 400;
hiddenLayerSize2 = 300;
statePath = [
    imageInputLayer(obsInfo(1).Dimension, "Normalization", "none", Name=obsInfo(1).Name)
    convolution2dLayer(10, 2, Stride=5, Padding=0)
    reluLayer
    fullyConnectedLayer(2)
    concatenationLayer(3, 2, Name="cat1")
    fullyConnectedLayer(hiddenLayerSize1)
    reluLayer
    fullyConnectedLayer(hiddenLayerSize2)
    additionLayer(2, Name="add")
    reluLayer
    fullyConnectedLayer(1, Name="fc4")];
distancePath = [
    imageInputLayer(obsInfo(2).Dimension, "Normalization", "none", Name=obsInfo(2).Name)
    fullyConnectedLayer(1, Name="fc5", ...
        BiasLearnRateFactor=0, ...
        Bias=0)];
actionPath = [
    featureInputLayer(numActions, "Normalization", "none", Name=actInfo(1).Name)
    fullyConnectedLayer(hiddenLayerSize2, ...
        Name="fc6", ...
        BiasLearnRateFactor=0, ...
        Bias=zeros(hiddenLayerSize2, 1))];
criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork, distancePath);
criticNetwork = addLayers(criticNetwork, actionPath);
criticNetwork = connectLayers(criticNetwork, "fc5", "cat1/in2");
criticNetwork = connectLayers(criticNetwork, "fc6", "add/in2");
figure
plot(criticNetwork);
% Critic optimizer options
criticOptions = rlOptimizerOptions("LearnRate",1e-3,"L2RegularizationFactor",1e-4,"GradientThreshold",1);
critic = rlQValueFunction(criticNetwork,obsInfo,actInfo,"ObservationInputNames",{"imageobs","distanceobs"},"ActionInputNames","action");
statePath = [
    imageInputLayer(obsInfo(1).Dimension, "Normalization", "none", Name=obsInfo(1).Name) % input layer for the image
    %convolution2dLayer(10,2,Stride=5,Padding=0)
    fullyConnectedLayer(2, Name="fc0")
    reluLayer
    fullyConnectedLayer(2, Name="fc1")
    concatenationLayer(3, 2, Name="cat1")
    fullyConnectedLayer(hiddenLayerSize1, Name="fc2")
    reluLayer
    fullyConnectedLayer(hiddenLayerSize2, Name="fc3")
    reluLayer
    fullyConnectedLayer(numActions, Name=actInfo(1).Name)
    tanhLayer % hyperbolic tangent layer to bound the output between -1 and 1
    scalingLayer(Name="scale1", ...
        Scale=max(actInfo.UpperLimit)) % scales the tanh output to the action range
    ];
distancePath = [
    imageInputLayer(obsInfo(2).Dimension, "Normalization", "none", Name=obsInfo(2).Name)
    fullyConnectedLayer(1, ...
        Name="fc5", ...
        BiasLearnRateFactor=0, ...
        Bias=0)];
actorNetwork = layerGraph(statePath);
actorNetwork = addLayers(actorNetwork, distancePath);
actorNetwork = connectLayers(actorNetwork, "fc5", "cat1/in2");
figure
plot(actorNetwork);
% Actor optimizer options
actorOptions = rlOptimizerOptions("LearnRate", 1e-3, "L2RegularizationFactor", 1e-4, "GradientThreshold", 1);
actor = rlContinuousDeterministicActor(actorNetwork, obsInfo, actInfo, "ObservationInputNames", {"imageobs","distanceobs"});
agentOpts = rlDDPGAgentOptions( ...
    "SampleTime", sampleTime, ...
    "ActorOptimizerOptions", actorOptions, ...
    "CriticOptimizerOptions", criticOptions, ...
    "DiscountFactor", 0.995, ... % previously 0.995
    "MiniBatchSize", 64, ...
    "ExperienceBufferLength", 1e7); % previously 1e8
    %"NoiseOptions", noiseOptions);
agentOpts.NoiseOptions.Variance = 0.6; % these settings apply to the Ornstein-Uhlenbeck exploration noise
agentOpts.NoiseOptions.VarianceDecayRate = 1e-6;
% Create the DDPG agent object
obstacleAvoidanceAgent = rlDDPGAgent(actor, critic, agentOpts);
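To illustrate the symptom: after training, querying the agent with very different observations returns almost the same action. A rough sketch of that check (the test images and distances below are just placeholder values with the right sizes):
% Rough check (sketch): sample the agent's action for two very different observations
testImage1 = zeros(parametros.numPixX, parametros.numPixY, 1);
testImage2 = 255*ones(parametros.numPixX, parametros.numPixY, 1);
a1 = getAction(obstacleAvoidanceAgent, {testImage1, 20});  % far from the exit
a2 = getAction(obstacleAvoidanceAgent, {testImage2, 1.5}); % close to the exit
disp(a1{1}); disp(a2{1}); % the two actions come out almost identical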
Does anyone see something that isn't right at first sight?
3 Comments
Emmanouil Tzorakoleftherakis
on 11 Apr 2024
Have you been able to make progress with this?
Answers (0)