Performant and Deployable Monocular Visual SLAM

Since R2024a

This example uses:

Visual simultaneous localization and mapping (vSLAM) refers to the process of calculating the position and orientation of a camera with respect to its surroundings while simultaneously mapping the environment. Applications for vSLAM include augmented reality, robotics, and autonomous driving. In this example, the algorithm uses only visual inputs from the camera.

MATLAB® vSLAM examples each show one of these vSLAM implementations:

Modular and Modifiable ─ Builds a visual SLAM pipeline step-by-step by using functions and objects. For more details and a list of these functions and objects, see the Implement Visual SLAM in MATLAB (Computer Vision Toolbox) topic. The approach described in the topic contains modular code, and is designed to teach the details of a vSLAM implementation, that is loosely based on the popular and reliable ORB-SLAM [1] algorithm. The code is easily navigable, enabling you to explore the algorithm and test how its parameters can impact the performance of the system. This modular implementation is most suitable for learning, experimentation, and modification to test your own ideas.
Performant and Deployable ─ Uses the monovslam object which contains a complete vSLAM workflow. The object offers a practical solution with greatly improved execution speed and code generation. To generate multi-threaded C/C++ code from monovslam (Computer Vision Toolbox), you can use MATLAB® Coder™. The generated code is portable, and you can deploy it on non-PC hardware as well as a ROS node, as demonstrated in the Build and Deploy Visual SLAM Algorithm with ROS in MATLAB (Computer Vision Toolbox) example.

This example shows the Performant and Deployable implementation for processing image data from a monocular camera. For more details about the modular and modifiable implementation, see the Monocular Visual Simultaneous Localization and Mapping (Computer Vision Toolbox) example.

Download Data

This example uses data from the TUM RGB-D benchmark [2]. The size of the data set is 1.38 GB. You can download the data set to a temporary folder using this code.

baseDownloadURL = "https://vision.in.tum.de/rgbd/dataset/freiburg3/rgbd_dataset_freiburg3_long_office_household.tgz"; 
dataFolder = fullfile(tempdir,"tum_rgbd_dataset",filesep); 
options = weboptions(Timeout=Inf);
tgzFileName = dataFolder + "fr3_office.tgz";
folderExists = exist(dataFolder,"dir");

% Create a folder in a temporary directory in which to save the downloaded file
if ~folderExists  
    mkdir(dataFolder)
    disp("Downloading fr3_office.tgz (1.38 GB). This download can take a few minutes.") 
    websave(tgzFileName,baseDownloadURL,options); 
    
    % Extract contents of the downloaded file
    disp("Extracting fr3_office.tgz (1.38 GB) ...") 
    untar(tgzFileName,dataFolder); 
end

Create an imageDatastoreobject to store the RGB images.

imageFolder   = dataFolder + "rgbd_dataset_freiburg3_long_office_household/rgb/";
imds          = imageDatastore(imageFolder);

Implement Visual SLAM

Create a cameraIntrinsics object to store the camera intrinsic parameters. The intrinsic parameters for the data set are available on the data set webpage [3]. Note that the images in the data set are already undistorted, hence there is no need to specify the distortion coefficients. If you are using your own camera, you can estimate the intrinsic parameters using the Camera Calibrator (Computer Vision Toolbox) app.

focalLength = [535.4,539.2];  % in units of pixels
principalPoint = [320.1,247.6];  % in units of pixels
imageSize = [480,640];  % in units of pixels
intrinsics = cameraIntrinsics(focalLength,principalPoint,imageSize);

Create a monovslam object with the camera intrinsics object. You may also need to modify these properties based on the video data you intend to use.

MaxNumPoints ─ This property controls the number of ORB features extracted from every frame. For an image resolution of 480-by-640 pixels, set MaxNumPoints 1000. For higher resolutions, such as 720-by-1280, set it to 2000. Larger values slow down the execution speed.
SkipMaxFrames ─ When tracking from frame to frame is reliable, this property sets the maximum number of frames that the object can skip before the next frame to be a key frame. With a frame rate of 30 fps, set SkipMaxFrames to 20. For a slower frame rate, choose a smaller value. Increasing SkipMaxFrames enhances the processing speed, but may lead to tracking loss during fast camera motion.

numPoints = 1000;
numSkipFrames = 20;
vslam = monovslam(intrinsics,MaxNumPoints=numPoints,SkipMaxFrames=numSkipFrames);

Process each image by using the addFrame (Computer Vision Toolbox) function. Use the plot (Computer Vision Toolbox) method to visualize the estimated camera trajectory and the 3-D map points.

for i=1:numel(imds.Files)
    I = readimage(imds,i);
    addFrame(vslam, I);

    if hasNewKeyFrame(vslam)
        % Display 3-D map points and camera trajectory
        plot(vslam);   
    end
end

Note that the monovslam object runs several algorithm parts on separate threads, which can introduce a latency in processing of an image frame added by using the addFrame function.

% Plot intermediate results and wait until all images are processed
while ~isDone(vslam)
    if hasNewKeyFrame(vslam)
        ax = plot(vslam);
    end
end

Figure contains an axes object. The axes object with xlabel X, ylabel Y contains 12 objects of type line, text, patch, scatter. This object represents Camera trajectory.

After all the images are processed, you can collect the final 3-D map points and camera poses for further analysis.

xyzPoints = mapPoints(vslam);
[camPoses,viewIds] = poses(vslam);

Compare with the Ground Truth

You can compare the optimized camera trajectory with the ground truth to evaluate the accuracy of ORB-SLAM. The comparison can also help tune the parameters used the monovslam (Computer Vision Toolbox) object. The downloaded data contains a groundtruth.txt file that stores the ground truth of the camera pose of each frame. You can import the data using the helperImportGroundTruth function. Once you import the ground truth, you can visualize the actual camera trajectory and calculate the root-mean-square-error (RMSE) of trajectory estimates.

% Load ground truth from the downloaded data
gTruthFileName = dataFolder+"rgbd_dataset_freiburg3_long_office_household/groundtruth.txt";
gTruth = helperImportGroundTruth(gTruthFileName,imds);

% Compute the scale between the estimated camera trajectory and 
% the actual camera trajectory
estimatedLocations = vertcat(camPoses.Translation);
actualLocations = vertcat(gTruth(viewIds).Translation);
scale = median(vecnorm(actualLocations,2,2))/ median(vecnorm(estimatedLocations,2,2));
scaledEstimate = estimatedLocations * scale;

% Compute the RMSE of the trajectory estimates
rmse = sqrt(mean( sum((scaledEstimate - actualLocations).^2, 2) ));
disp(['Absolute RMSE for key frame trajectory (m): ', num2str(rmse)]);

Absolute RMSE for key frame trajectory (m): 0.092595

% Plot the ground truth trajectory at the scale of the estimated trajectory
actualLocations = actualLocations/scale;
hold(ax,"on");
plot3(ax,actualLocations(:,1),actualLocations(:,2),actualLocations(:,3),...
    Color="g",LineWidth=2);

Figure contains an axes object. The axes object with xlabel X, ylabel Y contains 13 objects of type line, text, patch, scatter. This object represents Camera trajectory.

Code Generation

You can generate efficient multi-threaded C++ code from the monovslam (Computer Vision Toolbox) object, which is suitable for deployment on a host computer or an embedded platform that has all the third-party dependencies, including OpenCV [4] and g2o [5]. For illustrative purposes, in this section, you generate MEX code. For more information about deploying the generated code as a ROS node, see the Build and Deploy Visual SLAM Algorithm with ROS in MATLAB (Computer Vision Toolbox) example.

To meet the requirements of MATLAB Coder, you must restructure the code to isolate the algorithm from the visualization code. The algorithmic content was encapsulated in the helperMonoVisualSLAM function, which takes an image as the input and outputs 3-D world points and camera poses as matrices. The function internally creates a monovslam (Computer Vision Toolbox) object and saves it into a persistent variable called vslam. Note that helperMonoVisualSLAM function does not display the reconstructed 3-D point cloud or the camera poses. The plot (Computer Vision Toolbox) method of the monovslam (Computer Vision Toolbox) class was not designed to generate code because visualization is typically not deployed on embedded systems.

type("helperMonoVisualSLAM.m"); % display function contents

function [xyzPoint, camPoses] = helperMonoVisualSLAM(image)

%   Copyright 2023 The MathWorks Inc.

%#codegen

persistent vslam xyzPointsInternal camPosesInternal

if isempty(vslam)
    % Create a monovslam class to process the image data
    focalLength    = [535.4, 539.2];    % in units of pixels
    principalPoint = [320.1, 247.6];    % in units of pixels
    imageSize      = [480,     640];    % in units of pixels
    intrinsics     = cameraIntrinsics(focalLength, principalPoint, imageSize);

    vslam = monovslam(intrinsics);
end

% Process each image frame
addFrame(vslam, image);

% Get 3-D map points and camera poses
if isempty(xyzPointsInternal) || hasNewKeyFrame(vslam)
    xyzPointsInternal = mapPoints(vslam);
    camPosesInternal = poses(vslam);
end

xyzPoint = xyzPointsInternal;

% Convert camera poses to homogeneous transformation matrices
camPoses = cat(3, camPosesInternal.A);

You can compile the helperMonoVisualSLAM function into a MEX file by using the codegen command. Note that the generated MEX file has the same name as the original MATLAB file with _mex appended, unless you use the -o option to specify the name of the executable.

compileTimeInputs  = {coder.typeof(I)};

codegen helperMonoVisualSLAM -args compileTimeInputs

Code generation successful.

Process the image data frame-by-frame using the MEX-file.

% Process each image frame 
for i=1:numel(imds.Files)
    [xyzPoints, camPoses] = helperMonoVisualSLAM_mex(readimage(imds,i));
end

Implementations for Other Sensors

To address the challenge of unknown scale in monocular visual SLAM, you can utilize a stereo camera or an RGB-D sensor, both of which can calculate the actual dimensions of the scene. You can find more details about the implementation of stereo and RGB-D visual SLAM on the stereovslam (Computer Vision Toolbox) and rgbdvslam (Computer Vision Toolbox) object pages, respectively. Alternatively, you can integrate the camera with an IMU sensor to recover the actual scale using the visual-inertial SLAM algorithm. These examples demonstrate some of these implementations:

Stereo Visual Simultaneous Localization and Mapping (Computer Vision Toolbox)
Visual SLAM with RGB-D Camera (Computer Vision Toolbox)
Monocular Visual-Inertial SLAM (Computer Vision Toolbox)

Reference

[1] Mur-Artal, Raul, J. M. M. Montiel, and Juan D. Tardos. “ORB-SLAM: A Versatile and Accurate Monocular SLAM System.” IEEE Transactions on Robotics 31, no. 5 (October 2015): 1147–63. https://doi.org/10.1109/TRO.2015.2463671.

[2] Sturm, Jürgen, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. "A benchmark for the evaluation of RGB-D SLAM systems". In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 573-580, 2012.

[3] https://cvg.cit.tum.de/data/datasets/rgbd-dataset/file_formats

[4] “OpenCV: OpenCV Modules. ”https://docs.opencv.org/4.7.0/.

[5] Kümmerle, Rainer, Giorgio Grisetti, Hauke Strasdat, Kurt Konolige, and Wolfram Burgard. “G2o: A General Framework for Graph Optimization.” In 2011 IEEE International Conference on Robotics and Automation, 3607–13, 2011. https://doi.org/10.1109/ICRA.2011.5979949.