GPU Execution Profiling of the Generated Code

This example shows you how to generate an execution profiling report for the generated CUDA® code by using the gpucoder.profile function. Fog rectification is used as an example to demonstrate this concept.


Prerequisites

  • CUDA-enabled NVIDIA® GPU.

  • NVIDIA CUDA toolkit and driver.

  • Environment variables for the compilers and libraries. For information on the supported versions of the compilers and libraries, see Third-Party Hardware. For setting up the environment variables, see Setting Up the Prerequisite Products.

  • The profiling workflow of this example depends on the nvprof tool from NVIDIA. Starting with CUDA toolkit v10.1, NVIDIA restricts access to GPU performance counters to admin users. To make the performance counters available to all users, see the instructions provided by NVIDIA.
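Before profiling, it can help to confirm that nvprof is reachable from the MATLAB session. The following is a minimal sketch, assuming that nvprof is on the system path and accepts the --version option. It does not check whether performance counters are accessible to non-admin users.

```matlab
% Ask the operating system to run nvprof and report its version.
% Assumption: nvprof is installed with the CUDA toolkit and is on the path.
[status, cmdout] = system('nvprof --version');

if status == 0
    % A zero exit status means the command was found and ran.
    fprintf('nvprof found:\n%s\n', cmdout);
else
    % A nonzero status usually means the executable is not on the path.
    warning('nvprof was not found on the system path. Check the CUDA toolkit installation.');
end
```

If the warning appears, check the environment variables described in the prerequisites before running the profiling workflow.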

Verify GPU Environment

To verify that the compilers and libraries necessary for running this example are set up correctly, use the coder.checkGpuInstall function.

envCfg = coder.gpuEnvConfig('host');
envCfg.BasicCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);

Prepare for Code Generation and Profiling

The fog_rectification.m function takes a foggy image as input and returns a defogged image. To generate CUDA code, create a GPU code configuration object with a dynamic library ('dll') build type. Because the gpucoder.profile function accepts only an Embedded Coder configuration object, it uses a coder.EmbeddedCodeConfig configuration object even if you do not explicitly select the ecoder option.

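inputImage = imread('foggyInput.png');   % foggy test image
inputs = {inputImage};                   % inputs passed to the entry-point function
designFileName = 'fog_rectification';    % name of the entry-point function

cfg = coder.gpuConfig('dll');            % GPU configuration, dynamic library build
cfg.GpuConfig.MallocMode = 'discrete';   % use discrete memory allocation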

Generate Execution Profiling Report

Run gpucoder.profile with a threshold value of 0.003 to see the SIL execution report. The value 0.003 is only a representative number. If the generated code makes many CUDA API or kernel calls, each call is likely to account for only a small proportion of the total time. Set a low threshold value (between 0.001 and 0.005) to generate a meaningful profiling report. Avoid setting the number of executions ('NumCalls') to a very low value (less than 5), because so few runs do not give an accurate representation of a typical execution profile.

gpucoder.profile(designFileName, inputs, ...
    'CodegenConfig', cfg, 'Threshold', 0.003, 'NumCalls', 10);

### Starting SIL execution for 'fog_rectification'
    To terminate execution: clear fog_rectification_sil
    Execution profiling data is available for viewing. Open Simulation Data Inspector.
    Execution profiling report available after termination.
### Stopping SIL execution for 'fog_rectification'

Code Execution Profiling Report for the fog_rectification Function

The code execution profiling report provides metrics based on data collected from a SIL or PIL execution. Execution times are calculated from data recorded by instrumentation probes added to the SIL or PIL test harness or inside the code generated for each component. For more information, see View Execution Times (Embedded Coder). The numbers in this report are representative. The actual values depend on your hardware setup. This profiling was done using MATLAB R2020a on a machine with a 6-core, 3.5 GHz Intel® Xeon® CPU and an NVIDIA TITAN XP GPU.

1. Summary

2. Profiled Sections of Code

3. GPU Profiling Trace for fog_rectification

Section 3 shows the complete trace of GPU calls whose run time is higher than the threshold value. The 'Threshold' parameter is defined as a fraction of the maximum execution time for a run (excluding the first run). For example, out of the 9 calls to the top-level fog_rectification function that remain after the first of the 10 runs is excluded, if the third call took the maximum time (t milliseconds), then the maximum execution time is t milliseconds. All GPU calls that take more than Threshold*t milliseconds are shown in this section. Placing your cursor over a call shows the run-time values of other relevant, non-timing information for that call. For example, placing your cursor over fog_rectification_kernel10 shows the block dimensions, grid dimensions, and static shared memory size in KiB of that call. This trace corresponds to the run that took the maximum time.
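The cutoff described above can be worked through with a small calculation. The following is a minimal sketch that uses hypothetical per-run timings (the values in runTimesMs are illustrative and are not taken from the report); it only mirrors the arithmetic and does not call gpucoder.profile.

```matlab
% Hypothetical execution times in milliseconds for 10 runs (illustrative only).
runTimesMs = [48.2 11.6 12.4 11.9 11.7 12.0 11.8 11.6 11.9 11.7];
threshold  = 0.003;                  % same value as the 'Threshold' argument

steadyRunsMs  = runTimesMs(2:end);   % exclude the first run, leaving 9 runs
[tMaxMs, idx] = max(steadyRunsMs);   % slowest of the remaining runs
slowestCall   = idx + 1;             % position among the original 10 runs
cutoffMs      = threshold * tMaxMs;  % GPU calls slower than this are traced

fprintf('Slowest run: call %d of %d (%.1f ms)\n', ...
    slowestCall, numel(runTimesMs), tMaxMs);
fprintf('Trace cutoff: %.4f ms\n', cutoffMs);
```

With these assumed timings, the slowest remaining run takes 12.4 ms, so only GPU calls that take longer than 0.003 * 12.4 = 0.0372 ms would appear in the trace.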

4. GPU Profiling Summary for fog_rectification

Section 4 of the report summarizes the GPU calls that are shown in section 3. For example, cudaFree is called 17 times per run of fog_rectification, and the average time taken by these 17 calls over 9 runs of fog_rectification is 1.7154 milliseconds. The summary is sorted in descending order of time taken, so that you can see which GPU calls take the most time.
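To relate the summary figure to a single call, the reported values can be divided out. This is a short sketch, assuming that 1.7154 milliseconds is the combined time of the 17 cudaFree calls in one run, averaged over the 9 runs; the variable names are illustrative.

```matlab
callsPerRun    = 17;       % cudaFree calls per run of fog_rectification
avgTimePerRun  = 1.7154;   % ms, combined cudaFree time per run (9-run average)

% Average cost of one cudaFree call under the assumption stated above.
avgTimePerCall = avgTimePerRun / callsPerRun;

fprintf('Average time per cudaFree call: %.4f ms\n', avgTimePerCall);
```

Under this reading, each cudaFree call costs roughly 0.1 milliseconds on average, which helps when you compare it against kernels that are called only once per run.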