GPU Execution Profiling of the Generated Code

This example shows you how to generate an execution profiling report for the generated CUDA® code by using the gpucoder.profile function. Fog rectification is used as an example to demonstrate this concept.

Prerequisites

  • CUDA-enabled NVIDIA® GPU with compute capability 3.2 or higher.

  • NVIDIA CUDA toolkit and driver.

  • Environment variables for the compilers and libraries. For information on the supported versions of the compilers and libraries, see Third-party Products. For setting up the environment variables, see Environment Variables.

  • Image Processing Toolbox™ for reading and displaying images.

  • Embedded Coder® for generating the report.

  • This example is supported only on the Linux® platform.

Create a Folder and Copy Relevant Files

The following line of code creates a folder in your current working folder (pwd), and copies all the relevant files into this folder. If you do not want to perform this operation or if you cannot generate files in this folder, change your current working folder.

gpucoderdemo_setup('gpucoderdemo_fog_rectification_profile');

Verify the GPU Environment

Use the coder.checkGpuInstall function to verify that the compilers and libraries needed to run this example are set up correctly.

envCfg = coder.gpuEnvConfig('host');
envCfg.BasicCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);
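Because Quiet is enabled, the check above prints nothing. As a minimal sketch, you can capture the return value of coder.checkGpuInstall and inspect it yourself. The field names of the returned structure can differ between releases, so this snippet lists them instead of assuming any particular field; the variable name results is chosen here only for illustration.

```matlab
% Sketch only: capture the check results instead of discarding them.
envCfg = coder.gpuEnvConfig('host');
envCfg.BasicCodegen = 1;
envCfg.Quiet = 1;

results = coder.checkGpuInstall(envCfg);

% List whichever checks this release reports, then show their values.
disp(fieldnames(results));
disp(results);
```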

Prepare for Code Generation and Profiling

The fog_rectification.m function takes a foggy image as input and returns a defogged image. To generate CUDA code, create a GPU code configuration object with a dynamic library ('dll') build type. Because the gpucoder.profile function accepts only an Embedded Coder configuration object, a coder.EmbeddedCodeConfig configuration object is used even if the ecoder option is not explicitly enabled.

inputImage = imread('foggyInput.png');
inputs = {inputImage};
designFileName = 'fog_rectification';

cfg = coder.gpuConfig('dll');
cfg.GpuConfig.MallocMode = 'discrete';

Generate Execution Profiling Report

Run gpucoder.profile with a 'Threshold' of 0.003 to see the SIL execution report. The value 0.003 is only a representative number. If the generated code makes many CUDA API or kernel calls, each call is likely to account for only a small proportion of the total time. It is therefore advisable to set a low threshold value (between 0.001 and 0.005) to generate a meaningful profiling report. It is also not advisable to set 'NumCalls' to a very low number (less than 5), because the result will not accurately represent a typical execution profile.

gpucoder.profile(designFileName, inputs, 'CodegenConfig', cfg, 'Threshold', 0.003, 'NumCalls', 10);
### Starting SIL execution for 'fog_rectification'
    To terminate execution: clear fog_rectification_sil
    Execution profiling data is available for viewing. Open Simulation Data Inspector.
    Execution profiling report available after termination.
### Stopping SIL execution for 'fog_rectification'

Code Execution Profiling Report for fog_rectification

The code execution profiling report provides metrics based on data collected from a SIL or PIL execution. Execution times are calculated from data recorded by instrumentation probes added to the SIL or PIL test harness or inside the code generated for each component. See Code Execution Profiling for more information on Sections 1 and 2. Note that these numbers are representative and the actual values depend on your hardware setup. This profiling was done using MATLAB R2019a on a machine with an 8-core, 2.6 GHz Intel® Xeon® CPU and an NVIDIA Titan XP GPU.

1. Summary

2. Profiled Sections of Code

3. GPU Profiling Trace for fog_rectification

Section 3 shows the complete trace of GPU calls whose run time is higher than the 'Threshold'. The 'Threshold' parameter is defined as a fraction of the 'maximum execution time' for a run (excluding the first run). For example, out of 9 calls to the top-level 'fog_rectification' function, if the third call took the maximum time (say t ms), then the 'maximum execution time' is t milliseconds. All GPU calls taking more than (Threshold*t) ms are shown in this section. Hovering over a call shows its run-time values and other relevant non-timing information. For example, hovering over fog_rectification_kernel10 shows the block dimensions, grid dimensions, and static shared memory size in KiB of that call. This trace corresponds to the run that took the maximum time.
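The cutoff arithmetic described above can be sketched in a few lines of plain MATLAB. All timing values and the call name hypothetical_small_kernel are made up for illustration; they are not taken from the report, and the snippet mimics only the stated rule, not the internals of gpucoder.profile.

```matlab
% Hypothetical top-level run times in ms for NumCalls = 10 (illustrative only).
runTimes  = [35.0 12.1 12.3 12.8 12.0 12.4 12.2 12.5 12.1 12.3];
threshold = 0.003;

% Exclude the first run, then find the maximum execution time t.
profiled = runTimes(2:end);
[tMax, slowestRun] = max(profiled);   % slowestRun indexes the 9 profiled runs

% Calls are reported only if they take longer than Threshold * t.
cutoff = threshold * tMax;            % in ms

% Hypothetical GPU call durations (ms) within the slowest run.
callNames = {'cudaMalloc', 'cudaMemcpy', 'hypothetical_small_kernel', 'cudaFree'};
callTimes = [0.90 0.25 0.02 1.95];

shown = callNames(callTimes > cutoff);
fprintf('Slowest profiled run: %d, t = %.1f ms, cutoff = %.4f ms\n', ...
    slowestRun, tMax, cutoff);
disp(shown);
```

With these made-up numbers, the call lasting 0.02 ms falls below the cutoff and would be left out of the trace, which is why a larger threshold can hide many short kernel launches.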

4. GPU Profiling Summary for fog_rectification

Section 4 of the report summarizes the GPU calls shown in Section 3. In this report, 'cudaFree' is called 25 times per run of 'fog_rectification', and the average time taken by those 25 calls over 9 runs of 'fog_rectification' is 1.9652 ms. The summary is sorted in descending order of time taken, so you can see which GPU calls take the most time.
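One way to read the summary is as a per-run total for each call type, averaged over the profiled runs and then sorted. The sketch below reproduces that reading with hypothetical numbers. The matrix perRunTotals and its values are assumptions for illustration, and the report may compute its figures differently.

```matlab
% Hypothetical data: rows are the 9 profiled runs, columns are GPU call types.
% Each entry is the total time (ms) spent in that call type during one run.
callNames    = {'cudaMalloc', 'cudaFree', 'cudaMemcpy'};
perRunTotals = repmat([0.80 1.90 0.40], 9, 1) + (0:8)' * 0.01;

% Average each call type over the runs, then sort in descending order.
avgTime = mean(perRunTotals, 1);
[sortedTime, order] = sort(avgTime, 'descend');

for k = 1:numel(order)
    fprintf('%-12s %.4f ms\n', callNames{order(k)}, sortedTime(k));
end
```

Sorting by average time puts the most expensive call type first, which is the same ordering the summary uses to point you at optimization candidates.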

Run Command: Cleanup

Remove the temporary files and return to the original folder.

cleanup