Generate SIMD Code from MATLAB Functions for ARM Platforms

You can generate single instruction, multiple data (SIMD) code from certain MATLAB® functions by using ARM® Neon technology. SIMD is a computing paradigm in which a single instruction processes multiple data. Many modern processors have SIMD instructions that, for example, perform several additions or multiplications at once. For computationally intensive operations on supported functions, SIMD intrinsics can significantly improve the performance of the generated code on ARM Cortex®-A platforms.
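
To make the paradigm concrete, the following minimal C sketch contrasts a scalar loop with a loop that adds four single-precision values per iteration. It is an illustration only, not output of the code generator. The function names add_scalar and add_simd are hypothetical, and the Neon path compiles only when the compiler defines __ARM_NEON; on other targets the function falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar version: one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *c, size_t n)
{
  assert(a != NULL && b != NULL && c != NULL);
  for (size_t i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
  }
}

/* SIMD version (hypothetical helper): four additions per iteration
 * when Neon is available, scalar code otherwise. */
void add_simd(const float *a, const float *b, float *c, size_t n)
{
  size_t i = 0;
  assert(a != NULL && b != NULL && c != NULL);
#if defined(__ARM_NEON)
  for (; i + 4 <= n; i += 4) {
    /* Load four floats from each input, add them, store four results. */
    vst1q_f32(&c[i], vaddq_f32(vld1q_f32(&a[i]), vld1q_f32(&b[i])));
  }
#endif
  /* Remaining elements, or every element on non-Neon builds. */
  for (; i < n; i++) {
    c[i] = a[i] + b[i];
  }
}
```

Both functions produce the same results; the SIMD version needs roughly a quarter of the loop iterations on a Neon target.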

To generate SIMD code for Intel® platforms, see Generate SIMD Code from MATLAB Functions for Intel Platforms.

MATLAB Functions That Support SIMD Code for ARM

When certain conditions are met, you can generate SIMD code by using ARM Neon technology. This table lists MATLAB functions that support SIMD code generation. The table also details the conditions under which the support is available. Some other functions support SIMD code generation when they generate control flow code that the code generator can convert to vectorized code. For example, the code generator can replace some for-loops that contain conditional expressions with SIMD instructions.
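
As a sketch of how a loop with a conditional expression can map to SIMD instructions, the following C code replaces a per-element branch with a vector compare and a lane-wise select. This is one common vectorization pattern, shown under the assumption that it resembles what a vectorizer does; it is not confirmed output of the code generator. The function names are hypothetical, and the Neon path is used only when __ARM_NEON is defined.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar loop with a conditional: negative inputs become zero. */
void zero_negatives_scalar(const float *x, float *y, size_t n)
{
  assert(x != NULL && y != NULL);
  for (size_t i = 0; i < n; i++) {
    if (x[i] < 0.0f) {
      y[i] = 0.0f;
    } else {
      y[i] = x[i];
    }
  }
}

/* Branch-free version (hypothetical helper): compare four lanes at
 * once, then select per lane between zero and the input value. */
void zero_negatives_simd(const float *x, float *y, size_t n)
{
  size_t i = 0;
  assert(x != NULL && y != NULL);
#if defined(__ARM_NEON)
  const float32x4_t zero = vdupq_n_f32(0.0f);
  for (; i + 4 <= n; i += 4) {
    float32x4_t v = vld1q_f32(&x[i]);
    uint32x4_t is_negative = vcltq_f32(v, zero); /* all ones where v < 0 */
    vst1q_f32(&y[i], vbslq_f32(is_negative, zero, v));
  }
#endif
  /* Remaining elements, or every element on non-Neon builds. */
  for (; i < n; i++) {
    y[i] = (x[i] < 0.0f) ? 0.0f : x[i];
  }
}
```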

MATLAB Function	Conditions
`plus`	The input argument is of data type `single`, `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, or `uint64`. For integer data types, Saturate on integer overflow is set to `No`.
`minus`	The input argument is of data type `single`, `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, or `uint64`. For integer data types, Saturate on integer overflow is set to `No`.
`times`	The input argument is of data type `single`, `int8`, `int16`, `int32`, `uint8`, `uint16`, or `uint32`. For integer data types, Saturate on integer overflow is set to `No`.
`max`	The input argument is of data type `single`, `int8`, `int16`, `int32`, `uint8`, `uint16`, or `uint32`.
`min`	The input argument is of data type `single`, `int8`, `int16`, `int32`, `uint8`, `uint16`, or `uint32`.
`bitand`	The input argument is of data type `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, or `uint64`.
`bitor`	The input argument is of data type `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, or `uint64`.
`bitxor`	The input argument is of data type `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, or `uint64`.
`bitshift`	The input argument is of data type `int8`, `int16`, `int32`, or `int64`.
`cast`	The function performs one of the following data type conversions from the input argument to the output argument: `single` to `int32`, `int32` to `single`, or `uint32` to `single`.
`lt` (`<`), `le` (`<=`), `gt` (`>`), `ge` (`>=`), or `eq` (`==`)	The input argument is of data type `single`, `int32`, or `uint32`.
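
The integer rows in the table require Saturate on integer overflow to be set to `No`. The following C sketch illustrates the difference between the two overflow behaviors for an `int8` addition. It assumes that turning saturation off results in wrapping arithmetic, which is an assumption about the target compiler's behavior rather than a statement from this page. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating add: results outside [-128, 127] clip to the nearest limit,
 * which matches the default MATLAB integer arithmetic behavior. */
int8_t add_saturate_i8(int8_t a, int8_t b)
{
  int sum = (int)a + (int)b;
  if (sum > INT8_MAX) {
    return INT8_MAX;
  }
  if (sum < INT8_MIN) {
    return INT8_MIN;
  }
  return (int8_t)sum;
}

/* Wrapping add (assumed behavior without saturation): the addition is
 * done in unsigned arithmetic, where overflow wraps by definition.
 * Converting an out-of-range value back to int8_t is
 * implementation-defined in C and wraps on mainstream compilers. */
int8_t add_wrap_i8(int8_t a, int8_t b)
{
  return (int8_t)(uint8_t)((uint8_t)a + (uint8_t)b);
}

/* Small self-check that shows where the two behaviors differ. */
void check_overflow_modes(void)
{
  assert(add_saturate_i8(100, 100) == 127); /* clipped */
  assert(add_wrap_i8(100, 100) == -56);     /* 200 - 256 */
  assert(add_saturate_i8(3, 4) == add_wrap_i8(3, 4));
}
```

The two functions agree whenever the true sum fits in the data type, so the setting changes results only for inputs that overflow.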

If you have DSP System Toolbox™, you can generate SIMD code from certain MATLAB System objects. For more information, see System objects in DSP System Toolbox that Support SIMD Code Generation (DSP System Toolbox).

Generate Plain C Code and SIMD Code for ARM

Consider the MATLAB function dynamic. This function consists of addition and multiplication operations between the variable-size arrays A and B. These arrays have a data type of single and maximum dimensions of 100-by-100.

function C = dynamic(A, B)
   assert(all(size(A) <= [100 100]));
   assert(all(size(B) <= [100 100]));
   assert(isa(A, 'single'));
   assert(isa(B, 'single'));

   C = zeros(size(A), 'like', A);
   for i = 1:numel(A)
       C(i) = (A(i) .* B(i)) + (A(i) .* B(i));
   end
end

To generate plain C code:

For C library code generation, create a code generation configuration object.
```
cfg = coder.config('lib');
```
Set the coder.HardwareImplementation object ProdHWDeviceType property to 'ARM Compatible->ARM Cortex-A (32-bit)' or 'ARM Compatible->ARM Cortex-A (64-bit)'. Alternatively, set HardwareBoard to a board that results in those device type values.
```
cfg.HardwareImplementation.ProdHWDeviceType = 'ARM Compatible->ARM Cortex-A (32-bit)';
```
If you are using the MATLAB Coder™ app to generate code:
- Set the Hardware Board parameter to None - Select device below, NVIDIA Drive, or NVIDIA Jetson. Alternatively, select a hardware board that results in the following device parameter values.
- Set the Device vendor parameter to ARM Compatible.
- Set the Device type parameter to ARM Cortex-A (32-bit) or ARM Cortex-A (64-bit).
To generate a static library in the default location, codegen\lib\dynamic, use the codegen function.
```
codegen('-config', cfg, 'dynamic');
```

In the list of generated files, click dynamic.c. In the plain (non-SIMD) C code, each loop iteration produces one result.

void dynamic(const float A_data[], const int A_size[2], const float B_data[],
             const int B_size[2], float C_data[], int C_size[2])
{
  int i;
  int loop_ub_tmp;
  (void)B_size;
  C_size[0] = (signed char)A_size[0];
  C_size[1] = (signed char)A_size[1];
  loop_ub_tmp = (signed char)A_size[0] * (signed char)A_size[1];
  if (loop_ub_tmp - 1 >= 0) {
    memset(&C_data[0], 0, (unsigned int)loop_ub_tmp * sizeof(float));
  }
  loop_ub_tmp = A_size[0] * A_size[1];
  for (i = 0; i < loop_ub_tmp; i++) {
    float f;
    f = A_data[i] * B_data[i];
    C_data[i] = f + f;
  }
}

To generate SIMD C code:

For C library code generation, use the coder.config function to create a code generation configuration object.
```
cfg = coder.config('lib');
```
Set the coder.HardwareImplementation object ProdHWDeviceType property to 'ARM Compatible->ARM Cortex-A (32-bit)' or 'ARM Compatible->ARM Cortex-A (64-bit)'. Alternatively, set HardwareBoard to a board that results in those device type values.
```
cfg.HardwareImplementation.ProdHWDeviceType = 'ARM Compatible->ARM Cortex-A (32-bit)';
```
If you are using the MATLAB Coder app to generate code:
- Set the Hardware Board parameter to None - Select device below, NVIDIA Drive, or NVIDIA Jetson. Alternatively, select a hardware board that results in the following device parameter values.
- Set the Device vendor parameter to ARM Compatible.
- Set the Device type parameter to ARM Cortex-A (32-bit) or ARM Cortex-A (64-bit).
Set the InstructionSetExtensions property to 'Neon v7'. The Neon v7 instruction set extension supports ARM v7 and later target hardware, including ARM v8 and ARM v9.
```
cfg.InstructionSetExtensions = 'Neon v7';
```
If you are using the MATLAB Coder app to generate code, on the Speed tab, set the Leverage target hardware instruction set extensions parameter to Neon v7.
Optionally, set the OptimizeReductions property to 'on' to generate SIMD code for reduction operations, such as sums and products.
```
cfg.OptimizeReductions = 'on';
```
If you are using the MATLAB Coder app to generate code, on the Speed tab, select the Optimize reductions parameter.
Optionally, set the InstructionSetExtensionsConfig.FMA property to 'on' to generate SIMD code for fused multiply-add operations.
```
cfg.InstructionSetExtensionsConfig.FMA = 'on';
```
If you are using the MATLAB Coder app to generate code, on the Speed tab, select the FMA parameter.
Use the codegen function to generate a static library in the default location, codegen\lib\dynamic.
```
codegen('-config', cfg, 'dynamic');
```

In the list of generated files, click dynamic.c.

void dynamic(const float A_data[], const int A_size[2], const float B_data[],
             const int B_size[2], float C_data[], int C_size[2])
{
  int i;
  int loop_ub_tmp;
  int scalarLB;
  int vectorUB;
  (void)B_size;
  C_size[0] = (signed char)A_size[0];
  C_size[1] = (signed char)A_size[1];
  loop_ub_tmp = (signed char)A_size[0] * (signed char)A_size[1];
  if (loop_ub_tmp - 1 >= 0) {
    memset(&C_data[0], 0, (unsigned int)loop_ub_tmp * sizeof(float));
  }
  loop_ub_tmp = A_size[0] * A_size[1];
  scalarLB = (loop_ub_tmp / 4) << 2;
  vectorUB = scalarLB - 4;
  for (i = 0; i <= vectorUB; i += 4) {
    vst1q_f32(&C_data[i],
              vmlaq_f32(vmulq_f32(vld1q_f32(&A_data[i]), vld1q_f32(&B_data[i])),
                        vld1q_f32(&A_data[i]), vld1q_f32(&B_data[i])));
  }
  for (i = scalarLB; i < loop_ub_tmp; i++) {
    float f;
    f = A_data[i] * B_data[i];
    C_data[i] = f + f;
  }
}

The SIMD instructions are the intrinsic functions that start with the prefix v, such as vld1q_f32, vmulq_f32, vmlaq_f32, and vst1q_f32. These functions process multiple data elements in a single loop iteration: for the single data type, the loop increments by four, so each iteration produces four results. For functions that process more data and are more computationally intensive than this one, SIMD instructions can significantly reduce the execution time of the generated code.
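
The nested intrinsic call in the vector loop can be read lane by lane. In the Neon intrinsics, vmulq_f32(x, y) yields x * y and vmlaq_f32(acc, x, y) yields acc + x * y in each of the four lanes, so the generated expression computes A .* B + A .* B. The following C sketch models one lane in scalar code and, when __ARM_NEON is defined, evaluates four lanes with the same intrinsics. The helper names are hypothetical and are not part of the generated code.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar model of one lane: vmulq_f32(x, y) yields x * y. */
static float lane_mul(float x, float y)
{
  return x * y;
}

/* Scalar model of one lane: vmlaq_f32(acc, x, y) yields acc + x * y. */
static float lane_mla(float acc, float x, float y)
{
  return acc + x * y;
}

/* One element of the generated vector expression, in scalar form. */
float dynamic_lane(float a, float b)
{
  return lane_mla(lane_mul(a, b), a, b);
}

/* Hypothetical helper: evaluates four elements at once, using the same
 * intrinsics as the generated loop body on Neon targets. */
void dynamic_block4(const float a[4], const float b[4], float c[4])
{
  assert(a != NULL && b != NULL && c != NULL);
#if defined(__ARM_NEON)
  float32x4_t va = vld1q_f32(a);
  float32x4_t vb = vld1q_f32(b);
  vst1q_f32(c, vmlaq_f32(vmulq_f32(va, vb), va, vb));
#else
  for (size_t i = 0; i < 4; i++) {
    c[i] = dynamic_lane(a[i], b[i]);
  }
#endif
}
```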

The generated code contains a second for loop because the loop that contains SIMD code processes elements in groups of four for the single data type. When the number of elements is not a multiple of four, the second loop processes the remaining elements one at a time.
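
The loop bounds in the generated code can be checked with a small worked example. The following C sketch reuses the scalarLB and vectorUB arithmetic from the generated function and counts how many iterations each loop runs. The LoopSplit type and split_loop function are hypothetical names introduced for illustration.

```c
#include <assert.h>

/* Hypothetical record of how an n-element loop is split. */
typedef struct {
  int scalarLB;        /* first index handled by the scalar loop */
  int vectorUB;        /* last start index of the vector loop */
  int vector_iters;    /* iterations of the four-wide loop */
  int remainder_iters; /* iterations of the scalar remainder loop */
} LoopSplit;

LoopSplit split_loop(int n)
{
  LoopSplit s;
  int i;
  assert(n >= 0);
  /* Same arithmetic as the generated code: round n down to a
   * multiple of four. */
  s.scalarLB = (n / 4) << 2;
  s.vectorUB = s.scalarLB - 4;
  s.vector_iters = 0;
  s.remainder_iters = 0;
  for (i = 0; i <= s.vectorUB; i += 4) {
    s.vector_iters++; /* four elements per iteration */
  }
  for (i = s.scalarLB; i < n; i++) {
    s.remainder_iters++; /* one element per iteration */
  }
  return s;
}
```

For n = 10, scalarLB is 8 and vectorUB is 4, so the vector loop runs for i = 0 and i = 4 and the scalar loop handles indices 8 and 9. For n = 3, vectorUB is negative, so the vector loop does not run and the scalar loop handles all three elements.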