Main Content

coder.gpu.nokernel

Pragma to disable kernel creation for loops

Description

example

coder.gpu.nokernel() is a loop level pragma that when placed immediately before a for loop prevents the code generator from generating CUDA® kernels for the statements within the loop. This pragma does not require any input parameters.

This function is a code generation function. It has no effect in MATLAB®.

Examples

collapse all

This example shows how to use the nokernel pragma in a function and prevent the code generator from generating CUDA kernels for the statements within the loop

In one file, write the entry-point function nestedLoop that accepts two vector inputs A,B of size 32x512. The function has two nested for-loops of different iteration lengths, one for operating along the column and one for operating along the row. The first nested loop computes the sum of the two vector inputs while the second nested loop scales the sum by a factor of three.

function [C] = nestedLoop(A, B)
    G = zeros(32, 512);
    C = zeros(32, 512);        
   
    coder.gpu.kernelfun();
    % This nested loop will be fused
    for i = 1:32
       for j = 1:512
           G(i,j) = A(1,j) + B(1,j);
       end
    end

    coder.gpu.nokernel();  
    for i = 1:32
       for j = 1:512
           C(i,j) = G(i,j) * 3;
       end
    end    
end

Use the codegen function to generate CUDA MEX function.

cfg = coder.gpuConfig('mex');
cfg.GenerateReport = true;
codegen -config cfg -args {ones(1,512,'double'),ones(1,512,'double')} nestedLoop

GPU Coder creates two kernels: nestedLoop_kernel1 to perform the computation G(i,j) = A(1,j) + B(1,j); of the first nested loop and nestedLoop_kernel2 kernel to perform the computation C(i,j) = G(i,j) * 3; of the second nested loop. The second kernel is created for the inner loop of the second nested loop. The noKernel pragma is applicable only to the loop that immediately follows the statement. Snippets of the generated kernels are shown.

static __global__ __launch_bounds__(512, 1) void nestedLoop_kernel1(const real_T
  B[512], const real_T A[512], real_T G[16384])
{
  uint32_T threadId;
  ...
  if (i < 32) {
    G[i + (j << 5)] = A[j] + B[j];
  }
}
static __global__ __launch_bounds__(512, 1) void nestedLoop_kernel2(real_T G
  [16384], int32_T i, real_T C[16384])
{
  uint32_T threadId;
  ...;
  if (j < 512) {
    C[i + (j << 5)] = G[i + (j << 5)] * 3.0;
  }

A snippet of the main function shows that the code generator has fused the first nested loop as indicated by the kernel launch parameters. As mentioned earlier, the outer loop of the second nested loop is the one that is not mapped to a kernel. Hence the code generator places a for-loop statement just before the call to the second CUDA kernel nestedLoop_kernel2.

void nestedLoop(const real_T A[512], const real_T B[512], real_T C[16384])
{
  int32_T i;
  ...
  //  These two loops will be fused
  cudaMemcpy(gpu_B, (void *)&B[0], 4096UL, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_A, (void *)&A[0], 4096UL, cudaMemcpyHostToDevice);
  nestedLoop_kernel1<<<dim3(32U, 1U, 1U), dim3(512U, 1U, 1U)>>>(*gpu_B, *gpu_A, *
    gpu_G);
  for (i = 0; i < 32; i++) {
    nestedLoop_kernel2<<<dim3(1U, 1U, 1U), dim3(512U, 1U, 1U)>>>(*gpu_G, i,
      *gpu_C);
    C_dirtyOnGpu = true;
  }
...
  cudaFree(*gpu_C);
}
Introduced in R2019a