patchEmbeddingLayer

Patch embedding layer

Since R2023b

Description

A patch embedding layer maps patches of pixels to vectors. Use this layer in vision transformer neural networks to encode information about patches in images.

Creation

Syntax

layer = patchEmbeddingLayer(patchSize,outputSize)

layer = patchEmbeddingLayer(patchSize,outputSize,Name=Value)

Description

layer = patchEmbeddingLayer(patchSize,outputSize) creates a patch embedding layer and sets the PatchSize and OutputSize properties.

This feature requires a Deep Learning Toolbox™ license.


layer = patchEmbeddingLayer(patchSize,outputSize,Name=Value) sets additional properties using one or more name-value arguments.


Properties


Patch Embedding

`PatchSize` — Size of patches to split input images into
Read-only: positive integer | row vector of positive integers

This property is read-only.

Size of patches to split input images into, specified as a positive integer or row vector of positive integers.

If PatchSize is a vector, then each element of PatchSize is the size of the patch in the corresponding spatial dimension of the input. If PatchSize is a scalar, then the layer uses the same value for all spatial dimensions of the input.
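As a minimal sketch of the scalar-versus-vector rule, the following base MATLAB code expands a scalar patch size to every spatial dimension and counts the resulting patches. The variable names (imageSize, patchSize, numPatchesPerDim) are illustrative and not part of the layer API, and the sketch assumes that each image dimension is divisible by the patch size.

```matlab
% Illustrative values; these names are assumptions, not layer properties.
imageSize = [384 256];      % spatial size of the input (height, width)
patchSize = 16;             % scalar patch size

% A scalar applies to every spatial dimension, so expand it to a vector.
if isscalar(patchSize)
    patchSize = repmat(patchSize,1,numel(imageSize));
end

% Number of non-overlapping patches along each spatial dimension.
numPatchesPerDim = imageSize ./ patchSize;
numPatches = prod(numPatchesPerDim);

disp(numPatchesPerDim)   % 24 16
disp(numPatches)         % 384
```

Specifying patchSize = [16 8] instead would give 24-by-32 patches for the same image.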

`SpatialFlattenMode` — Mode for flattening output of convolution operation
`"column-major"` (default) | `"row-major"`

Mode for flattening the output of the convolution operation, specified as "column-major" or "row-major".

If SpatialFlattenMode is "column-major", then the flatten operation outputs the data in its column-major representation. For example, consider the input:

A = [1 2 3; 4 5 6; 7 8 9];

The output in column-major representation is:

AFlat = [1 4 7 2 5 8 3 6 9];

If SpatialFlattenMode is "row-major", then the flatten operation outputs the data in its row-major representation. For the same input A, the output in row-major representation is:

AFlat = [1 2 3 4 5 6 7 8 9];

Set this option when creating or importing models that require this representation.
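The two flattening orders can be reproduced in base MATLAB. This is only a sketch of the ordering, not of the layer internals; the 3-by-3 matrix A is chosen to be consistent with the flattened outputs listed above, and the variable names are illustrative.

```matlab
A = [1 2 3; 4 5 6; 7 8 9];

% Column-major: read down each column first (the native MATLAB order).
AFlatColumnMajor = reshape(A,1,[]);

% Row-major: read across each row first, so transpose before flattening.
AFlatRowMajor = reshape(A.',1,[]);

disp(AFlatColumnMajor)   % 1 4 7 2 5 8 3 6 9
disp(AFlatRowMajor)      % 1 2 3 4 5 6 7 8 9
```

Frameworks that store arrays in row-major order typically expect the second ordering, which is the likely reason to choose "row-major" for imported models.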

`OutputSize` — Size of output vectors
Read-only: positive integer

This property is read-only.

Size of output vectors, specified as a positive integer.

`InputSize` — Number of input channels
Read-only: `"auto"` (default) | positive integer

This property is read-only.

Number of input channels, specified as one of these values:

"auto" — Automatically determine the number of input channels at training time.
Positive integer — Configure the layer for the specified number of input channels. InputSize and the number of channels in the layer input data must match. For example, if the input is an RGB image, then InputSize must be 3.
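For illustration, the following sketch fixes the number of input channels when creating the layer. It assumes that InputSize can be passed as a name-value argument at creation, as the Name=Value syntax suggests; the patch size and output size are example values.

```matlab
% Requires Deep Learning Toolbox.
% Fix the number of input channels to 3 for RGB images.
layer = patchEmbeddingLayer(16,768,InputSize=3);

% Inspect the configured number of input channels.
disp(layer.InputSize)
```

Leaving InputSize at "auto" defers this choice until the layer sees its input data.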

Parameters and Initialization

`WeightsInitializer` — Function to initialize weights
`"glorot"` (default) | `"he"` | `"narrow-normal"` | `"zeros"` | `"ones"` | function handle

Function to initialize the weights, specified as one of these values:

"glorot" — Initialize the weights with the Glorot initializer [1] [2] (also known as Xavier initializer). The Glorot initializer independently samples from a uniform distribution with zero mean and a variance of 2/(numIn + numOut), where numIn and numOut are the values of the InputSize and OutputSize properties, respectively.
"he" – Initialize the weights with the He initializer [3]. The He initializer samples from a normal distribution with zero mean and a variance of 2/numIn, where numIn is the value of the InputSize property.
"narrow-normal" — Initialize the weights by independently sampling from a normal distribution with zero mean and a standard deviation of 0.01.
"zeros" — Initialize the weights with zeros.
"ones" — Initialize the weights with ones.
Function handle — Initialize the weights with a custom function. If you specify a function handle, then the function must have the form weights = func(sz), where sz is the size of the weights.

The layer initializes the weights only when the Weights property is empty.

Data Types: char | string | function_handle
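A custom initializer only needs to accept a size vector and return an array of that size. The sketch below uses a hypothetical initializer, customInit, that draws zero-mean normal samples with a standard deviation of 0.02; both the name and the standard deviation are assumptions chosen for illustration.

```matlab
% Hypothetical custom initializer of the form weights = func(sz):
% zero-mean normal samples with a standard deviation of 0.02.
customInit = @(sz) 0.02*randn(sz);

% Call the function directly to check the size of the array it returns.
sz = [16 16 3 768];     % example size of a weights array
W = customInit(sz);
disp(size(W))           % 16 16 3 768

% Pass the handle to the layer (requires Deep Learning Toolbox).
layer = patchEmbeddingLayer(16,768,WeightsInitializer=customInit);
```

Testing the handle on its own, as in the middle of the sketch, is a quick way to catch size mismatches before training.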

`BiasInitializer` — Function to initialize biases
`"zeros"` (default) | `"narrow-normal"` | `"ones"` | function handle

Function to initialize the biases, specified as one of these values:

"zeros" — Initialize the biases with zeros.
"ones" — Initialize the biases with ones.
"narrow-normal" — Initialize the biases by independently sampling from a normal distribution with a mean of zero and a standard deviation of 0.01.
Function handle — Initialize the biases with a custom function. If you specify a function handle, then the function must have the form bias = func(sz), where sz is the size of the biases.

The layer initializes the biases only when the Bias property is empty.

The PatchEmbeddingLayer object stores this property as a character vector or a function handle.

Data Types: char | string | function_handle

`Weights` — Learnable weights
`[]` (default) | numeric array

Learnable weights.

If PatchSize is a positive integer, then Weights is a PatchSize-by-...-by-PatchSize-by-InputSize-by-OutputSize numeric array or [], where the number of dimensions of size PatchSize is the number of spatial dimensions of the input.

If PatchSize is a vector, then Weights is a PatchSize(1)-by-...-by-PatchSize(K)-by-InputSize-by-OutputSize numeric array or [], where K is the number of spatial dimensions of the input.

The layer weights are learnable parameters. You can specify the initial value of the weights directly using the Weights property of the layer. When you train a network, if the Weights property of the layer is nonempty, then the trainnet (Deep Learning Toolbox) function uses the Weights property as the initial value. If the Weights property is empty, then the software uses the initializer specified by the WeightsInitializer property of the layer.

Data Types: single | double

`Bias` — Layer biases
`[]` (default) | column vector

Layer biases, specified as a numeric column vector of length OutputSize or [].

The layer biases are learnable parameters. When you train a neural network, if Bias is nonempty, then the trainnet (Deep Learning Toolbox) function uses the Bias property as the initial value. If Bias is empty, then the software uses the initializer specified by the BiasInitializer property.

Data Types: single | double
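As a worked example of the size rules for Weights and Bias, the following base MATLAB sketch computes the expected array sizes and the total number of learnable parameters for 16-by-16 patches of RGB images embedded into vectors of length 768. The variable names are illustrative and the values are example settings.

```matlab
patchSize = [16 16];   % PatchSize for two spatial dimensions
inputSize = 3;         % InputSize (number of input channels)
outputSize = 768;      % OutputSize

% Weights: PatchSize(1)-by-PatchSize(2)-by-InputSize-by-OutputSize.
weightsSize = [patchSize inputSize outputSize];

% Bias: column vector of length OutputSize.
biasSize = [outputSize 1];

numLearnables = prod(weightsSize) + prod(biasSize);

disp(weightsSize)     % 16 16 3 768
disp(numLearnables)   % 590592
```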

Learning Rate and Regularization

`WeightLearnRateFactor` — Learning rate factor for weights
`1` (default) | nonnegative scalar

Learning rate factor for the weights, specified as a nonnegative scalar.

The software multiplies this factor by the global learning rate to determine the learning rate for the weights in this layer. For example, if WeightLearnRateFactor is 2, then the learning rate for the weights in this layer is twice the current global learning rate. The software determines the global learning rate based on the settings you specify using the trainingOptions (Deep Learning Toolbox) function.

Data Types: double

`BiasLearnRateFactor` — Learning rate factor for biases
`1` (default) | nonnegative scalar

Learning rate factor for the biases, specified as a nonnegative scalar.

The software multiplies this factor by the global learning rate to determine the learning rate for the biases in this layer. For example, if BiasLearnRateFactor is 2, then the learning rate for the biases in the layer is twice the current global learning rate. The software determines the global learning rate based on the settings you specify using the trainingOptions (Deep Learning Toolbox) function.

The PatchEmbeddingLayer object stores this property as double type.

`WeightL2Factor` — L₂ regularization factor for weights
`1` (default) | nonnegative scalar

L₂ regularization factor for the weights, specified as a nonnegative scalar.

The software multiplies this factor by the global L₂ regularization factor to determine the L₂ regularization for the weights in this layer. For example, if WeightL2Factor is 2, then the L₂ regularization for the weights in this layer is twice the global L₂ regularization factor. You can specify the global L₂ regularization factor using the trainingOptions (Deep Learning Toolbox) function.

Data Types: double

`BiasL2Factor` — L₂ regularization factor for biases
`0` (default) | nonnegative scalar

L₂ regularization factor for the biases, specified as a nonnegative scalar.

The software multiplies this factor by the global L₂ regularization factor to determine the L₂ regularization for the biases in this layer. For example, if BiasL2Factor is 2, then the L₂ regularization for the biases in this layer is twice the global L₂ regularization factor. The software determines the global L₂ regularization factor based on the settings you specify using the trainingOptions (Deep Learning Toolbox) function.
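The scaling described for the four factors is a plain multiplication. The sketch below works through it with hypothetical global settings (a learning rate of 0.001 and an L₂ regularization factor of 0.0001); the variable names are illustrative and are not trainingOptions arguments.

```matlab
% Hypothetical global training settings.
globalLearnRate = 0.001;
globalL2 = 0.0001;

% Layer factors: default values except a doubled weight learning rate.
weightLearnRateFactor = 2;
biasLearnRateFactor = 1;
weightL2Factor = 1;
biasL2Factor = 0;

% Effective values are the product of each factor and the global setting.
weightLearnRate = weightLearnRateFactor*globalLearnRate;   % 0.002
biasLearnRate = biasLearnRateFactor*globalLearnRate;       % 0.001
weightL2 = weightL2Factor*globalL2;                        % 0.0001
biasL2 = biasL2Factor*globalL2;                            % 0, no L2 penalty on the biases
```

Setting both learning rate factors to 0 gives an effective learning rate of 0, which keeps the layer parameters fixed during training.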

Layer

`Name` — Layer name
`''` (default) | character vector

Layer name, specified as a character vector. For Layer array input, the trainnet (Deep Learning Toolbox) and dlnetwork (Deep Learning Toolbox) functions automatically assign names to unnamed layers.

Data Types: char

`NumInputs` — Number of inputs
Read-only: `1` (default)

This property is read-only.

Number of inputs to the layer, stored as 1. This layer accepts a single input only.

Data Types: double

`InputNames` — Input names
Read-only: `{'in'}` (default)

This property is read-only.

Input names, stored as {'in'}. This layer accepts a single input only.

Data Types: cell

`NumOutputs` — Number of outputs
Read-only: `1` (default)

This property is read-only.

Number of outputs from the layer, stored as 1. This layer has a single output only.

Data Types: double

`OutputNames` — Output names
Read-only: `{'out'}` (default)

This property is read-only.

Output names, stored as {'out'}. This layer has a single output only.

Data Types: cell

Examples


Create Patch Embedding Layer


Create a patch embedding layer that embeds patches of size 16 with an output size of 768.

patchSize = 16;
embeddingOutputSize = 768;
layer = patchEmbeddingLayer(patchSize,embeddingOutputSize)

layer = 
  PatchEmbeddingLayer with properties:

                     Name: ''
                PatchSize: 16
                InputSize: 'auto'
               OutputSize: 768
       SpatialFlattenMode: 'column-major'
       WeightsInitializer: 'glorot'
          BiasInitializer: 'zeros'
    WeightLearnRateFactor: 1
      BiasLearnRateFactor: 1
           WeightL2Factor: 1
             BiasL2Factor: 0

   Learnable Parameters
                  Weights: []
                     Bias: []

   State Parameters
    No properties.

  Show all properties

Create a dlnetwork object.

net = dlnetwork;

Specify layers of the network, including a patch embedding layer.

inputSize = [384 384 3];

maxPosition = (inputSize(1)/patchSize)^2 + 1;

numHeads = 4;
numKeyChannels = 4*embeddingOutputSize;

numClasses = 1000;

layers = [ 
    imageInputLayer(inputSize)
    patchEmbeddingLayer(patchSize,embeddingOutputSize,Name="patch-emb")
    embeddingConcatenationLayer(Name="emb-cat")
    positionEmbeddingLayer(embeddingOutputSize,maxPosition,Name="pos-emb");
    additionLayer(2,Name="add")
    selfAttentionLayer(numHeads,numKeyChannels,AttentionMask="causal")
    indexing1dLayer(Name="idx-first")
    fullyConnectedLayer(numClasses)
    softmaxLayer];
net = addLayers(net,layers);

Connect the embedding concatenation layer with the "in2" input of the addition layer.

net = connectLayers(net,"emb-cat","add/in2");

View the neural network architecture.

plot(net)


Algorithms


Patch Embedding Layer

A patch embedding layer maps patches of pixels to vectors. You can use this layer in vision transformer neural networks to encode information about patches in images.

The layer uses a convolution operation with the layer weights and biases to extract and project patches from the input. In particular, the layer:

Splits the input into non-overlapping patches.
Flattens the patches.
Projects the flattened patches to the output size.
Flattens the spatial dimensions of the projected output.
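The four steps can be sketched in base MATLAB with an explicit loop over patches. This is an illustration of the idea under stated assumptions, not the layer implementation: the sizes are small example values, all variable names are hypothetical, the weights are random, and the result is not guaranteed to match the layer output element for element.

```matlab
% Illustrative sizes; all names here are assumptions for this sketch.
H = 8; W = 8; C = 3;        % input height, width, and channels
P = 4;                      % patch size
D = 5;                      % output (embedding) size

X = rand(H,W,C);            % one input image
weights = rand(P,P,C,D);    % PatchSize-by-PatchSize-by-InputSize-by-OutputSize
bias = rand(D,1);

nRows = H/P;                % patches along the first spatial dimension
nCols = W/P;                % patches along the second spatial dimension

% Flatten the weights so that each column projects one flattened patch.
Wmat = reshape(weights,P*P*C,D);

Y = zeros(nRows,nCols,D);
for i = 1:nRows
    for j = 1:nCols
        % 1. Split: extract the non-overlapping patch at grid position (i,j).
        patch = X((i-1)*P+(1:P),(j-1)*P+(1:P),:);
        % 2. Flatten the patch to a row vector.
        v = reshape(patch,1,[]);
        % 3. Project the flattened patch to the output size.
        Y(i,j,:) = v*Wmat + bias.';
    end
end

% 4. Flatten the spatial dimensions (column-major order in this sketch).
Z = reshape(Y,nRows*nCols,D);
disp(size(Z))   % 4 5
```

Each row of Z is the embedding of one patch. Because the patches do not overlap, the same result can be computed with a single convolution whose stride equals the patch size, which is more efficient than the loop.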

Layer Input and Output Formats

Most layers in a layer array or layer graph pass data to subsequent layers as formatted dlarray (Deep Learning Toolbox) objects. The format of a dlarray object is a string of characters in which each character describes the corresponding dimension of the data. The format consists of one or more of these characters:

"S" — Spatial
"C" — Channel
"B" — Batch
"T" — Time
"U" — Unspecified

For example, you can describe 2-D image data that is represented as a 4-D array, where the first two dimensions correspond to the spatial dimensions of the images, the third dimension corresponds to the channels of the images, and the fourth dimension corresponds to the batch dimension, as having the format "SSCB" (spatial, spatial, channel, batch).
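As a short illustration of the "SSCB" format, the following sketch labels a 4-D array as a batch of images and queries its format. The array sizes are arbitrary example values.

```matlab
% Requires Deep Learning Toolbox.
% A batch of 8 RGB images of size 32-by-32, labeled "SSCB".
X = dlarray(rand(32,32,3,8,"single"),"SSCB");

disp(dims(X))   % SSCB
disp(size(X))   % 32 32 3 8
```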

You can interact with these dlarray objects in automatic differentiation workflows, such as those for:

developing a custom layer
using a functionLayer (Deep Learning Toolbox) object
using the forward (Deep Learning Toolbox) and predict (Deep Learning Toolbox) functions with dlnetwork objects

This table shows the supported input formats of PatchEmbeddingLayer objects and the corresponding output format. If the software passes the output of the layer to a custom layer that does not inherit from the nnet.layer.Formattable class, or to a FunctionLayer object with the Formattable property set to 0 (false), then the layer receives an unformatted dlarray object with dimensions ordered according to the formats in this table. The formats listed here are only a subset of the formats that the layer supports. The layer might support additional formats, such as formats with additional "S" (spatial) or "U" (unspecified) dimensions.

Input Format	Output Format
`"SCB"` (spatial, channel, batch)	`"SCB"` (spatial, channel, batch)
`"SSCB"` (spatial, spatial, channel, batch)	`"SCB"` (spatial, channel, batch)
`"SSSCB"` (spatial, spatial, spatial, channel, batch)	`"SCB"` (spatial, channel, batch)
`"SCBT"` (spatial, channel, batch, time)	`"SCBT"` (spatial, channel, batch, time)
`"SSCBT"` (spatial, spatial, channel, batch, time)	`"SCBT"` (spatial, channel, batch, time)
`"SSSCBT"` (spatial, spatial, spatial, channel, batch, time)	`"SCBT"` (spatial, channel, batch, time)
`"SC"` (spatial, channel)	`"SC"` (spatial, channel)
`"SSC"` (spatial, spatial, channel)	`"SC"` (spatial, channel)
`"SSSC"` (spatial, spatial, spatial, channel)	`"SC"` (spatial, channel)
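To see one row of the table in practice, the sketch below passes "SSCB" data through a small dlnetwork that contains a patch embedding layer. It assumes that the spatial dimension of the output holds one entry per patch, so a 384-by-384 image with a patch size of 16 should give (384/16)^2 = 576 entries; the input normalization is turned off to keep the example minimal.

```matlab
% Requires Deep Learning Toolbox.
patchSize = 16;
outputSize = 768;

layers = [
    imageInputLayer([384 384 3],Normalization="none")
    patchEmbeddingLayer(patchSize,outputSize)];
net = dlnetwork(layers);

% Two images in "SSCB" format.
X = dlarray(rand(384,384,3,2,"single"),"SSCB");
Y = predict(net,X);

% Inspect the output format and size.
disp(dims(Y))
disp(size(Y))
```

According to the table, the output format for this input is "SCB".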

In dlnetwork objects, PatchEmbeddingLayer objects also support these input and output format combinations.

Input Format	Output Format
`"SCT"` (spatial, channel, time)	`"SCT"` (spatial, channel, time)
`"SSCT"` (spatial, spatial, channel, time)	`"SCT"` (spatial, channel, time)
`"SSSCT"` (spatial, spatial, spatial, channel, time)	`"SCT"` (spatial, channel, time)

References

[1] Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani et al. "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale." Preprint, submitted June 3, 2021. https://doi.org/10.48550/arXiv.2010.11929.

[2] Glorot, Xavier, and Yoshua Bengio. "Understanding the Difficulty of Training Deep Feedforward Neural Networks." In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256. Sardinia, Italy: AISTATS, 2010. https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf

[3] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." In 2015 IEEE International Conference on Computer Vision (ICCV), 1026–34. Santiago, Chile: IEEE, 2015. https://doi.org/10.1109/ICCV.2015.123

Extended Capabilities


C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

Usage notes and limitations:

Code generation supports only 1-D and 2-D spatial data. Data with three or more spatial dimensions, such as formats that contain "SSS" or "SSSS", is not supported.
You can generate generic C/C++ code that does not depend on third-party libraries and deploy the generated code to hardware platforms.

GPU Code Generation
Generate CUDA® code for NVIDIA® GPUs using GPU Coder™.

Usage notes and limitations:

Code generation supports only 1-D and 2-D spatial data. Data with three or more spatial dimensions, such as formats that contain "SSS" or "SSSS", is not supported.
You can generate CUDA code that is independent of deep learning libraries and deploy the generated code to platforms that use NVIDIA® GPU processors.

Version History

Introduced in R2023b


R2024a: Specify spatial flattening mode of patch embedding layers

Specify the mode for flattening output of the convolution operation using the SpatialFlattenMode option. Set this option when creating or importing models that require this representation.
