attentionLayer

Dot-product attention layer

Since R2024a

expand all in page

Description

A dot-product attention layer focuses on parts of the input using weighted multiplication operations.

Creation

Syntax

layer = attentionLayer(numHeads)

layer = attentionLayer(numHeads,Name=Value)

Description

layer = attentionLayer(numHeads) creates a dot-product attention layer and sets the NumHeads property.

example

layer = attentionLayer(numHeads,Name=Value) also sets the Scale, HasPaddingMaskInput, HasScoresOutput, AttentionMask, DropoutProbability, and Name properties using one or more name-value arguments.

Properties

expand all

Attention

`NumHeads` — Number of heads
positive integer

Number of heads, specified as a positive integer.

Each head performs a separate linear transformation of the input and computes attention weights independently. The layer uses these attention weights to compute a weighted sum of the input representations, generating a context vector. Increasing the number of heads lets the model capture different types of dependencies and attend to different parts of the input simultaneously. Reducing the number of heads can lower the computational cost of the layer.

The value of NumHeads must evenly divide the size of the channel dimension of the input queries, keys, and values.

`Scale` — Multiplicative factor for scaling dot product of queries and keys
`"auto"` | numeric scalar

Multiplicative factor for scaling dot product of queries and keys, specified as one of these values:

"auto" — Multiply the dot product by 1/sqrt(D), where D is the number of channels of the keys divided by NumHeads.
Numeric scalar — Multiply the dot product by the specified scalar.

`HasPaddingMaskInput` — Flag indicating whether layer has mask input
`0` (`false`) (default) | `1` (`true`)

Flag indicating whether the layer has an input that represents the padding mask, specified as 0 (false) or 1 (true).

If the HasPaddingMaskInput property is 0 (false), then the layer has three inputs with the names "query", "key", and "value", which correspond to the input queries, keys, and values, respectively. In this case, the layer treats all elements as data.

If the HasPaddingMaskInput property is 1 (true), then the layer has an additional input with the name "mask", which corresponds to the padding mask. In this case, the padding mask is an array of ones and zeros. The layer uses or ignores elements of the queries, keys, and values when the corresponding element in the mask is one or zero, respectively.

The format of the padding mask must match that of the input keys. The size of the "S" (spatial), "T" (time), and "B" (batch) dimensions of the padding mask must match the size of the corresponding dimensions in the keys and values.

The padding mask can have any number of channels. The software uses only the values in the first channel to indicate padding values.

`HasScoresOutput` — Flag indicating whether layer has scores output
`0` (`false`) (default) | `1` (`true`)

Flag indicating whether the layer has an output that represents the scores (also known as the attention weights), specified as 0 (false) or 1 (true).

If the HasScoresOutput property is 0 (false), then the layer has one output with the name "out", which corresponds to the output data.

If the HasScoresOutput property is 1 (true), then the layer has two inputs with the names "out" and "scores", which correspond to the output data and the attention scores, respectively.

`AttentionMask` — Attention mask
`"none"` (default) | `"causal"` | numeric array | logical array

Attention mask indicating which elements to include when applying the attention operation, specified as one of these values:

"none" — Do not prevent attention to elements with respect to their positions. If AttentionMask is "none", then the software prevents attention using only the padding mask.
"causal" — Prevent elements in position m in the "S" (spatial) or "T" (time) dimension of the input queries from providing attention to the elements in positions n, where n is greater than m in the corresponding dimension of the input keys and values. Use this option for auto-regressive models.
Logical or numeric array — Prevent attention to elements of the input keys and values when the corresponding element in the specified array is 0. The specified array must be an N_k-by-N_q matrix or a N_k-by-N_q-by-numObservations array, N_k is the size of the "S" (spatial) or "T" (time) dimension of the input keys, N_q is the size of the corresponding dimension of the input queries, and numObservations is the size of the "B" dimension in the input queries.

`DropoutProbability` — Dropout probability for attention scores
`0` (default) | scalar in the range [0, 1)

Probability of dropping out attention scores, specified as a scalar in the range [0, 1).

During training, the software randomly sets values in the attention scores to zero using the specified probability. These dropouts can encourage the model to learn more robust and generalizable representations by preventing it from relying too heavily on specific dependencies.

Layer

`Name` — Layer name
`''` (default) | character vector | string scalar

Layer name, specified as a character vector or a string scalar. For Layer array input, the trainnet and dlnetwork functions automatically assign names to unnamed layers.

The AttentionLayer object stores this property as a character vector.

Data Types: char | string

`NumInputs` — Number of inputs
`3` (default) | `4`

Number of inputs to the layer, returned as 3 or 4.

The padding mask can have any number of channels. The software uses only the values in the first channel to indicate padding values.

Data Types: double

`InputNames` — Input names
`["query" "key" "value"]` (default) | `["query" "key" "value" "mask"]`

Input names of the layer, returned as a cell array of character vectors.

The padding mask can have any number of channels. The software uses only the values in the first channel to indicate padding values.

The AttentionLayer object stores this property as a cell array of character vectors.

`NumOutputs` — Number of outputs
Read-only: `1` (default) | `2`

This property is read-only.

Number of outputs of the layer.

If the HasScoresOutput property is 0 (false), then the layer has one output with the name "out", which corresponds to the output data.

If the HasScoresOutput property is 1 (true), then the layer has two inputs with the names "out" and "scores", which correspond to the output data and the attention scores, respectively.

Data Types: double

`OutputNames` — Output names
Read-only: `"out"` (default) | `["out" "scores"]`

This property is read-only.

Output names of the layer.

If the HasScoresOutput property is 0 (false), then the layer has one output with the name "out", which corresponds to the output data.

If the HasScoresOutput property is 1 (true), then the layer has two inputs with the names "out" and "scores", which correspond to the output data and the attention scores, respectively.

The AttentionLayer object stores this property as a cell array of character vectors.

Examples

collapse all

Create Attention Layer

Open Live Script

Create a dot-product attention layer with 10 heads.

layer = attentionLayer(10)

layer = 
  AttentionLayer with properties:

                   Name: ''
              NumInputs: 3
             InputNames: {'query'  'key'  'value'}
               NumHeads: 10
                  Scale: 'auto'
          AttentionMask: 'none'
     DropoutProbability: 0
    HasPaddingMaskInput: 0
        HasScoresOutput: 0

   Learnable Parameters
    No properties.

   State Parameters
    No properties.

  Show all properties

Create Cross-Attention Neural Network

Open Live Script

Create a simple neural network with cross-attention.

numChannels = 256;
numHeads = 8;

net = dlnetwork;

layers = [
    sequenceInputLayer(1,Name="query")
    fullyConnectedLayer(numChannels)
    attentionLayer(numHeads,Name="attention")
    fullyConnectedLayer(numChannels,Name="fc-out")];

net = addLayers(net,layers);

layers = [
    sequenceInputLayer(1, Name="key-value")
    fullyConnectedLayer(numChannels,Name="fc-key")];

net = addLayers(net,layers);
net = connectLayers(net,"fc-key","attention/key");

net = addLayers(net, fullyConnectedLayer(numChannels,Name="fc-value"));
net = connectLayers(net,"key-value","fc-value");
net = connectLayers(net,"fc-value","attention/value");

View the network in a plot.

figure
plot(net)

Figure contains an axes object. The axes object contains an object of type graphplot.

Algorithms

expand all

Dot-Product Attention

The attention operation focuses on parts of the input using weighted multiplication operations.

The single-head dot-product attention operation is given by

$attention (Q, K, V) = dropout (softmax (mask (λ Q K^{⊤}, M)), p) V,$

where:

Q denotes the queries.
K denotes the keys.
V denotes the values.
$λ$ denotes the scaling factor.
M is a mask array of ones and zeros.
p is the dropout probability.

The mask operation includes or excludes the values of the matrix multiplication by setting values of the input to $- \infty$ for zero-valued mask elements. The mask is the union of the padding and attention masks. The softmax function normalizes the value of the input data across the channel dimension such that it sums to one. The dropout operation sets elements to zero with probability p.

Multihead Dot-Product Attention

The multihead dot-product attention operation is given by

$multiheadAttention (Q, K, V) = concatenate ({head}_{1}, \dots, {head}_{h}),$

where:

h is the number of heads.

Each ${head}_{i}$ denotes the output of the head operation given by

${head}_{i} = attention (Q_{i}, K_{i}, V_{i}) .$

Layer Input and Output Formats

Layers in a layer array or layer graph pass data to subsequent layers as formatted dlarray objects. The format of a dlarray object is a string of characters in which each character describes the corresponding dimension of the data. The format consists of one or more of these characters:

"S" — Spatial
"C" — Channel
"B" — Batch
"T" — Time
"U" — Unspecified

For example, you can describe 2-D image data that is represented as a 4-D array, where the first two dimensions correspond to the spatial dimensions of the images, the third dimension corresponds to the channels of the images, and the fourth dimension corresponds to the batch dimension, as having the format "SSCB" (spatial, spatial, channel, batch).

You can interact with these dlarray objects in automatic differentiation workflows, such as those for developing a custom layer, using a functionLayer object, or using the forward and predict functions with dlnetwork objects.

This table shows the supported input formats of AttentionLayer objects and the corresponding output format. If the software passes the output of the layer to a custom layer that does not inherit from the nnet.layer.Formattable class, or a FunctionLayer object with the Formattable property set to 0 (false), then the layer receives an unformatted dlarray object with dimensions ordered according to the formats in this table. The formats listed here are only a subset. The layer may support additional formats such as formats with additional "S" (spatial) or "U" (unspecified) dimensions.

Query, Key, and Value Format	Output Format	Scores Output Format (When `HasScoresOutput` is `1` (`true`)
`"CB"` (channel, batch)	`"CB"` (channel, batch)	`"UUUU"` (unspecified, unspecified, unspecified, unspecified)
`"SCB"` (spatial, channel, batch)	`"SCB"` (spatial, channel, batch)	`"UUUU"` (unspecified, unspecified, unspecified, unspecified)
`"CBT"` (channel, batch, time)	`"CBT"` (channel, batch, time)	`"UUUU"` (unspecified, unspecified, unspecified, unspecified)
`"SC"` (spatial, channel)	`"SC"` (spatial, channel)	`"UUU"` (unspecified, unspecified, unspecified)
`"CT"` (channel, time)	`"CT"` (channel, time)	`"UUU"` (unspecified, unspecified, unspecified)
`"BT"` (batch, time)	`"CBT"` (channel, batch, time)	`"UUUU"` (unspecified, unspecified, unspecified, unspecified)
`"SB"` (spatial, batch)	`"SCB"` (spatial, channel, batch)	`"UUUU"` (unspecified, unspecified, unspecified, unspecified)

If HasMaskInput is 1 (true), then the mask must have the same format as the queries, keys, and values.

References

[1] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc., 2017. https://papers.nips.cc/paper/7181-attention-is-all-you-need.

Extended Capabilities

expand all

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

Usage notes and limitations:

Code generation is not supported when HasScoresOutput is set to true.
Code generation does not support passing dlarray objects with unspecified (U) dimensions to this layer.

GPU Code Generation
Generate CUDA® code for NVIDIA® GPUs using GPU Coder™.

Refer to the usage notes and limitations in the C/C++ Code Generation section. The same limitations apply to GPU code generation.

Version History

Introduced in R2024a

attentionLayer

Description

Creation

Syntax

Description

Properties

Attention

`NumHeads` — Number of heads
positive integer

`Scale` — Multiplicative factor for scaling dot product of queries and keys
`"auto"` | numeric scalar

`HasPaddingMaskInput` — Flag indicating whether layer has mask input
`0` (`false`) (default) | `1` (`true`)

`HasScoresOutput` — Flag indicating whether layer has scores output
`0` (`false`) (default) | `1` (`true`)

`AttentionMask` — Attention mask
`"none"` (default) | `"causal"` | numeric array | logical array

`DropoutProbability` — Dropout probability for attention scores
`0` (default) | scalar in the range [0, 1)

Layer

`Name` — Layer name
`''` (default) | character vector | string scalar

`NumInputs` — Number of inputs
`3` (default) | `4`

`InputNames` — Input names
`["query" "key" "value"]` (default) | `["query" "key" "value" "mask"]`

`NumOutputs` — Number of outputs
Read-only: `1` (default) | `2`

`OutputNames` — Output names
Read-only: `"out"` (default) | `["out" "scores"]`

Examples

Create Attention Layer

Create Cross-Attention Neural Network

Algorithms

Dot-Product Attention

Multihead Dot-Product Attention

Layer Input and Output Formats

References

Extended Capabilities

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

GPU Code Generation
Generate CUDA® code for NVIDIA® GPUs using GPU Coder™.

Version History

See Also

Topics

attentionLayer

Description

Creation

Syntax

Description

Properties

Attention

NumHeads — Number of heads positive integer

Scale — Multiplicative factor for scaling dot product of queries and keys "auto" | numeric scalar

HasPaddingMaskInput — Flag indicating whether layer has mask input 0 (false) (default) | 1 (true)

HasScoresOutput — Flag indicating whether layer has scores output 0 (false) (default) | 1 (true)

AttentionMask — Attention mask "none" (default) | "causal" | numeric array | logical array

DropoutProbability — Dropout probability for attention scores 0 (default) | scalar in the range [0, 1)

Layer

Name — Layer name '' (default) | character vector | string scalar

NumInputs — Number of inputs 3 (default) | 4

InputNames — Input names ["query" "key" "value"] (default) | ["query" "key" "value" "mask"]

NumOutputs — Number of outputs Read-only: 1 (default) | 2

OutputNames — Output names Read-only: "out" (default) | ["out" "scores"]

Examples

Create Attention Layer

Create Cross-Attention Neural Network

Algorithms

Dot-Product Attention

Multihead Dot-Product Attention

Layer Input and Output Formats

References

Extended Capabilities

C/C++ Code Generation Generate C and C++ code using MATLAB® Coder™.

GPU Code Generation Generate CUDA® code for NVIDIA® GPUs using GPU Coder™.

Version History

See Also

Topics

`NumHeads` — Number of heads
positive integer

`Scale` — Multiplicative factor for scaling dot product of queries and keys
`"auto"` | numeric scalar

`HasPaddingMaskInput` — Flag indicating whether layer has mask input
`0` (`false`) (default) | `1` (`true`)

`HasScoresOutput` — Flag indicating whether layer has scores output
`0` (`false`) (default) | `1` (`true`)

`AttentionMask` — Attention mask
`"none"` (default) | `"causal"` | numeric array | logical array

`DropoutProbability` — Dropout probability for attention scores
`0` (default) | scalar in the range [0, 1)

`Name` — Layer name
`''` (default) | character vector | string scalar

`NumInputs` — Number of inputs
`3` (default) | `4`

`InputNames` — Input names
`["query" "key" "value"]` (default) | `["query" "key" "value" "mask"]`

`NumOutputs` — Number of outputs
Read-only: `1` (default) | `2`

`OutputNames` — Output names
Read-only: `"out"` (default) | `["out" "scores"]`

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

GPU Code Generation
Generate CUDA® code for NVIDIA® GPUs using GPU Coder™.