VGGish Embeddings

Extract VGGish embeddings

Since R2022a

Libraries:
Audio Toolbox / Deep Learning

Description

The VGGish Embeddings block uses the VGGish network to extract feature embeddings from audio segments. The block combines the necessary audio preprocessing with VGGish network inference and returns feature embeddings that are a compact representation of the audio data.

Examples

Use VGGish Embeddings for Deep Learning in Simulink

Use a neural network in Simulink® to classify audio signals from their VGGish feature embeddings.

Compare VGGish Embeddings Block with Equivalent VGGish Blocks

Show that the VGGish Embeddings block is equivalent to the cascade of the VGGish Preprocess block and the VGGish block.

Ports

Input

Port_1 — Sound data
column vector

Sound data, specified as a one-channel signal (column vector). If Sample rate of input signal (Hz) is 16e3, the input frame length is unrestricted. If Sample rate of input signal (Hz) is any other value, the block resamples the input to 16 kHz, and the input frame length must be a multiple of the decimation factor of that resampling operation. If the input frame length does not satisfy this condition, the block issues an error that reports the decimation factor.

Data Types: single | double
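
The frame-length rule above can be checked before running a model. The following Python sketch is illustrative only: it assumes an integer sample rate and assumes the decimation factor is the denominator of the exactly reduced ratio 16000/fs. The block may derive its resampling ratio differently (for example, with a tolerance-based rational approximation), so the factor reported in the block's error message is authoritative. The function names are hypothetical.

```python
from math import gcd

TARGET_RATE = 16000  # Hz, the rate the VGGish preprocessing resamples to


def decimation_factor(input_rate: int) -> int:
    """Return q in the reduced ratio p/q = TARGET_RATE / input_rate.

    Assumption: the ratio is reduced exactly; the block may use a
    different rational approximation.
    """
    return input_rate // gcd(input_rate, TARGET_RATE)


def is_valid_frame_length(frame_length: int, input_rate: int) -> bool:
    """Check whether a frame length satisfies the multiple-of-q rule."""
    if input_rate == TARGET_RATE:
        return True  # no resampling, so no restriction on frame length
    return frame_length % decimation_factor(input_rate) == 0
```

Under these assumptions, a 48 kHz input has a decimation factor of 3, so a frame length of 1536 is accepted and 1024 is not. A 44.1 kHz input has a decimation factor of 441.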

Output

Port_1 — Embeddings
row vector of length 128

VGGish feature embeddings, returned as a row vector of length 128. The feature embeddings are a compact representation of audio data.

Data Types: single

Parameters

Sample rate of input signal (Hz) — Sample rate of input signal in Hz
`16e3` (default) | positive scalar

Sample rate of the input signal in Hz, specified as a positive scalar.

Overlap percentage (%) — Overlap percentage between consecutive mel spectrograms
`50` (default) | [0 100)

Specify the overlap percentage between consecutive mel spectrograms as a scalar in the range [0 100).
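
To see what the overlap setting means in practice, the sketch below converts an overlap percentage into the number of new spectra between consecutive 96-frame mel spectrograms. It is a rough illustration: the rounding rule and the minimum hop of one frame are assumptions, not documented block behavior, and the function name is hypothetical.

```python
FRAMES_PER_SPECTROGRAM = 96  # spectra in each mel spectrogram
STFT_HOP_SECONDS = 0.010     # 10 ms hop between spectra


def spectrogram_hop_frames(overlap_percentage: float) -> int:
    """Number of new spectra between consecutive mel spectrograms.

    Assumption: the hop is rounded to the nearest frame and is at least 1.
    """
    if not 0 <= overlap_percentage < 100:
        raise ValueError("overlap_percentage must be in the range [0, 100)")
    hop = round(FRAMES_PER_SPECTROGRAM * (1 - overlap_percentage / 100))
    return max(hop, 1)


# With the default 50% overlap, consecutive spectrograms start 48 spectra
# (about 0.48 s of audio) apart.
default_hop = spectrogram_hop_frames(50)
default_hop_seconds = default_hop * STFT_HOP_SECONDS
```

A higher overlap produces embeddings more often for the same audio, at the cost of more network inferences per second.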

Block Characteristics

Data Types	`double` | `single`
Direct Feedthrough	`no`
Multidimensional Signals	`no`
Variable-Size Signals	`no`
Zero-Crossing Detection	`no`

Algorithms

Preprocessing Steps

The VGGish Embeddings block converts the audio data into the format required by the VGGish network using the following steps.

1. Cast the audio data to single precision and resample it to 16 kHz.
2. Compute the one-sided short-time Fourier transform (STFT) using a 25 ms periodic Hann window (400 samples) with a 10 ms hop (160 samples) and a 512-point DFT.
3. Convert the complex spectral values to magnitude and discard the phase information.
4. Pass the one-sided magnitude STFTs through a 64-band mel-spaced filter bank. Doing so converts the 257-length STFT vectors to 64-length vectors on the mel scale.
5. Convert the 64-length vectors to a log scale.
6. Buffer the vectors into outputs of size 96-by-64, where 96 is the number of spectra in the mel spectrogram and 64 is the number of mel bands. The overlap between consecutive 96-by-64 mel spectrograms is determined by the value of the Overlap percentage (%) parameter.
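
The sizes quoted in the steps above follow from simple arithmetic, which the Python sketch below reproduces. It is a worked check rather than an implementation of the block: the frame-count formula assumes the STFT uses no padding and drops any trailing partial window, which the steps do not state, and the function name is hypothetical.

```python
SAMPLE_RATE = 16000          # Hz, after resampling
WINDOW_MS = 25               # Hann window duration
HOP_MS = 10                  # hop between consecutive windows
FFT_LENGTH = 512             # DFT points
FRAMES_PER_SPECTROGRAM = 96  # spectra per mel spectrogram

window_samples = SAMPLE_RATE * WINDOW_MS // 1000  # 400 samples
hop_samples = SAMPLE_RATE * HOP_MS // 1000        # 160 samples
one_sided_bins = FFT_LENGTH // 2 + 1              # 257 frequency bins

# Audio spanned by one 96-by-64 mel spectrogram: the first window plus
# 95 further hops.
samples_per_spectrogram = (
    window_samples + (FRAMES_PER_SPECTROGRAM - 1) * hop_samples
)
seconds_per_spectrogram = samples_per_spectrogram / SAMPLE_RATE


def stft_frame_count(num_samples: int) -> int:
    """Number of full STFT windows in num_samples of 16 kHz audio.

    Assumption: no padding; a trailing partial window is dropped.
    """
    if num_samples < window_samples:
        return 0
    return (num_samples - window_samples) // hop_samples + 1
```

Under these assumptions, each embedding summarizes 15,600 samples, or 0.975 s, of 16 kHz audio.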

Extended Capabilities

C/C++ Code Generation
Generate C and C++ code using Simulink® Coder™.

Usage notes and limitations:

To generate generic C code that does not depend on third-party libraries, in the Configuration Parameters > Code Generation general category, set the Language parameter to C.
To generate C++ code, in the Configuration Parameters > Code Generation general category, set the Language parameter to C++. To specify the target library for code generation, in the Code Generation > Interface category, set the Target Library parameter. Setting this parameter to None generates generic C++ code that does not depend on third-party libraries.
For a list of networks and layers supported for code generation, see Networks and Layers Supported for Code Generation (MATLAB Coder).

Version History

Introduced in R2022a
