Chapter 4
Creating Inputs for Deep Networks
Learning directly from raw data is called end-to-end learning. Modern deep learning systems often use end-to-end learning for image and computer vision problems. For signal data, however, end-to-end learning is rarely used in practice.
For almost all practical signal processing applications, you first apply a feature extraction technique to reduce the dimensionality and size of the data.
Here is what these steps look like for two deep learning networks: long short-term memory (LSTM) networks and convolutional neural networks (CNNs). Note the difference in the input preprocessing that has to take place based on the method you choose.
Long Short-Term Memory (LSTM) Networks
Convolutional Neural Networks (CNNs)
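In practice, that difference shows up in the shape of the input each network consumes: an LSTM takes a sequence of feature vectors, while a CNN takes a fixed-size time-frequency image. A rough sketch, with illustrative sizes only:

% Illustrative input shapes; the sizes are placeholders, not requirements
lstmInput = randn(39,120);       % LSTM input: [numFeatures x numTimeSteps] sequence
cnnInput  = randn(128,98,1);     % CNN input: [height x width x channels] spectrogram image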
Different Applications Require Different Transformation Techniques
When extracting features from signals, it is often good practice to organize signals in buffers, or frames, which may overlap to get better time resolution.
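For example, here is a minimal sketch of framing a signal into overlapping buffers with the buffer function (Signal Processing Toolbox assumed; the sample rate, frame length, and overlap are placeholder values):

% Split a signal into overlapping frames, one frame per column
fs = 16e3;                       % sample rate in Hz (example value)
x = randn(fs,1);                 % one second of placeholder signal
winLength = 512;                 % samples per frame
ovlpLength = 384;                % samples of overlap between consecutive frames
frames = buffer(x,winLength,ovlpLength,'nodelay');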
A large group of feature extraction techniques, also known as time-frequency transformations, are based on transforming each of those buffers in the frequency domain via a fast Fourier transform (FFT) or similar operation.
The simplest types of output that you get this way would be called short-time Fourier transform (STFT) and spectrograms. You find these used very often with network types that were originally designed to work with images since these are 2D signal representations.
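As a rough sketch, a basic magnitude spectrogram can be computed with the stft function, reusing the example signal and framing parameters above (Signal Processing Toolbox assumed):

% Short-time Fourier transform of the signal, returned as a time-frequency matrix
[s,f,t] = stft(x,fs,'Window',hann(winLength,'periodic'), ...
    'OverlapLength',ovlpLength,'FFTLength',winLength);
spectro = abs(s);                % 2-D array suitable as an image-like network input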
Some common output types:
- Basic spectrogram: Easy to understand and implement
- Perceptually spaced (e.g., Mel, Bark) spectrogram: Compact for speech and audio applications
- Wavelet scalogram: Good time resolution; useful for nonperiodic signals (see the sketch after this list)
- Constant Q transform: Good resolution at low frequencies
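As an illustration of the scalogram entry above, here is a minimal sketch using the cwt function on the example signal defined earlier (Wavelet Toolbox assumed; this is not the workflow used in the keyword-spotting example later in this chapter):

% Continuous wavelet transform; the scalogram is the magnitude of the coefficients
[cfs,frq] = cwt(x,fs);
scalogram = abs(cfs);            % rows correspond to frequencies, columns to time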
When working with speech or other audible signals, you tend to see more advanced types of spectrograms in which frequency is scaled to resemble the way people perceive frequencies. In many cases, time-frequency characteristics of signals are much more distinctive than their time waveforms. Even to the human eye, it is easier to identify features in the time-frequency domain.
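For instance, here is a minimal sketch of computing a mel-scaled spectrogram with melSpectrogram, reusing the example signal and framing parameters from the earlier sketch (Audio Toolbox assumed; the band count is a placeholder choice):

% Perceptually spaced (mel) spectrogram for speech and audio work
[melSpec,cf] = melSpectrogram(x,fs, ...
    'Window',hann(winLength,'periodic'), ...
    'OverlapLength',ovlpLength, ...
    'NumBands',40);              % 40 mel bands is a common choice for speech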
To Recap
Once you have extracted features from a labeled data set, you end up with a collection of feature arrays, each associated with its original label. Those feature-label pairs are what you use to train your network.
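As a minimal sketch of that last step, assuming Deep Learning Toolbox and substituting random placeholder data for real feature arrays:

% Pair feature sequences with labels and train a small LSTM classifier
numFeatures = 39;                                          % e.g., 13 MFCCs plus delta and delta-delta
XTrain = {randn(numFeatures,120); randn(numFeatures,95)};  % one [features x frames] sequence per cell
YTrain = categorical(["keyword";"background"]);            % one label per sequence
layers = [ ...
    sequenceInputLayer(numFeatures)
    lstmLayer(64,'OutputMode','last')
    fullyConnectedLayer(2)
    softmaxLayer
    classificationLayer];
options = trainingOptions('adam','MaxEpochs',10,'Verbose',false);
net = trainNetwork(XTrain,YTrain,layers,options);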
Data augmentation lets you compute features from more data than is present in your original dataset. Because feature extraction over large datasets is computationally demanding, you may need acceleration hardware, such as multicore machines or GPUs, to compute features.
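For audio signals, one way to do this is with audioDataAugmenter, sketched below under the assumption that Audio Toolbox is available (the probabilities and SNR range are placeholder values):

% Generate augmented copies of the example signal by pitch shifting and adding noise
augmenter = audioDataAugmenter('NumAugmentations',5, ...
    'PitchShiftProbability',0.5, ...
    'AddNoiseProbability',0.5,'SNRRange',[10 20]);
augmented = augment(augmenter,x,fs);   % table with one augmented signal per row
xAug = augmented.Audio{1};             % first augmented copy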
Many MATLAB feature extraction functions support running on a GPU, and the speed gains grow with the amount of data the functions process.
To learn more about processing signals on GPUs, see the gpuArray documentation in MATLAB.
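Here is a rough sketch of GPU-based feature extraction, reusing the earlier example signal and framing parameters; it assumes Parallel Computing Toolbox, a supported GPU, and a release in which stft accepts gpuArray input:

% Move the signal to GPU memory, extract features there, then gather the result
xg = gpuArray(single(x));              % transfer the signal to the GPU
[sg,fg,tg] = stft(xg,fs,'Window',hann(winLength,'periodic'), ...
    'OverlapLength',ovlpLength,'FFTLength',winLength);
spectroGPU = gather(abs(sg));          % bring the magnitude spectrogram back to the CPU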
Once your network is trained, try it out.
The trained network is now used with the classify function to predict a trigger word mask from a set of features, together with some code that detects mask transitions and generates a chime sound at the right time as output.
Extract Features
% Extract MFCC from whole analysis buffer
[coeffs,delta,deltaDelta] = mfcc(buf,SampleRate, ...
    'WindowLength',winLength, ...
    'OverlapLength',ovlpLength);

% Concatenate and normalize features
featureMatrix = [coeffs,delta,deltaDelta];
featureMatrix = (featureMatrix - M)./S;
Inference
% Detect keyword with LSTM network (mark frames around the spoken keyword)
featMask = classify(net,featureMatrix.');
Trigger
% Debounce and re-align detection in time domain
[timeMask,chimePosition] = debounceAnalyzeDetectionMask(featMask);

% Generate chimes for detection events
chime = generateChimeAtSample(chimePosition,...
Test Your Knowledge
Which type of time-frequency transformation provides particularly good resolution for low-frequency signals?
Answer: The constant Q transform. It provides good resolution at low frequencies.
What do you need to develop an AI-powered signal processing application?
A simple, proven deep learning model, as well as a lot of data, some domain expertise, and the right tools for the specific application at hand. It’s important to remember that deep learning systems can only be as good as the data used to train them.