Chapter 4
Creating Inputs for Deep Networks
Learning directly from raw data is called end-to-end learning. Modern deep learning systems often use end-to-end learning for image and computer vision problems. For signal data, however, end-to-end learning is rarely used in practice.
For almost all practical signal processing applications, you first apply a feature extraction technique to reduce the dimensionality and size of the data.
Here is what these steps look like for two deep learning networks: long short-term memory (LSTM) networks and convolutional neural networks (CNNs). Note the difference in the input preprocessing that has to take place based on the method you choose.
Long Short-Term Memory (LSTM) Networks
Convolutional Neural Networks (CNNs)
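In practice, that difference shows up in the shape of the input each network consumes: an LSTM takes a sequence of feature vectors, while a CNN takes a fixed-size time-frequency image. A rough sketch, with illustrative sizes only:

% Illustrative input shapes; the sizes are placeholders, not requirements
lstmInput = randn(39,120);       % LSTM input: [numFeatures x numTimeSteps] sequence
cnnInput  = randn(128,98,1);     % CNN input: [height x width x channels] spectrogram image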
Different Applications Require Different Transformation Techniques
When extracting features from signals, it is often good practice to organize signals in buffers, or frames, which may overlap to get better time resolution.
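For example, here is a minimal sketch of framing a signal into overlapping buffers with the buffer function (Signal Processing Toolbox assumed; the sample rate, frame length, and overlap are placeholder values):

% Split a signal into overlapping frames, one frame per column
fs = 16e3;                       % sample rate in Hz (example value)
x = randn(fs,1);                 % one second of placeholder signal
winLength = 512;                 % samples per frame
ovlpLength = 384;                % samples of overlap between consecutive frames
frames = buffer(x,winLength,ovlpLength,'nodelay');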
A large group of feature extraction techniques, also known as time-frequency transformations, are based on transforming each of those buffers in the frequency domain via a fast Fourier transform (FFT) or similar operation.
The simplest types of output that you get this way would be called short-time Fourier transform (STFT) and spectrograms. You find these used very often with network types that were originally designed to work with images since these are 2D signal representations.
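As a rough sketch, a basic magnitude spectrogram can be computed with the stft function, reusing the example signal and framing parameters above (Signal Processing Toolbox assumed):

% Short-time Fourier transform of the signal, returned as a time-frequency matrix
[s,f,t] = stft(x,fs,'Window',hann(winLength,'periodic'), ...
    'OverlapLength',ovlpLength,'FFTLength',winLength);
spectro = abs(s);                % 2-D array suitable as an image-like network input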
Some common output types:
- Basic spectrogram: Easy to understand and implement
- Perceptually spaced (e.g., Mel, Bark) spectrogram: Compact for speech and audio applications
- Wavelet scalogram: Good time resolution; useful for nonperiodic signals (see the sketch after this list)
- Constant Q transform: Good resolution at low frequencies
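As an illustration of the scalogram entry above, here is a minimal sketch using the cwt function on the example signal defined earlier (Wavelet Toolbox assumed; this is not the workflow used in the keyword-spotting example later in this chapter):

% Continuous wavelet transform; the scalogram is the magnitude of the coefficients
[cfs,frq] = cwt(x,fs);
scalogram = abs(cfs);            % rows correspond to frequencies, columns to time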
When working with speech or other audible signals, you tend to see more advanced types of spectrograms in which frequency is scaled to resemble the way people perceive frequencies. In many cases, time-frequency characteristics of signals are much more distinctive than their time waveforms. Even to the human eye, it is easier to identify features in the time-frequency domain.
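For instance, here is a minimal sketch of computing a mel-scaled spectrogram with melSpectrogram, reusing the example signal and framing parameters from the earlier sketch (Audio Toolbox assumed; the band count is a placeholder choice):

% Perceptually spaced (mel) spectrogram for speech and audio work
[melSpec,cf] = melSpectrogram(x,fs, ...
    'Window',hann(winLength,'periodic'), ...
    'OverlapLength',ovlpLength, ...
    'NumBands',40);              % 40 mel bands is a common choice for speech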
To Recap
Once you have extracted features from a labeled data set, you end up with a collection of feature arrays, each associated with its original label. Those feature-label pairs are what you use to train your network.
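As a minimal sketch of that last step, assuming Deep Learning Toolbox and substituting random placeholder data for real feature arrays:

% Pair feature sequences with labels and train a small LSTM classifier
numFeatures = 39;                                          % e.g., 13 MFCCs plus delta and delta-delta
XTrain = {randn(numFeatures,120); randn(numFeatures,95)};  % one [features x frames] sequence per cell
YTrain = categorical(["keyword";"background"]);            % one label per sequence
layers = [ ...
    sequenceInputLayer(numFeatures)
    lstmLayer(64,'OutputMode','last')
    fullyConnectedLayer(2)
    softmaxLayer
    classificationLayer];
options = trainingOptions('adam','MaxEpochs',10,'Verbose',false);
net = trainNetwork(XTrain,YTrain,layers,options);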
Data augmentation lets you compute features from more data than is present in your original dataset. Because feature extraction over large datasets is computationally demanding, you may need acceleration hardware, such as multicore machines or GPUs, to compute features.
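For audio signals, one way to do this is with audioDataAugmenter, sketched below under the assumption that Audio Toolbox is available (the probabilities and SNR range are placeholder values):

% Generate augmented copies of the example signal by pitch shifting and adding noise
augmenter = audioDataAugmenter('NumAugmentations',5, ...
    'PitchShiftProbability',0.5, ...
    'AddNoiseProbability',0.5,'SNRRange',[10 20]);
augmented = augment(augmenter,x,fs);   % table with one augmented signal per row
xAug = augmented.Audio{1};             % first augmented copy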
Many MATLAB feature extraction functions support running on a GPU, and the speed gains grow with the amount of data the functions process.
To learn more about processing signals on GPUs, see the gpuArray documentation in MATLAB.
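Here is a rough sketch of GPU-based feature extraction, reusing the earlier example signal and framing parameters; it assumes Parallel Computing Toolbox, a supported GPU, and a release in which stft accepts gpuArray input:

% Move the signal to GPU memory, extract features there, then gather the result
xg = gpuArray(single(x));              % transfer the signal to the GPU
[sg,fg,tg] = stft(xg,fs,'Window',hann(winLength,'periodic'), ...
    'OverlapLength',ovlpLength,'FFTLength',winLength);
spectroGPU = gather(abs(sg));          % bring the magnitude spectrogram back to the CPU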
Once your network is trained, try it out.
The trained network is now used with the classify function to predict a trigger word mask from a set of features, together with some code that detects mask transitions and generates a chime sound at the right time as output.
Extract Features
% Extract MFCC from whole analysis buffer
[coeffs,delta,deltaDelta] = mfcc(buf,SampleRate, ...
    'WindowLength',winLength, ...
    'OverlapLength',ovlpLength);

% Concatenate and normalize features
featureMatrix = [coeffs,delta,deltaDelta];
featureMatrix = (featureMatrix - M)./S;
Inference
% Detect keyword with LSTM network (mark frames around the spoken keyword)
featMask = classify(net,featureMatrix.');
Trigger
% Debounce and re-align detection in time domain
[timeMask,chimePosition] = debounceAnalyzeDetectionMask(featMask);

% Generate chimes for detection events
chime = generateChimeAtSample(chimePosition,...
Test Your Knowledge
Which type of time-frequency transformation provides particularly good resolution for low-frequency signals?
Answer: The constant Q transform. It provides good resolution at low frequencies.
What do you need to develop an AI-powered signal processing application?
A simple, proven deep learning model, as well as a lot of data, some domain expertise, and the right tools for the specific application at hand. It’s important to remember that deep learning systems can only be as good as the data used to train them.