
Chapter 3

Improving Quality and Quantity of Training Data


When is noise in your data a good thing? When it accurately reflects real-world conditions.

For speech and voice applications, existing large data sets are typically recorded under conditions that differ from real application scenarios. If your application is supposed to recognize a spoken trigger word, then it needs to cope with poor microphones, specific types of reverberation, and background noise.

These and other effects can be added artificially to grow a training data set. Established signal processing methods and domain-specific tools support two approaches:

  • Data augmentation
  • Data synthesis

Signals can be difficult to measure consistently, or to observe often enough to build a large data set, so this chapter looks at techniques for creating more training data. Data synthesis creates new signals from models or simulations, and data augmentation is a specific type of data synthesis that creates new variations of your existing data.

First, a brief overview of how deep learning works with signal data.


Data Augmentation

Starting from existing labeled samples, augmentation generates:

  • Training data that is similar to your high-quality validation data
  • Variations of the available data that the system may encounter in real-world scenarios

Augmentation effects are often domain specific. Common augmentation effects for audio, speech, and acoustic data include time stretching, pitch shifting, volume control, and many more.
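As a rough illustration of two of these effects, the sketch below applies a volume change and a speed change to a test tone. It is written in plain Python rather than MATLAB, and the helper names (`make_tone`, `apply_gain_db`, `change_speed`) are hypothetical. Note that the speed change here is simple resampling, which alters duration and pitch together; a true time stretch or pitch shift that changes only one of the two needs a more elaborate method such as a phase vocoder.

```python
import math


def make_tone(freq_hz, duration_s, sample_rate):
    """Generate a sine tone as a list of float samples."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def apply_gain_db(signal, gain_db):
    """Volume control: scale the amplitude by a gain given in decibels."""
    scale = 10 ** (gain_db / 20)
    return [scale * x for x in signal]


def change_speed(signal, factor):
    """Resample by linear interpolation; factor > 1 shortens the signal.

    This changes duration and pitch together (it is not a pitch-preserving
    time stretch).
    """
    n_out = int(len(signal) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out


# One labeled sample becomes three training samples with the same label.
original = make_tone(440.0, 0.5, 8000)
quieter = apply_gain_db(original, -6.0)
faster = change_speed(original, 1.25)
variants = [original, quieter, faster]
```

Each variant keeps the label of the sample it came from, which is what lets augmentation grow a labeled data set without new recordings.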

Figure: Kitchen Reverberation. Kitchen reverberation signals with MATLAB code for augmenting data.
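Reverberation is commonly added by convolving a dry recording with a room impulse response. The sketch below shows that idea in plain Python, not the MATLAB code pictured in the figure. The impulse response is an assumption: exponentially decaying random values standing in for a measured kitchen response, and the names `synthetic_impulse_response` and `convolve` are hypothetical.

```python
import math
import random


def synthetic_impulse_response(length, decay, seed=0):
    """Stand-in for a measured room response: decaying random reflections."""
    rng = random.Random(seed)
    ir = [rng.uniform(-1.0, 1.0) * math.exp(-decay * i) for i in range(length)]
    ir[0] = 1.0  # direct path from source to microphone
    return ir


def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, x in enumerate(signal):
        if x == 0.0:
            continue  # zero samples contribute nothing
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out


# Dry signal: a short 440 Hz burst sampled at 8 kHz.
dry = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(400)]
ir = synthetic_impulse_response(200, decay=0.03)
wet = convolve(dry, ir)  # reverberant version, longer by the response tail
```

With a measured impulse response from the target room in place of the synthetic one, the same convolution yields training samples that sound as if they were recorded there.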

Figure: Washing Machine Noise. Washing machine noise signals with MATLAB code for augmenting data.
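Background noise is typically mixed into clean samples at a chosen signal-to-noise ratio (SNR). The following is a minimal sketch in plain Python, not the MATLAB code pictured in the figure. It assumes white noise as a stand-in for a real washing machine recording, and the function names `power` and `mix_at_snr` are hypothetical.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sample list."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that clean-to-noise power ratio equals snr_db, then add."""
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[:len(clean)]
    target_noise_power = power(clean) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    return [c + scale * n for c, n in zip(clean, noise)]


rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(2000)]
# White noise placeholder; a recorded noise clip would be used in practice.
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# One clean sample yields several noisy variants at different SNRs.
noisy_variants = {snr: mix_at_snr(clean, noise, snr) for snr in (20, 10, 0)}
```

Sweeping the SNR over the range expected in deployment exposes the network to both mild and severe noise conditions.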

Synthesis

Data synthesis generates training data from scratch using AI generative models, simulations, or a combination of both.

A few examples of domain-specific data synthesis include:

The text2speech function in MATLAB can help you generate high-quality synthetic voice signals through cloud-based services from IBM®, Microsoft®, or Google®, including Google’s well-known WaveNet network.

Figure: MATLAB Central File Exchange entry for the text2speech app from the MathWorks audio toolbox team.

A radar example shows how to classify pedestrians and bicyclists based on their micro-Doppler characteristics using a deep learning network and time-frequency analysis. The movements of different parts of an object placed in front of a radar produce micro-Doppler signatures that can be used to identify the object.

Figure: Two graphs. One is a plot of bicyclist trajectory, represented in dots forming a person on a bike. The other is a plot of speed on the y-axis versus time on the x-axis.
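To give a sense of how simulation can produce labeled radar data, here is a toy sketch in plain Python. It is not the model used in the example described above. It assumes a single point scatterer that approaches the radar at a steady speed while oscillating, a 24 GHz carrier, and illustrative (not measured) parameter ranges for each class. All names are hypothetical.

```python
import cmath
import math
import random

WAVELENGTH_M = 0.0125  # assumed 24 GHz radar: 3e8 / 24e9


def simulate_return(bulk_speed, osc_amplitude, osc_rate_hz, n_samples, sample_rate):
    """Unit-amplitude baseband return from one moving point scatterer."""
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        # Range change: steady approach plus an oscillating part (limb or pedal).
        delta_range = (-bulk_speed * t
                       + osc_amplitude * math.sin(2 * math.pi * osc_rate_hz * t))
        phase = -4 * math.pi * delta_range / WAVELENGTH_M  # two-way path
        samples.append(cmath.exp(1j * phase))
    return samples


def peak_doppler_hz(bulk_speed, osc_amplitude, osc_rate_hz):
    """Largest instantaneous Doppler shift implied by the motion model."""
    peak_radial_speed = bulk_speed + osc_amplitude * 2 * math.pi * osc_rate_hz
    return 2 * peak_radial_speed / WAVELENGTH_M


# Illustrative parameter ranges only; these are assumptions, not measurements.
CLASS_PARAMS = {
    "pedestrian": {"speed": (1.0, 2.0), "amplitude": 0.3, "rate": 2.0},
    "bicyclist": {"speed": (4.0, 6.0), "amplitude": 0.17, "rate": 1.5},
}


def make_dataset(examples_per_class, seed=0):
    """Build (signal, label) pairs by sampling motion parameters per class."""
    rng = random.Random(seed)
    dataset = []
    for label, p in CLASS_PARAMS.items():
        for _ in range(examples_per_class):
            speed = rng.uniform(*p["speed"])
            signal = simulate_return(speed, p["amplitude"], p["rate"], 2000, 4000.0)
            dataset.append((signal, label))
    return dataset


dataset = make_dataset(examples_per_class=3)
```

Because the labels come from the simulation parameters, every generated signal is labeled for free. A realistic simulator would model many scatterers, noise, and the radar waveform itself.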

Communication signals are also very difficult to record off the air in the field and then label. The WLAN Router Impersonation Detection example simulates realistic signals for RF fingerprinting. With the algorithm in place, you can then train and test the same system on actual data collected from a software-defined radio.

The figure shows three known routers, as well as the observer that collects non-high-throughput (non-HT) beacon signals and unknown router data.
