Chapter 2
Creating Annotated Data to Train and Validate Models
Training, Validation, and Testing Sets: What’s the Difference?
Most of your data should be reserved for training. The training set is what the backpropagation algorithm uses to optimize the large number of network weights, fitting the input data to the output annotations you provide so that the model learns what it should consider important. Training sets tend to be so large that they often include pre-existing or simulated data.
Validation data is also used while you train your model. It lets you continuously check how well the model generalizes to new data during training and helps you choose between candidate models. The validation set isn't consumed by a data-hungry optimization algorithm, so it can generally be much smaller than the training set. Validation data should be as realistic as possible, so creating it often involves acquiring new real-world signals and annotating them afresh.
You use the testing set to measure the performance of the model after a round of training is complete. Like the validation set, the testing set should be as realistic as possible, which again often means acquiring new real-world signals and annotating them afresh.
To create a working deep learning model, you typically need at least three types of data: data to train the model, data to validate that it is genuinely learning, and data to test its final performance.
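As a concrete illustration, here is a minimal sketch of where each set enters a typical Deep Learning Toolbox workflow. The variable names (XTrain, YTrain, XValidation, and so on) and the network in layers are placeholders, not part of this chapter's example.

    % Validation data is monitored continuously while the network trains.
    options = trainingOptions("adam", ...
        "ValidationData", {XValidation, YValidation}, ...
        "ValidationFrequency", 30, ...
        "Plots", "training-progress");
    net = trainNetwork(XTrain, YTrain, layers, options);

    % The testing set stays untouched until training is complete.
    YPred = classify(net, XTest);
    testAccuracy = mean(YPred == YTest);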
Importing Data Sets
Given the amount of data necessary to train a deep learning model, it's important to consider memory constraints and data management. If you can't fit all your data in memory, you will need a way to represent your stored data without reading it all in one go. One way to do this in MATLAB is to use datastores such as audioDatastore (requires Audio Toolbox™) or signalDatastore (requires Signal Processing Toolbox™ and Deep Learning Toolbox™). These datastores help manage in- or out-of-memory signals and process signals to extract features using parallel pools.
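For example, a labeled audio data set stored as one folder per class can be wrapped in a datastore and then split into the three sets described above. This is a minimal sketch; the folder layout and the 80/10/10 split are assumptions, not requirements.

    % Folder names become class labels (requires Audio Toolbox).
    ads = audioDatastore("dataset", ...
        "IncludeSubfolders", true, ...
        "LabelSource", "foldernames");

    % 80% training, 10% validation, and the remaining 10% testing.
    [adsTrain, adsValidation, adsTest] = splitEachLabel(ads, 0.8, 0.1, "randomized");

    % Signals are read one at a time, so the full data set never has to fit in memory.
    [audioIn, info] = read(adsTrain);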
Labeling
Labeling, or annotating, data correctly is necessary for your model to learn what you intend. Accurate labels on the validation and testing data sets are especially important because these sets are how you judge model performance, both during training and once training is complete. Your training data labels matter as well, but given the size of the training set, they are often produced with a different set of techniques.
This section starts with labels for validation and testing data sets, and then looks at training data sets.
Validation and Testing Data
Validation data needs to accurately represent the data that the network will see in the final application, so it must include signals that closely match the problem you are trying to solve. For an audio application, this might mean recording signals with the same microphone, first in quiet environments and then with varying levels of noise, echo, and reverberation.
The validation data should also carry high-quality labels, possibly added manually, because the labels define what you want the network to learn. In a keyword-spotting example, the label is a mask plotted on top of the signal that marks the regions where the keyword is spoken. Playing back only the masked regions is a quick way to confirm that the labels line up with the keyword.
How Can You Achieve Good Quality Labels for Your Validation and Test Data?
Use a labeling process with proven accuracy on the task at hand. In practice, this usually means labeling the data manually or applying a pretrained machine learning model that was trained to carry out a similar task.
You can make manual labeling easier with an interactive app like Signal Labeler. Interactive apps provide an interface to select regions of a signal, assign labels, adjust selected segments, and perform similar tasks.
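Labels created interactively in Signal Labeler can also be represented programmatically. The sketch below is a hypothetical example, not the workflow from this chapter: it stores one keyword region-of-interest label in a labeledSignalSet (requires Signal Processing Toolbox), assuming audioIn and fs hold a recorded signal and its sample rate.

    % Define a region-of-interest (ROI) label for the keyword.
    lbldef = signalLabelDefinition("keyword", ...
        "LabelType", "roi", ...
        "LabelDataType", "logical");

    % Attach the definition to the signal and mark one keyword occurrence.
    lss = labeledSignalSet({audioIn}, lbldef, "SampleRate", fs);
    setLabelValue(lss, 1, "keyword", [0.5 1.2], true);   % keyword between 0.5 s and 1.2 s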
Another option is to use a working model developed by someone else. Here's an example using Google's well-known speech-to-text service through its cloud API.
To create a mask label for trigger words, you can export the word labels to the MATLAB command line; a few lines of code then turn them into the mask you need, as sketched below.
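Here is a minimal sketch of that conversion. It assumes the exported results provide start and end times in seconds for each occurrence of the trigger word (wordStart and wordEnd), plus the signal's sample rate fs and length numSamples; all of these names are illustrative.

    % Build a sample-wise logical mask from word-level timestamps.
    mask = false(numSamples, 1);
    for k = 1:numel(wordStart)
        startIdx = max(1, round(wordStart(k)*fs) + 1);
        endIdx   = min(numSamples, round(wordEnd(k)*fs));
        mask(startIdx:endIdx) = true;   % samples covered by the trigger word
    end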
You can then play back just the annotated segments to confirm that the mask covers the trigger words.
Training Data
When working with signal data such as audio recordings, it is unrealistic to record terabytes of good-quality data and accurately label it manually. One way around this is to use existing labeled recordings, possibly collected for a slightly different problem. Licensing a research data set is another good option.
Label Spoken Words in Audio Signals Using an External API
Test the Signal Labeler app for yourself with the IBM® Watson Speech to Text API.
It’s okay if the training set is not tailor-made for your application; however, the bigger the difference between training and validation data, the larger the accuracy gap will be.
It is also good to have a few extra techniques handy, such as automated labeling algorithms, to get you started.
Audio Toolbox in particular has many automatic labeling functions, including detectSpeech and speech2text. Similarly, Signal Processing Toolbox supports bulk automatic labeling using Peak Labeler and custom functions.
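As a small illustration, detectSpeech can mark the speech regions in a recording with no manual effort. The file name below is a placeholder, and converting the regions to a sample-wise mask mirrors the earlier sketch.

    % Locate speech regions automatically (requires Audio Toolbox).
    [audioIn, fs] = audioread("recording.wav");   % placeholder file name
    roi = detectSpeech(audioIn, fs);              % N-by-2 matrix of region boundaries in samples

    % Convert the detected regions to a sample-wise logical mask.
    mask = false(size(audioIn, 1), 1);
    for k = 1:size(roi, 1)
        mask(roi(k,1):roi(k,2)) = true;
    end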
The next chapter covers more techniques to improve the quality and grow the size of your training data, such as data augmentation and synthesis.
Test Your Knowledge
How might someone accurately label validation and test data?
Any of the approaches covered in this chapter works: assign labels manually with the help of an interactive app, apply labels automatically with bulk labeling functions, or apply a pretrained machine learning model.