Pocket Guide

Access and Collect Labeled Data for Deep Learning

Why Is So Much Training Data Necessary?

Deep learning networks learn to classify abstract patterns without prior experience or existing knowledge to draw from. Because they lack the domain knowledge that people build into traditional methods, they need more training data to compensate. Your network will be only as good as the labeled data you provide. Several methods exist for acquiring labeled data.

Knowledge vs Size graph

Collect Your Own Data

You can build a database from scratch by collecting data from sensors. This is a good option in some cases, such as with autonomous vehicles, because millions of vehicles are already on the road. Collecting your own data seems straightforward at first, but the data must cover the entire solution space, and all of it must then be labeled.

Data collection diagram

Access and Augment Existing Data

You may be able to find all the required labeled data in an existing database. For example, for image classification you can use ImageNet. If an existing database doesn’t contain all the training data you need, you can augment the data set by adding modified copies of existing samples, such as speech with adjusted frequency or images that have been scaled and rotated.
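As a minimal sketch of augmentation, the Python below takes one labeled image, stored as a plain 2D list of pixel values, and produces rotated and scaled copies that keep the original label. The function names, the tiny 2x2 sample, and the "cat" label are hypothetical and chosen only for illustration; a real pipeline would normally rely on an image library instead.

```python
def rotate90(image):
    """Rotate a 2D list of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def scale_nearest(image, factor):
    """Upscale by an integer factor by repeating each pixel (nearest neighbor)."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def augment(image, label):
    """Return the original plus modified copies, all sharing one label."""
    variants = [image]
    rotated = image
    for _ in range(3):  # 90, 180, and 270 degree rotations
        rotated = rotate90(rotated)
        variants.append(rotated)
    variants.append(scale_nearest(image, 2))
    return [(variant, label) for variant in variants]


# Hypothetical 2x2 "image" and label, for illustration only.
sample = [[0, 1],
          [2, 3]]
augmented = augment(sample, "cat")
```

One labeled sample becomes five. Only use transformations that keep the label valid: for character recognition, for instance, a large rotation can turn one character into another, so small rotations are the safer choice.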

Character recognition diagram

Synthesize Data

If you understand the physics of your problem well enough, you can build a simulation to synthesize training data. A benefit of this approach is that the data is already labeled. Synthesized data can also be used when it is too expensive or difficult to collect real data.

Simulation physics model

Example: Synthesizing Waveform Data

RF modulation schemes and the impairments that affect them, such as noise, are well understood, so they are good candidates for synthesized training data. The real test is how well a network trained on synthesized data can label actual RF data.
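The idea can be sketched in a few lines of Python. This is a simplified illustration, not the method used in the guide: it assumes two modulation schemes (BPSK and QPSK), models only one impairment (additive white Gaussian noise at a chosen signal-to-noise ratio), and works on baseband symbols with no pulse shaping or channel effects. All function and variable names are hypothetical.

```python
import cmath
import math
import random

# Assumed constellations for illustration, each with unit average power.
CONSTELLATIONS = {
    "BPSK": [1 + 0j, -1 + 0j],
    "QPSK": [cmath.exp(1j * (math.pi / 4 + k * math.pi / 2)) for k in range(4)],
}


def synthesize_frame(modulation, num_symbols, snr_db, rng):
    """Return (noisy_symbols, label) for one frame of random symbols."""
    points = CONSTELLATIONS[modulation]
    # Noise power relative to unit signal power, split equally between I and Q.
    noise_power = 10 ** (-snr_db / 10)
    sigma = math.sqrt(noise_power / 2)
    frame = []
    for _ in range(num_symbols):
        symbol = rng.choice(points)
        noise = complex(rng.gauss(0, sigma), rng.gauss(0, sigma))
        frame.append(symbol + noise)
    # The label is known by construction, so no manual labeling is needed.
    return frame, modulation


def build_dataset(frames_per_class, num_symbols, snr_db, seed=0):
    """Generate the same number of labeled frames for every modulation."""
    rng = random.Random(seed)
    dataset = []
    for modulation in CONSTELLATIONS:
        for _ in range(frames_per_class):
            dataset.append(synthesize_frame(modulation, num_symbols, snr_db, rng))
    return dataset


dataset = build_dataset(frames_per_class=3, num_symbols=64, snr_db=20)
```

Each entry pairs a frame of noisy complex samples with the modulation name that generated it. Varying `snr_db` and adding further impairments would broaden the coverage of the synthesized set.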