Chapter 2

Getting Started with Machine Learning


Rarely a Straight Line

With machine learning there’s rarely a straight line from start to finish—you’ll find yourself constantly iterating and trying different ideas and approaches. This section describes a systematic machine learning workflow, highlighting some key decision points along the way.

Real-world data sets can be messy, incomplete, and in a variety of formats. You may have simple numeric data. But sometimes you’re combining several different data types, such as sensor signals, text, and streaming images from a camera.

Different types of data require different approaches to preprocessing. For example, selecting features to train an object detection algorithm requires specialized knowledge of image processing.
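As a minimal sketch of why preprocessing depends on the data type, the Python snippet below cleans a numeric sensor column and a free-text note in two different ways. The function names and sample values are hypothetical, chosen only for illustration, and Python is used here for brevity rather than MATLAB.

```python
import statistics

def preprocess_numeric(values):
    """Fill missing readings (None) with the mean, then standardize."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    filled = [mean if v is None else v for v in values]
    std = statistics.pstdev(filled)
    if std == 0:
        return [0.0 for _ in filled]
    # Filling with the mean leaves the mean unchanged, so this centers at zero.
    return [(v - mean) / std for v in filled]

def preprocess_text(text):
    """Lowercase, replace punctuation with spaces, and split into word tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()

# Hypothetical sample data: one numeric signal with a gap, one text note.
heart_rate = [72.0, None, 75.0, 71.0, 90.0]
note = "Patient felt dizzy; resting HR elevated."

scaled = preprocess_numeric(heart_rate)
tokens = preprocess_text(note)
print(scaled)
print(tokens)
```

Image or streaming data would need yet another routine, which is why the preprocessing step is usually written per data source rather than once for the whole data set.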

Choosing the right model is a balancing act. Highly flexible models tend to overfit data by modeling minor variations that could be noise. On the other hand, simple models may assume too much about the data and miss real structure. There are always tradeoffs between model speed, accuracy, and complexity.
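The balancing act can be seen in a small experiment. The sketch below, in Python rather than MATLAB, assumes a made-up relationship y = 2x plus noise and compares three models: a constant (too simple), a straight line, and a 1-nearest-neighbor predictor (highly flexible). All names and data are illustrative assumptions.

```python
import random

def make_data(n, rng):
    """Noisy samples of the assumed relationship y = 2x + noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

def fit_constant(xs, ys):
    """Too simple: always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Ordinary least squares for a straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def fit_nearest(xs, ys):
    """Highly flexible: returns the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(500, rng)

results = {}
for name, fit in [("constant", fit_constant), ("line", fit_line), ("nearest", fit_nearest)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name:8s} train MSE={results[name][0]:.2f}  test MSE={results[name][1]:.2f}")
```

The nearest-neighbor model reproduces the training data exactly, noise included, so its training error is zero while its error on new data is worse than the line's. The constant model errs in the other direction by ignoring the trend altogether.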

Every machine learning workflow begins with three questions:

  • What kind of data are you working with?
  • What insights do you want to get from it?
  • How and where will those insights be applied?

Your answers to these questions help you decide whether to use supervised or unsupervised learning.

Choose supervised learning if you need to train a model to make a prediction (for example, the future value of a continuous variable, such as temperature or a stock price) or a classification (for example, identifying makes of cars from webcam video footage).

Choose unsupervised learning if you need to explore your data and want to train a model to find a good internal representation, such as splitting data up into clusters.
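To make the distinction concrete, here is a minimal Python sketch (not MATLAB, and not taken from the guide) that applies both approaches to the same hypothetical one-dimensional measurements. The supervised model learns from known labels; the unsupervised one is given no labels and has to discover the groups. The data, labels, and function names are all assumptions made for illustration.

```python
# Hypothetical measurements; labels are available only in the supervised case.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels = ["low", "low", "low", "high", "high", "high"]

def train_supervised(xs, ys):
    """Nearest-centroid classifier: learn one mean per known label."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: sum(g) / len(g) for label, g in groups.items()}

def predict(centroids, x):
    """Assign the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

def cluster_unsupervised(xs, iterations=10):
    """Two-cluster k-means in one dimension: no labels are used."""
    centers = [min(xs), max(xs)]  # simple initialization at the extremes
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            clusters[nearest].append(x)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

centroids = train_supervised(values, labels)
clusters = cluster_unsupervised(values)

print(predict(centroids, 4.5))   # label for a new, unseen measurement
print(clusters)                  # groups found without any labels
```

The supervised model returns a named class for new data, while the clustering result is just a partition of the data whose meaning still has to be interpreted.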

Workflow at a Glance

The full guide looks at the steps in more detail, using a health monitoring app for illustration. The entire workflow will be completed in MATLAB®.

  1. ACCESS and load the data
  2. PREPROCESS the data
  3. DERIVE features using the preprocessed data
  4. TRAIN models using the features derived in step 3
  5. ITERATE to find the best model
  6. INTEGRATE the best-trained model into a production system
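The six steps above can be sketched end to end. The example below is a deliberately small Python illustration rather than the MATLAB health monitoring example itself: the synthetic sensor windows, the "resting" and "active" labels, the threshold classifier, and every function name are assumptions made only to show how the steps connect.

```python
import random
import statistics

def access_data(rng, n_windows=60, window_len=20):
    """1. ACCESS: stand-in for loading data; generates labeled synthetic sensor windows."""
    data = []
    for i in range(n_windows):
        label = "active" if i % 2 else "resting"
        spread = 2.0 if label == "active" else 0.3
        window = [rng.gauss(0, spread) for _ in range(window_len)]
        if i % 10 == 0:
            window[3] = None  # simulate a dropped sample
        data.append((window, label))
    return data

def preprocess(window):
    """2. PREPROCESS: drop missing samples and remove the mean offset."""
    clean = [v for v in window if v is not None]
    mean = statistics.mean(clean)
    return [v - mean for v in clean]

def derive_features(window):
    """3. DERIVE: summarize a window as (standard deviation, peak magnitude)."""
    return (statistics.pstdev(window), max(abs(v) for v in window))

def train(features, labels, feature_idx):
    """4. TRAIN: one-feature threshold classifier, midway between the class means."""
    by_label = {"resting": [], "active": []}
    for f, y in zip(features, labels):
        by_label[y].append(f[feature_idx])
    threshold = (statistics.mean(by_label["resting"]) + statistics.mean(by_label["active"])) / 2
    return lambda f: "active" if f[feature_idx] > threshold else "resting"

def accuracy(model, features, labels):
    """Fraction of windows classified correctly."""
    return sum(model(f) == y for f, y in zip(features, labels)) / len(labels)

rng = random.Random(1)
raw = access_data(rng)
features = [derive_features(preprocess(w)) for w, _ in raw]
labels = [y for _, y in raw]
train_f, train_y = features[:40], labels[:40]
val_f, val_y = features[40:], labels[40:]

# 5. ITERATE: train one candidate per feature and keep the best on held-out validation data.
candidates = {idx: train(train_f, train_y, idx) for idx in (0, 1)}
best_idx = max(candidates, key=lambda idx: accuracy(candidates[idx], val_f, val_y))
best_model = candidates[best_idx]

def classify_window(window):
    """6. INTEGRATE: the single entry point a production system would call."""
    return best_model(derive_features(preprocess(window)))

print("validation accuracy:", accuracy(best_model, val_f, val_y))
print(classify_window([5.0, -5.0] * 10))
```

In a real project the iterate step usually loops back further, revisiting preprocessing and feature choices as well as the model, which is why the workflow is rarely a straight line.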