Applied Machine Learning, Part 3: Hyperparameter Optimization

From the series: Applied Machine Learning

Machine learning is all about fitting models to data. The models consist of parameters, and we find values for those parameters through the fitting process. This process typically involves some type of iterative algorithm that minimizes the model error. That algorithm has its own parameters that control how it works, and those are what we call hyperparameters.
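To make the distinction concrete, here is a minimal sketch in Python. It assumes a toy one-parameter model (y = w * x) fitted by gradient descent; the function name `fit_slope` and the data are made up for illustration. The slope `w` is a parameter learned from the data, while `learning_rate` and `n_steps` are hyperparameters that control how the fitting algorithm runs.

```python
def fit_slope(xs, ys, learning_rate, n_steps):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0  # model parameter: learned from the data
    for _ in range(n_steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad  # learning_rate is a hyperparameter
    return w


# Toy data generated from y = 2x (an assumption for this sketch).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Same data, same model, different hyperparameters.
w_good = fit_slope(xs, ys, learning_rate=0.05, n_steps=200)
w_slow = fit_slope(xs, ys, learning_rate=0.0001, n_steps=200)
print(w_good, w_slow)
```

With the larger learning rate the fitted slope ends up very close to 2, while the tiny learning rate leaves it far short after the same number of steps. The data did not change; only the hyperparameters did.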

In deep learning, we also call the parameters that determine the layer characteristics hyperparameters. Today, we’ll be talking about techniques for both.

So, why do we care about hyperparameters? Well, it turns out that most machine learning problems are non-convex, so the fitting algorithm can settle on different solutions depending on how it is configured. This means that depending on the values we select for the hyperparameters, we might get a completely different model. By changing the values of the hyperparameters, we can find different, and hopefully better, models.

OK, so we know that we have hyperparameters, and we know we want to tweak them, but how do we do that? Some hyperparameters are continuous, some are binary, and others might take on any number of discrete values. This makes for a tough optimization problem. It is almost always impossible to run an exhaustive search of the hyperparameter space, since it takes too long.

So, traditionally, engineers and researchers have used techniques for hyperparameter optimization like grid search and random search. In this example, I’m using a grid search method to vary 2 hyperparameters – Box Constraint and Kernel Scale – for an SVM model.  As you can see, the error of the resulting model is different for different values of the hyperparameters. After 100 trials, the search has found 12.8 and 2.6 to be the most promising values for these hyperparameters.
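A grid search like this can be sketched in a few lines of Python. This is only an illustration: `cv_error` is a hypothetical stand-in for training an SVM and measuring its error, and the 10-by-10 log-spaced grid is an assumption chosen to give 100 trials, not the grid used in the example above.

```python
import itertools
import math


def cv_error(box_constraint, kernel_scale):
    # Hypothetical stand-in for "train an SVM with these values and
    # return its cross-validated error". Not a real SVM.
    return ((math.log10(box_constraint) - 1.0) ** 2
            + (math.log10(kernel_scale) - 0.5) ** 2)


# Ten log-spaced values per hyperparameter, from 1e-2 to 1e2 (assumed range).
exponents = [-2 + 4 * i / 9 for i in range(10)]
grid = [10 ** e for e in exponents]

# Evaluate every combination: 10 x 10 = 100 trials.
trials = [(cv_error(c, s), c, s) for c, s in itertools.product(grid, grid)]

# Keep the combination with the lowest error.
best_error, best_c, best_s = min(trials)
print(len(trials), best_c, best_s, best_error)
```

Note that the cost grows quickly: adding a third hyperparameter with ten values would mean 1,000 trials instead of 100.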

Recently, random search has become more popular than grid search. “How could that be?” you may be asking. Wouldn’t grid search do a better job of evenly exploring the hyperparameter space?

Let’s imagine you have 2 hyperparameters, “A” and “B.” Your model is very sensitive to “A,” but not sensitive to “B.” If we did a 3x3 grid search, we would only ever evaluate 3 different values of “A.” But if we did a random search with the same 9 trials, we would probably get 9 different values of “A,” even though some may be close together. As a result, we have a much better chance of finding a good value for “A.” In machine learning, we often have many hyperparameters. Some have a big influence over the results, and some don’t. So random search is typically a better choice.
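The counting argument is easy to check with a small Python sketch. The value range (0 to 1) and the fixed random seed are assumptions made just for this illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# 3x3 grid: every combination of three values of A and three values of B.
grid_values = [0.0, 0.5, 1.0]
grid_points = [(a, b) for a in grid_values for b in grid_values]

# Random search with the same budget: 9 independent draws of (A, B).
random_points = [(rng.random(), rng.random()) for _ in range(9)]

# Count how many different values of A each strategy actually tried.
distinct_a_grid = len({a for a, _ in grid_points})
distinct_a_random = len({a for a, _ in random_points})
print(distinct_a_grid, distinct_a_random)
```

Both strategies spend 9 trials, but the grid only ever tests 3 values of A, while the random draws test a different value of A on essentially every trial.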

Grid search and random search are nice because it’s easy to understand what’s going on.  However, they still require many function evaluations. They also don’t take advantage of the fact that, as we evaluate more and more combinations of hyperparameters, we learn how those values affect our results. For that reason, you can use techniques that create a surrogate model – or an approximation of the error as a function of the hyperparameters.

Bayesian optimization is one such technique. Here we see an example of a Bayesian optimization algorithm running, where each dot corresponds to a different combination of hyperparameters. We can also see the algorithm’s surrogate model, shown here as the surface, which it is using to pick the next set of hyperparameters.
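Here is a rough sketch of that loop in plain Python, for a single hyperparameter. It makes several assumptions: `validation_error` is a made-up stand-in for training a model, the surrogate is a small Gaussian process with hand-picked kernel settings, and the next point is chosen with a simple lower confidence bound rule rather than whatever rule a particular toolbox uses.

```python
import math


def validation_error(x):
    # Hypothetical stand-in for "train with hyperparameter x, return error".
    return (x - 0.7) ** 2 + 0.1 * math.sin(8 * x)


def kernel(a, b, length_scale=0.2, signal_var=0.05):
    # Squared-exponential kernel; the settings are assumptions for this sketch.
    return signal_var * math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def surrogate(xs, ys, x_new, jitter=1e-6):
    """Gaussian process posterior mean and standard deviation at x_new."""
    k_mat = [[kernel(a, b) + (jitter if i == j else 0.0)
              for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_vec = [kernel(a, x_new) for a in xs]
    mean_y = sum(ys) / len(ys)
    alpha = solve(k_mat, [y - mean_y for y in ys])
    mu = mean_y + sum(k * a for k, a in zip(k_vec, alpha))
    v = solve(k_mat, k_vec)
    var = kernel(x_new, x_new) - sum(k * vi for k, vi in zip(k_vec, v))
    return mu, math.sqrt(max(var, 0.0))


candidates = [i / 100 for i in range(101)]
xs = [0.0, 0.5, 1.0]                      # a few initial trials
ys = [validation_error(x) for x in xs]
initial_best = min(ys)

for _ in range(10):
    def lower_confidence_bound(x):
        mu, sigma = surrogate(xs, ys, x)
        return mu - 2.0 * sigma           # low predicted error or high uncertainty

    # Ask the surrogate, not the expensive model, where to look next.
    x_next = min((c for c in candidates if c not in xs),
                 key=lower_confidence_bound)
    xs.append(x_next)
    ys.append(validation_error(x_next))   # the only expensive step

best_x = xs[ys.index(min(ys))]
print(best_x, min(ys))
```

The key design point is that the expensive function is called once per iteration, while the cheap surrogate is queried at every candidate to decide where that one call should go.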

One other really cool thing about Bayesian optimization is that it doesn’t just look at how accurate a model is. It can also take into account how long it takes to train.  There could be sets of hyperparameters that cause the training time to increase by factors of 100 or more, and that might not be so great if we’re trying to hit a deadline. You can configure Bayesian optimization in a number of ways, including expected improvement per second, which penalizes hyperparameter values that are expected to take a very long time to train.
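As a sketch of the idea, expected improvement per second can be written as the usual expected improvement divided by the predicted training time. The Python below assumes a surrogate that gives a predicted error (`mu`) and uncertainty (`sigma`) for each candidate, plus a separate predicted training time; the two candidates and all their numbers are made up for illustration.

```python
import math


def expected_improvement(mu, sigma, best_so_far):
    """Expected improvement over best_so_far when minimizing error."""
    if sigma <= 0.0:
        return max(best_so_far - mu, 0.0)
    z = (best_so_far - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (best_so_far - mu) * cdf + sigma * pdf


def ei_per_second(mu, sigma, best_so_far, predicted_seconds):
    # Divide by predicted training time so slow candidates are penalized.
    return expected_improvement(mu, sigma, best_so_far) / predicted_seconds


best_so_far = 0.13  # lowest error observed so far (hypothetical)

# Two hypothetical candidates: one trains 100x slower but looks slightly better.
candidates = {
    "fast": {"mu": 0.12, "sigma": 0.03, "predicted_seconds": 10.0},
    "slow": {"mu": 0.10, "sigma": 0.03, "predicted_seconds": 1000.0},
}

pick_by_ei = max(
    candidates,
    key=lambda name: expected_improvement(
        candidates[name]["mu"], candidates[name]["sigma"], best_so_far),
)
pick_by_ei_per_second = max(
    candidates,
    key=lambda name: ei_per_second(best_so_far=best_so_far, **candidates[name]),
)
print(pick_by_ei, pick_by_ei_per_second)
```

With these numbers, plain expected improvement prefers the slow candidate, while the per-second version prefers the fast one, because the small extra gain is not worth 100 times the training time.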

Now, the main reason to do hyperparameter optimization is to improve the model. And, although there are other things we could do to improve it, I like to think of hyperparameter optimization as being a low-effort, high-compute type of approach. This is in contrast to something like feature engineering, where you put in more effort to create the new features, but you need less computational time. It’s not always obvious which activity is going to have the biggest impact, but the nice thing about hyperparameter optimization is that it lends itself well to “overnight runs,” so you can sleep while your computer works.

That was a quick explanation of hyperparameter optimization. For more information, check out the links in the description.



