This comes up as a very common question in ##machinelearning chat room – How do you chose the right parameters for your Neural Network model? How do you chose how many hidden layers to use, or what your drop out parameter should be, or what the learning rate should be, and so on?

Or in other fields, say you’re doing an engineering or science experiment, how do you chose the right parameters to test?

This problem is called **hyperoptimization**.

One possible way is to do a “grid search”. You keep all the parameters constant, then vary one by one and see what effect it has. You adjust a different parameter, then repeat on the first parameter, trying to either to test out every possible combination. The trouble is that the number of combinations blows dramatically, to the power of the number of parameters.

But never fear – there are smarter ways to do this, including random search, Bayesian Search, and Tree of Parzen Estimators (TPE). The good thing is that you don’t actually need to know any of this is, in order to actually use it.

Let’s take a toy example, and since it’s valentines at the time of writing, let’s imagine instead our error space is:

Where the axes are filling in for our hyperparameters (e.g. learning rate, size of hidden layer), and the z value is our error value (e.g. percentage of cat photos incorrectly identified as dogs, etc. Something that we want to minimize).

This diagram was generated with the notebook python code:

%matplotlib notebook import matplotlib.pyplot as plt import numpy as np def heart(x, y): return -((x**2 + y**2 - 1)**3 - x**2 *y**3); x = y = np.arange(-2, 2, 0.01) X, Y = np.meshgrid(x,y) Z = heart(X, Y) plt.figure() def plotHeart(): CS = plt.contour(X, Y, Z, levels=[-0.5, -0.2, -0.1, 0, 1, 20], colors=["0.9", "0.9", "0.9", "red", "0.9", "0.9"]) plt.clabel(CS, inline=1, fontsize=10) plotHeart()

So, say you’ve written a Tensorflow, or keras, or scipy, etc program in Python. For example, tensorflow/mnist.py. Have a quick look at that – see how they’ve got various hyperparameters that need tuning. There’s some complicated error function that we can’t directly visualize, in some high dimensional hyperparameter space. And we’re trying to find the parameters that minimize this error.

Here’s my dummy program that will stand in for your complex number or cat recognition program:

class HyperParameters: def __init__(self): self.learning_rate = 1 self.size_of_hidden_layer = 1 def runAndGetError(self): """This function should call your tensorflow etc code which does the whole training procedure and returns the TRAINING error. Lower is better!""" return heart(self.learning_rate, self.size_of_hidden_layer) hyperparameters = HyperParameters(); def main(): pass # This is where you do all the normal training code etc if __name__ == "__main__": main()

Note that we’ve got a class for all our hyperparameters. I recommend you having this! Note also that we don’t actual run the training unless __name__ == “__main__”. The point here is to make sure that our program runs as normal if we just run it directly, but we can also create a second program that uses hyperopt to tune this without having to touch our main program at all. This sort of separation is really nice.

You can put discrete ints, category ints, real, bools, etc in the

Don’t put any hyperopt code in your main program!

Now, without touching your main program at all, we create our hyperopt tuner:

#!/usr/bin/python3 from hyperopt import fmin, tpe, hp, STATUS_OK, Trials import heart.hyperparameters as hyperparameters # heart.py in the same folder import numpy as np import signal import sys import pickle # The key in the space must match a variable name in HyperParameters space = { 'learning_rate': hp.uniform('learning_rate', -2, 2), 'size_of_hidden_layer': hp.uniform('size_of_hidden_layer', -2, 2), } num_trials = 10 trials = Trials() def summarizeTrials(): print() print() print("Trials is:", np.sort(np.array([x for x in trials.losses() if x is not None]))) def main(): global trials loadFromPickle = True try: if loadFromPickle: trials = pickle.load(open("hyperopt.p", "rb")) except: print("Starting new trials file") def objective(args): for key, value in args.items(): if int(value) == value: value = int(value) setattr(hyperparameters, key, value) score = hyperparameters.runAndGetError() #print("Score:", score, "for", args) return {'loss': score, 'status': STATUS_OK} for i in range(num_trials): best = fmin(objective, space=space, algo=tpe.suggest, max_evals=(i+1), trials=trials) save_trials() summarizeTrials() print(i, "Best result was: ", best) def save_trials(): pickle.dump(trials, open("hyperopt.p", "wb")) def signal_handler(signal, frame): summarizeTrials() save_trials() sys.exit(0) signal.signal(signal.SIGINT, signal_handler) if __name__ == '__main__': main()

And now let’s plot this to see what is going on and see how hyperopt is going to sample from our space:

results = [ (x['misc']['vals']['learning_rate'][0], x['misc']['vals']['size_of_hidden_layer'][0]) for x in trials.trials if 'loss' in x['result']] plt.figure() for i, result in enumerate(results): plt.annotate(i, xy = result, va="center", ha="center") plotHeart() plt.show()

These suggestions for trial evaluations look awful! Why is not sampling close to the middle, where there are known good results?

What on earth is hyperopt doing?

It hasn’t even come close to the minimum points (at 0,0), while with just 10 samples, three of the samples are all bunched right next to each other. On repeated runs, sometimes 50% of the points are right on the border too!

Even with 50 samples you end up with clumping, and even though it finds an excellent value at 2, it doesn’t improve on that even after 48 samples!

There has to be something going on here.

# Strange Hyperopt behavior

Digging into the hyperopt code, turns out that there’s a:

_default_n_startup_jobs = 20

parameter. So for the first 20 runs, the ‘TPE’ behavior isn’t used at all. Digging around in the code reveals that this can be adjusted in our code by using:

from functools import partial ... best = fmin(objective,...algo=partial(tpe.suggest, n_startup_jobs=1),

Which now produces:

Which is looking a lot nicer!

I’m not very happy with the initial sampling – just look at the position of the first 5 trials, but this is already a lot better. But let’s try to crack open this black box a bit.

How it works.

**We split our experimental observations**into a small number of**Good Observations**, and the rest as the**Bad Observations**. They are split by how badly they did (e.g. percentage of images that they incorrectly classified). The uses uses the least-bad γ·√(num_observations) fraction for the good observations (obs_good) and the rest for the bad observations (obs_bad).This is done in the code with:obs_good, obs_bad = scope.ap_filter_trials(obs_x, obs_y, gamma)

**Pick random values of x.**but making it more likely to pick values of x which are similar to the x values in the**Good Observations**that we’ve already seen. So if x=3 and x=6 usually give a good result, we’re more likely to pick values close to that. This is mathematically done by sampling from a Gaussian Mixture Model comprised of Gaussians with a mean of each good observation’s x value.This is done in code with: (I’ve renamed the variables)

good_weights, good_mus, good_sigmas = adaptive_parzen_normal( obs_good, prior_weight, prior_mu, prior_sigma) x_samples = scope.GMM1(weights, mus, sigmas)

**Calculate p(x | y is good)**for our**Good Samples****and p(x | y is bad)**for the**Bad Samples**. This is done by getting the probability density of the Gaussian Mixture Model for each sample value. As an implementation note, hyperopt actually calculates the log for each of these values (llik = log likelihood).Now, in code, this is actually done in a very complicated way that I decided to devote its own post to.

So here instead is the

**equivalent**of what is done in code:bad_weights, bad_mus, bad_sigmas = adaptive_parzen_normal( obs_bad, prior_weight, prior_mu, prior_sigma) good_llik = GMM1_lpdf(x_samples, good_weights, good_mus, good_sigmas) bad_llik = GMM1_lpdf(x_samples, bad_weights, bad_mus, bad_sigmas)

**Find x value that maximizes the Expected Improvement**: Chose the x value from our x_samples that maximizes: p(x | y is good) / p(x | y is bad). This is mathematically equal to choosing the x value that maximizes log(p(x | y is good) – log(p(x | y is good).This is done in code with:

score = good_llik - bad_llik # Element-wise vector subtraction best = np.argmax(score) memo = x_samples[best]

**Repeat for each hyperparameter**. Note that this assumes that every hyperparameter is completely independent from the others, which is obviously false. To compensate, it weights older observations lower than newer ones, in the hope that it starts to converge all the parameters to good parameters that don’t change much.

Let’s put this all together, and graph what is going on:

def miscs_to_idxs_vals(miscs): keys = list(miscs[0]['idxs'].keys()) idxs = dict([(k, []) for k in keys]) vals = dict([(k, []) for k in keys]) for misc in miscs: for node_id in idxs: t_idxs = misc['idxs'][node_id] t_vals = misc['vals'][node_id] assert len(t_idxs) == len(t_vals) assert t_idxs == [] or t_idxs == [misc['tid']] idxs[node_id].extend(t_idxs) vals[node_id].extend(t_vals) return idxs, vals def get_lpdf_ast(parameter_name): idxs, vals = miscs_to_idxs_vals(trials.miscs) idxs = idxs[parameter_name] vals = vals[parameter_name] losses = trials.losses() from hyperopt import pyll from hyperopt.tpe import ap_filter_trials from hyperopt.tpe import adaptive_parzen_samplers qu = space[parameter_name].pos_args[0].pos_args[1] #pyll.scope.uniform(-2, 2) fn = adaptive_parzen_samplers['uniform'] fn_kwargs = dict(size=(4,), rng=np.random) s_below = pyll.Literal() s_above = pyll.Literal() b_args = [s_below, prior_weight] + qu.pos_args b_post = fn(*b_args, **fn_kwargs) a_args = [s_above, prior_weight] + qu.pos_args a_post = fn(*a_args, **fn_kwargs) start_val = space[parameter_name].pos_args[0].pos_args[1].pos_args[0].eval() end_val = space[parameter_name].pos_args[0].pos_args[1].pos_args[1].eval() x_samples = np.arange(start_val,end_val,0.1) # print b_post # print a_post fn_lpdf = getattr(pyll.scope, a_post.name + '_lpdf') # calculate the llik of b_post under both distributions a_kwargs = dict([(n, a) for n, a in a_post.named_args if n not in ('rng', 'size')]) b_kwargs = dict([(n, a) for n, a in b_post.named_args if n not in ('rng', 'size')]) below_llik = fn_lpdf(*([x_samples] + b_post.pos_args), **b_kwargs) above_llik = fn_lpdf(*([x_samples] + a_post.pos_args), **a_kwargs) new_node = pyll.scope.broadcast_best(x_samples, below_llik, above_llik) return s_below, s_above, below_llik, above_llik, new_node def get_lpdf(ii, s_below, s_above, below_llik, above_llik, new_node): below, above = ap_filter_trials(idxs[:ii], vals[:ii], idxs[:ii], losses[:ii], gamma) memo = {s_below: below, s_above: above} bl, al, nv = pyll.rec_eval([below_llik, above_llik, new_node], memo=memo) return bl, al, nv parameter_name = "learning_rate" s_below, s_above, below_llik, above_llik, new_node = get_lpdf_ast(parameter_name) plotAnnotatedHeart() plt.show() plt.figure() plt.text(0.06, 0.5, '⟵ Number of Trials', ha='center', va='center', rotation='vertical') num_subplots = 10 subplot_i =1 for ii in range(1, len(idxs)-1, 1+int(len(idxs)/num_subplots)): if ii > len(idxs): break bl, al, nv = get_lpdf(ii, s_below, s_above, below_llik, above_llik, new_node) plt.subplot(num_subplots, 1, subplot_i) subplot_i += 1 plt.plot(all_vals, bl, c='g') plt.plot(all_vals, al, c='r') plt.plot(all_vals, bl-al, c='k') plt.axvline(nv[0], c='k') plt.show()

Phew! That took… quite a lot of code! I wish hyperopt made this easier. I should try to make a patch for hyperopt to make it easier.

Anyway, so here’s the result (drum roll). First the loss function overlaid with where we are running our trials:

The graph on the bottom is for the x axis only variable of the heart graph. The x axis is the same, and the y axis is the **log(p(x|y is good) in green,** **log(p(x|y is bad)) in red,** and **log(p(x|y is good) – p(x|y is bad)) in black**. The vertical part indicates the largest value for the black line, which is the x value that will give the Maximum Improvement to the energy. We have one graph after every 10 trials, with the bottom graph being after 100 trials (aka observations).

We can start to see what is going wrong here. I don’t like that it’s not sampling the center enough, even though it can clearly see that the highest energy is there (the green peaks there). I don’t like that it tends to clump together measurements. I’m sure with a bit of thinking we can do better than this!

But I’ll think about it and leave that for another post.

# More realistic example

Here’s a hyperopt ‘space’ that I used for Q-Learning, to give an idea of the options for a more realistic model:

# The key in the space must match a variable name in HyperParameters space = { # Learning rate should be between 0.00001 and 1 'learning_rate': hp.loguniform('learning_rate', np.log(1e-5), np.log(1)), 'use_double_q_learning': hp.choice('use_double_q_learning', [True, False]), 'layer1_size': hp.quniform('layer1_size', 10, 100, 1), 'layer2_size': hp.quniform('layer2_size', 10, 100, 1), 'layer3_size': hp.quniform('layer3_size', 10, 100, 1), 'future_discount_max': hp.uniform('future_discount_max', 0.5, 0.99), 'future_discount_increment': hp.loguniform('future_discount_increment', np.log(0.001), np.log(0.1)), 'recall_memory_size': hp.quniform('recall_memory_size', 1, 100, 1), # This doesn't tune very well: # 'random_action_epsilon': # hp.loguniform('random_action_epsilon', np.log(0.0001), np.log(0.1)), 'recall_memory_num_experiences_per_recall': hp.quniform('recall_memory_num_experiences_per_recall', 10, 2000, 1), 'num_epochs': hp.quniform('num_epochs', 1, 10, 1), }