Practical Deep Neural Network in a few lines of code with TensorFlow

This is using the very latest TensorFlow tf.contrib.learn API.  The documentation is extremely thin, and as of writing there are no tutorials out there similar to this.  The API may change, and I don’t actually know if I’m doing things correctly.  This is just trial and error 🙂

So let’s say you want to:

  1. Train a Deep Neural Network by using some existing data, and training it against known labels.  e.g. labeled images.
  2. Use that trained network in your app to predict y given x.  E.g. guess the best label for a given image

    And as a bonus:

  3. Have some nice graphs of how well the training is doing

I assume that the data you have in a plain array-of-arrays or numpy 2D array.  It should be easy to change this to use csv or panda etc data. A row is a single example, and a column is a particular feature (e.g. house price, or pixel intensity at a specific location)

So, without further ado, here’s the function that will be shared between your training and your app

def get_tf_model():
    """ Setup our Deep Neural Network and LOAD any existing model
          ( Our (maybe trained) model, a helper input function)
    # FEATURES is a short description for each column
    # of your data.  Change this to match your data
    FEATURES = ["x1", "x2"]
    # Set this to describe your columns.  If they are all real values,
    # you don't need to change it.
    feature_columns = [tf.contrib.layers.real_valued_column(k) for k in FEATURES]

    # Build 3 layer DNN.  You can change this however you want, or use a
    # linear regressor, or use a classifier etc.
    # NOTE:
    #   This will LOAD any existing model in the "model_dir" directory!
    # The documentation fails to mention this point as of writing
    regressor = tf.contrib.learn.DNNRegressor(feature_columns=feature_columns,
                                              hidden_units=[128, 128, 128],

    def input_fn(x_data, y_data = None):
        # Note the 'shape' parameter is to suppress a very noisy warning.
        # You can probably remove this parameter in a month or two.
        feature_cols = {k: tf.constant(x_data[:,i], shape=[len(x_data[:,i]), 1])
                        for i, k in enumerate(FEATURES)}
        if y_data is None:
            return feature_cols
        labels = tf.constant(y_data)
        return feature_cols, labels

    return regressor, input_fn

Now the code for training. Note that it’s absolutely fine to carry on training a model that is already trained, if you have some new data for it.

x_test = None
y_test = None
regressor, input_fn = get_tf_model()

def train(training_data_y, training_data_x):
    global x_test, y_test, regressor, input_fn

    if x_test is None:
        x_train = np.array(training_data_x[:-20], dtype=np.float32)
        x_test = np.array(training_data_x[-20:], dtype=np.float32)
        y_train = np.array(training_data_y[:-20], dtype=np.float32)
        y_test = np.array(training_data_y[-20:], dtype=np.float32)
        x_train = np.array(training_data_x, dtype=np.float32)
        y_train = np.array(training_data_y, dtype=np.float32)

    print("Training model ...")
    # Fit model. input_fn(x_train, y_train), steps=2000)

    ev = regressor.evaluate(input_fn=lambda: input_fn(x_test, y_test), steps=1)
    print('  -  Trained Loss: {0:f}'.format(ev["loss"]))

And now finally, the code to use this model in the app. Note that no explicit loading is needed, because it’s loading it from the model_dir

regressor, input_fn = get_tf_model()
def predict(x_data):
    return regressor.predict(input_fn=lambda:input_fn(x_data))

Isn’t that simple?

We can also view a graph of the loss etc with:

tensorboard tf_model  # or whatever you set model_dir to

Then navigating to in the browser


Worst/Trickiest code I have ever seen

It’s easy to write bad code, but it takes a real genius to produce truly terrible code.  And the guys who wrote the python program hyperopt were clearly very clever.

Have a look at this function:  (don’t worry about what it is doing) from

# These produce conditional estimators for various prior distributions
def ap_uniform_sampler(obs, prior_weight, low, high, size=(), rng=None):
    prior_mu = 0.5 * (high + low)
    prior_sigma = 1.0 * (high - low)
    weights, mus, sigmas = scope.adaptive_parzen_normal(obs,
        prior_weight, prior_mu, prior_sigma)
    return scope.GMM1(weights, mus, sigmas, low=low, high=high, q=None,
size=size, rng=rng)

The details don’t matter here, but clearly it’s calling some function “adaptive_parzen_normal”  which returns three values, then it passes that to another function called “GMM1”  and returns the result.

Pretty straight forward?  With me so far?  Great.

Now here is some code that calls this function:

fn = adaptive_parzen_samplers[]
named_args = [[kw, memo[arg]] for (kw, arg) in node.named_args]
a_args = [obs_above, prior_weight] + aa
a_post = fn(*a_args, **dict(named_args))

Okay this is getting quite messy, but with a bit of thinking we can understand it.  It’s just calling the  ‘ap_uniform_sampler’  function, whatever that does, but letting us pass in parameters in some funky way.

So a_post is basically whatever “GMM1” returns  (which is a list of numbers, fwiw)

Okay, let’s continue!

fn_lpdf = getattr(scope, + '_lpdf')
a_kwargs = dict([(n, a) for n, a in a_post.named_args if n not in ('rng', 'size')])
above_llik = fn_lpdf(*([b_post] + a_post.pos_args), **a_kwargs)

and that’s it.  There’s no more code using a_post.

This took me a whole day to figure out what on earth is going on.  But I’ll give you, the reader, a hint.  This is not running any algorithm – it’s constructing an Abstract Syntax Tree and manipulating it.

If you want, try and see if you can figure out what it’s doing.

Answer: Continue reading

Python – Hyperopt – Finding the optimal hyper parameters

This comes up as a very common question in ##machinelearning chat room – How do you chose the right parameters for your Neural Network model?  How do you chose how many hidden layers to use, or what your drop out parameter should be, or what the learning rate should be, and so on?

Or in other fields, say you’re doing an engineering or science experiment, how do you chose the right parameters to test?

This problem is called hyperoptimization.

One possible way is to do a “grid search”.  You keep all the parameters constant, then vary one by one and see what effect it has.  You adjust a different parameter, then repeat on the first parameter, trying to either to test out every possible combination.  The trouble is that the number of combinations blows dramatically, to the power of the number of parameters.

But never fear – there are smarter ways to do this, including random search, Bayesian Search, and Tree of Parzen Estimators (TPE).  The good thing is that you don’t actually need to know any of this is, in order to actually use it.

Let’s take a toy example, and since it’s valentines at the time of writing, let’s imagine instead our error space is:


Toy hyperparameter space function.  I spent way too long making this pretty 😉

Where the axes are filling in for our hyperparameters (e.g. learning rate, size of hidden layer), and the z value is our error value (e.g. percentage of cat photos incorrectly identified as dogs, etc.  Something that we want to minimize).

This diagram was generated with the notebook python code:

%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np

def heart(x, y):
    return -((x**2 + y**2 - 1)**3 - x**2 *y**3);

x = y = np.arange(-2, 2, 0.01)
X, Y = np.meshgrid(x,y)
Z = heart(X, Y)

def plotHeart():
CS = plt.contour(X, Y, Z, levels=[-0.5, -0.2, -0.1, 0, 1, 20],
            colors=["0.9", "0.9", "0.9", "red", "0.9", "0.9"])
plt.clabel(CS, inline=1, fontsize=10)

So, say you’ve written a Tensorflow, or keras, or scipy, etc program in Python.  For example, tensorflow/  Have a quick look at that – see how they’ve got various hyperparameters that need tuning.  There’s some complicated error function that we can’t directly visualize, in some high dimensional hyperparameter space.  And we’re trying to find the parameters that minimize this error.

Here’s my dummy program that will stand in for your complex number or cat recognition program:

class HyperParameters:
    def __init__(self):
        self.learning_rate = 1
        self.size_of_hidden_layer = 1

    def runAndGetError(self):
        """This function should call your tensorflow etc
        code which does the whole training procedure and returns
        the TRAINING error.  Lower is better!"""
        return heart(self.learning_rate, self.size_of_hidden_layer)
hyperparameters = HyperParameters();

def main():
    pass # This is where you do all the normal training code etc

if __name__ == "__main__":

Note that we’ve got a class for all our hyperparameters.  I recommend you having this!  Note also that we don’t actual run the training unless __name__ == “__main__”.  The point here is to make sure that our program runs as normal if we just run it directly, but we can also create a second program that uses hyperopt to tune this without having to touch our main program at all.  This sort of separation is really nice.

You can put discrete ints, category ints, real, bools, etc in the

Don’t put any hyperopt code in your main program!

Now, without touching your main program at all, we create our hyperopt tuner:

from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
import heart.hyperparameters as hyperparameters  # in the same folder
import numpy as np
import signal
import sys
import pickle

# The key in the space must match a variable name in HyperParameters
space = {
    'learning_rate': hp.uniform('learning_rate', -2, 2),
    'size_of_hidden_layer': hp.uniform('size_of_hidden_layer', -2, 2),

num_trials = 10
trials = Trials()

def summarizeTrials():
    print("Trials is:", np.sort(np.array([x for x in trials.losses() if x is not None])))

def main():
    global trials
    loadFromPickle = True
        if loadFromPickle:
            trials = pickle.load(open("hyperopt.p", "rb"))
        print("Starting new trials file")

    def objective(args):
        for key, value in args.items():
            if int(value) == value:
                value = int(value)
            setattr(hyperparameters, key, value)
        score = hyperparameters.runAndGetError()
        #print("Score:", score, "for", args)
        return {'loss': score, 'status': STATUS_OK}

    for i in range(num_trials):
        best = fmin(objective,
    print(i, "Best result was: ", best)

def save_trials():
    pickle.dump(trials, open("hyperopt.p", "wb"))

def signal_handler(signal, frame):
signal.signal(signal.SIGINT, signal_handler)

if __name__ == '__main__':

And now let’s plot this to see what is going on and see how hyperopt is going to sample from our space:

results = [ (x['misc']['vals']['learning_rate'][0], x['misc']['vals']['size_of_hidden_layer'][0]) for x in trials.trials if 'loss' in x['result']]
for i, result in enumerate(results):
    plt.annotate(i, xy = result, va="center", ha="center")

Hyperopt TPE testing 10 samples from our space.  First trial labeled ‘0’


These suggestions for trial evaluations look awful!  Why is not sampling close to the middle, where there are known good results?

What on earth is hyperopt doing?

It hasn’t even come close to the minimum points (at 0,0), while with just 10 samples, three of the samples are all bunched right next to each other.  On repeated runs, sometimes 50% of the points are right on the border too!

Even with 50 samples you end up with clumping, and even though it finds an excellent value at 2, it doesn’t improve on that even after 48 samples!

There has to be something going on here.

Strange Hyperopt behavior

Digging into the hyperopt code, turns out that there’s a:

_default_n_startup_jobs = 20

parameter.  So for the first 20 runs, the ‘TPE’ behavior isn’t used at all.  Digging around in the code reveals that this can be adjusted in our code by using:

from functools import partial
...    best = fmin(objective,...
                algo=partial(tpe.suggest, n_startup_jobs=1),

Which now produces:


Which is looking a lot nicer!

I’m not very happy with the initial sampling – just look at the position of the first 5 trials, but this is already a lot better.  But let’s try to crack open this black box a bit.


How it works.

  1. We split our experimental observations into a small number of Good Observations, and the rest as the Bad Observations.  They are split by how badly they did (e.g. percentage of images that they incorrectly classified).  The uses uses the least-bad  γ·√(num_observations) fraction for the good observations (obs_good) and the rest for the bad observations (obs_bad).This is done in the code with:
    obs_good, obs_bad = scope.ap_filter_trials(obs_x, obs_y, gamma)
  2. Pick random values of x. but making it more likely to pick values of x which are similar to the x values in the Good Observations that we’ve already seen.  So if x=3 and x=6 usually give a good result, we’re more likely to pick values close to that.  This is mathematically done by sampling from a Gaussian Mixture Model comprised of Gaussians with a mean of each good observation’s x value.

    This is done in code with: (I’ve renamed the variables)

    good_weights, good_mus, good_sigmas = adaptive_parzen_normal(
                         obs_good, prior_weight, prior_mu, prior_sigma)
    x_samples = scope.GMM1(weights, mus, sigmas)
  3. Calculate p(x | y is good) for our Good Samples and p(x | y is bad) for the Bad Samples.  This is done by getting the probability density of the Gaussian Mixture Model for each sample value.  As an implementation note, hyperopt actually calculates the log for each of these values (llik = log likelihood).

    Now, in code, this is actually done in a very complicated way that I decided to devote its own post to.

    So here instead is the equivalent of what is done in code:

    bad_weights, bad_mus, bad_sigmas = adaptive_parzen_normal(
                         obs_bad, prior_weight, prior_mu, prior_sigma)
    good_llik = GMM1_lpdf(x_samples, good_weights, good_mus, good_sigmas)
    bad_llik = GMM1_lpdf(x_samples, bad_weights, bad_mus, bad_sigmas)
  4. Find x value that maximizes the Expected Improvement: Chose the x value from our x_samples that maximizes:    p(x | y is good) / p(x | y is bad).   This is mathematically equal to choosing the x value that maximizes log(p(x | y is good) – log(p(x | y is good).

    This is done in code with:

     score = good_llik - bad_llik  # Element-wise vector subtraction
     best = np.argmax(score)
     memo = x_samples[best]
  5. Repeat for each hyperparameter.  Note that this assumes that every hyperparameter is completely independent from the others, which is obviously false.  To compensate, it weights older observations lower than newer ones, in the hope that it starts to converge all the parameters to good parameters that don’t change much.

Let’s put this all together, and graph what is going on:

def miscs_to_idxs_vals(miscs):
keys = list(miscs[0]['idxs'].keys())
idxs = dict([(k, []) for k in keys])
vals = dict([(k, []) for k in keys])
for misc in miscs:
for node_id in idxs:
t_idxs = misc['idxs'][node_id]
t_vals = misc['vals'][node_id]
assert len(t_idxs) == len(t_vals)
assert t_idxs == [] or t_idxs == [misc['tid']]
return idxs, vals

def get_lpdf_ast(parameter_name):
idxs, vals = miscs_to_idxs_vals(trials.miscs)
idxs = idxs[parameter_name]
vals = vals[parameter_name]
losses = trials.losses()

from hyperopt import pyll
from hyperopt.tpe import ap_filter_trials
from hyperopt.tpe import adaptive_parzen_samplers

qu = space[parameter_name].pos_args[0].pos_args[1] #pyll.scope.uniform(-2, 2)
fn = adaptive_parzen_samplers['uniform']
fn_kwargs = dict(size=(4,), rng=np.random)
s_below = pyll.Literal()
s_above = pyll.Literal()
b_args = [s_below, prior_weight] + qu.pos_args
b_post = fn(*b_args, **fn_kwargs)
a_args = [s_above, prior_weight] + qu.pos_args
a_post = fn(*a_args, **fn_kwargs)

start_val = space[parameter_name].pos_args[0].pos_args[1].pos_args[0].eval()
end_val = space[parameter_name].pos_args[0].pos_args[1].pos_args[1].eval()
x_samples = np.arange(start_val,end_val,0.1)

# print b_post
# print a_post
fn_lpdf = getattr(pyll.scope, + '_lpdf')
# calculate the llik of b_post under both distributions
a_kwargs = dict([(n, a) for n, a in a_post.named_args
if n not in ('rng', 'size')])
b_kwargs = dict([(n, a) for n, a in b_post.named_args
if n not in ('rng', 'size')])
below_llik = fn_lpdf(*([x_samples] + b_post.pos_args), **b_kwargs)
above_llik = fn_lpdf(*([x_samples] + a_post.pos_args), **a_kwargs)
new_node = pyll.scope.broadcast_best(x_samples, below_llik, above_llik)

return s_below, s_above, below_llik, above_llik, new_node

def get_lpdf(ii, s_below, s_above, below_llik, above_llik, new_node):
below, above = ap_filter_trials(idxs[:ii], vals[:ii], idxs[:ii], losses[:ii], gamma)
memo = {s_below: below, s_above: above}
bl, al, nv = pyll.rec_eval([below_llik, above_llik, new_node],
return bl, al, nv
parameter_name = "learning_rate"
s_below, s_above, below_llik, above_llik, new_node = get_lpdf_ast(parameter_name)
plt.text(0.06, 0.5, '⟵ Number of Trials', ha='center', va='center', rotation='vertical')
num_subplots = 10
subplot_i =1
for ii in range(1, len(idxs)-1, 1+int(len(idxs)/num_subplots)):
if ii > len(idxs):
bl, al, nv = get_lpdf(ii, s_below, s_above, below_llik, above_llik, new_node)
plt.subplot(num_subplots, 1, subplot_i)
subplot_i += 1
plt.plot(all_vals, bl, c='g')
plt.plot(all_vals, al, c='r')
plt.plot(all_vals, bl-al, c='k')
plt.axvline(nv[0], c='k')

Phew!   That took… quite a lot of code!  I wish hyperopt made this easier.  I should try to make a patch for hyperopt to make it easier.

Anyway, so here’s the result (drum roll).  First the loss function overlaid with where we are running our trials:


The graph on the bottom is for the x axis only variable of the heart graph.  The x axis is the same, and the y axis is the log(p(x|y is good) in green, log(p(x|y is bad)) in red, and log(p(x|y is good) – p(x|y is bad)) in black.  The vertical part indicates the largest value for the black line, which is the x value that will give the Maximum Improvement to the energy.  We have one graph after every 10 trials, with the bottom graph being after 100 trials (aka observations).

We can start to see what is going wrong here.  I don’t like that it’s not sampling the center enough, even though it can clearly see that the highest energy is there (the green peaks there).  I don’t like that it tends to clump together measurements.  I’m sure with a bit of thinking we can do better than this!

But I’ll think about it and leave that for another post.


More realistic example

Here’s a hyperopt ‘space’ that I used for Q-Learning, to give an idea of the options for a more realistic model:

# The key in the space must match a variable name in HyperParameters
space = {
    # Learning rate should be between 0.00001 and 1
        hp.loguniform('learning_rate', np.log(1e-5), np.log(1)),
        hp.choice('use_double_q_learning', [True, False]),
    'layer1_size': hp.quniform('layer1_size', 10, 100, 1),
    'layer2_size': hp.quniform('layer2_size', 10, 100, 1),
    'layer3_size': hp.quniform('layer3_size', 10, 100, 1),
    'future_discount_max': hp.uniform('future_discount_max', 0.5, 0.99),
        hp.loguniform('future_discount_increment', np.log(0.001), np.log(0.1)),
    'recall_memory_size': hp.quniform('recall_memory_size', 1, 100, 1),
    # This doesn't tune very well:
    # 'random_action_epsilon':
    #    hp.loguniform('random_action_epsilon', np.log(0.0001), np.log(0.1)),
       hp.quniform('recall_memory_num_experiences_per_recall', 10, 2000, 1),
    'num_epochs': hp.quniform('num_epochs', 1, 10, 1),