Learning Machine Learning

Heh, funny title.

I became interested in machine learning working for Nokia.  I worked on Nokia’s Z Launcher application for Android.  You can scribble a letter (or multiple), and it would recognize it and search for it.  The app is available for download in the Play Store.


I worked on the Nokia Z Launcher’s handwriting recognition

Specifically I was tasked with optimizing the speed of the recognition.  I don’t know if I can state any specifics on how the character recognition was done, but I will say that I managed to increase the speed of the recognition a hundred fold.

But this recognition was actually a relatively simple task, compared to modern day deep neural networks, but it really whet my appetite to understand more.

When Alpha Go beat Lee Sedol, I knew that I simply must understand Deep Neural Networks.

Below is my journey in understanding, along with my reflective thoughts:

  1. I started off naively implementing an LSTM neural network without any training.
    I wanted to have a crude appreciation of the problems before I read about the solutions.  My results are documented in my previous post here, but looking back it’s quite embarrassing to read.  I don’t regret doing this at all however.
  2. Next I did the Andrew Ng Coursera Machine Learning course.
    This is an 11 week course in the fundamentals.  I completed it along with all the coursework, but by the end I felt that my knowledge was about 20 years out of date.  It was really nice to get the fundamentals, but none of the modern discoveries were discussed at all.  Nothing about LSTM, or Markov Chains, or dropouts, etc.
    The exercises are also all done in Matlab/Octave, which has fallen out of favour and uses a lot of support code.  I certainly didn’t feel comfortable in implementing a neural network system from scratch after watching the course.


    Passed the Coursera Machine learning course with 97.6% score.

    The lecturer, Andrew Ng, was absolutely awesome.  My complaints, really, boil down to that I wish the course was twice as long and that I could learn more from him!  I now help out in a machine learning chat group and find that most of the questions that people ask about TensorFlow, Theano etc are actually basics that are answered very well by Andrew Ng’s course.  I constantly direct people to the course.

  3. Next, Reinforcement Learning.  I did a 4 week Udacity course UD820.
    This was a done by a pair of teachers that are constantly joking with each other.  I first I thought it would annoy me, but I actually really enjoyed the course – they work really well together.  They are a lot more experienced and knowledgeable than they pretend to be.  (They take it in turns to be ignorant, to play the role of a student).
  4. I really wanted to learn about TensorFlow, so I did a 4 week Udacity Course UD730
    Again I did all the coursework for it.  I thought the course was pretty good, but it really annoyed me that each video was about 2 minutes long!  Resulting in a 30 second pause every 2 minutes while it loaded up the next video.  Most frustrating.
  5. At this point, I started reading papers and joined up for the Visual Doom AI competition.
    I have come quite far in my own Visual Doom AI implementation, but the vast majority of the work is the ‘setup’ required.  For example, I had to fix bugs in their doom engine port to get the built-in AI to work.  And it was a fair amount of work to get the game to run nicely with TensorFlow, with mini-batch training, testing and verification stages.
    I believe I understand how to properly implement a good AI for this, with the key coming from guided policy search, in a method similar to that pioneered by google for robotic control.  (Link is to a keynote at the International Conference on Learning Representations 2016).   The idea is to hack the engine to give me accurate internal data of the positions of enemies, walls, health, etc that I can use to train a very simple ‘teacher’.  Then use that teacher to supervise a neural network that has only visual information, thus allowing us to train a deep neural network with back-propagation.  By alternating between teacher and student, we can converge upon perfect solutions.  I hope to write more about this in a proper blog post.
  6. The International Conference on Learning Representations (ICLR) 2016
    Videos were absolutely fascinating, and were surprisingly easy to follow with the above preparation.
  7. I listened to the entire past two years of the The Talking Machines podcast.  It highlighted many areas that I was completely unfamiliar with, and highlighted many things that I knew about but just didn’t realise were important.
  8. I did the Hinton Coursera course on Neural Networks for Machine Learning, which perfectly complemented the Andrew Ng Coursera Course.  I recommend these two courses the most, for the foundations.  It is about 5 years out of date, but is all about the fundamentals.
  9. I did the Computational Neuroscience course.  The first half was interesting and was about, well, neuroscience.  But the math lectures were in a slow monotonic tone that really put me straight to sleep.  The second half of the course was just a copy of the Andrew Ng course (they even said so), so I just skipped all the lectures and did the exams with no problems.  I really liked that they let you do the homework in python.  It is easier in matlab, but I really wanted to improve my python datascience skills.  The exams took way way longer than their predicted times.  They would have 20 questions, requiring you to download, run and modify 4 programs, and say that you should be able to do it in an hour!
  10. I completed the IBM Artificial Intelligence class.  This is the second most expensive AI class that I’ve done, at $1600 plus $50 for the book.   I was really not at all impressed by it.  Most of the videos are 1 minute long, or less, each.  Which means that I’m spending half the time waiting for the next video to load up.  The main lecturer gets on my own nerves – he wears a brightly colored Google Glass for no apparent reason other than to show it off, and speaks in the most patronizing way.  You get praised continually for signing up to the course at the start.  It’s overly specialized.  You use their specific libraries which aren’t at all production ready, and you use toy data with a tiny number of training examples.  (e.g. trying to train for the gesture ‘toy’ with a single training example!).  Contrast this against the Google Self Driving course:
  11. The Google Self Driving course, which is $2400.  This is much better than the IBM course.  The main difference is that they’ve made the theme self driving cars, but you do it all in TensorFlow and you learn generic techniques that could be applied to any machine learning field.  You quickly produce code that could be easily made production ready.  You work with large realistic data, with large realistic neural networks, and they teach you use the Amazon AWS servers to train the data with.  The result is code that can be (and literally is!) deployed to a real car.

Tensorflow for Neurobiologists

I couldn’t find anyone else that has done this, so I made this really quick guide.  This uses tensorflow which is a complete overkill for this specific problem, but I figure that a simple example is much easier to follow.

Install and run python3 notebook, and tensorflow.  In Linux, as a user without using sudo:

$ pip3 install --upgrade --user ipython[all] tensorflow matplotlib
$ ipython3  notebook

Then in the notebook window, do New->Python 3

Here’s an example I made earlier. You can download the latest version on github here: https://github.com/johnflux/Spike-Triggered-Average-in-TensorFlow

Spike Triggered Average in TensorFlow

The data is an experimentally recorded set of spikes recorded from the famous H1 motion-sensitive neuron of the fly (Calliphora vicina) from the lab of Dr Robert de Ruyter van Steveninck.

This is a complete rewrite of non-tensorflow code in the Coursera course Computational Neuroscience by University of Washington. I am thoroughly enjoying this course!

Here we use TensorFlow to find out how the neuron is reacting to the data, to see what causes the neuron to trigger.

%matplotlib inline
import pickle
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()

FILENAME = 'data.pickle'

with open(FILENAME, 'rb') as f:
    data = pickle.load(f)

stim = tf.constant(data['stim'])
rho = tf.constant(data['rho'])
sampling_period = 2 # The data was sampled at 500hz = 2ms
window_size = 150 # Let's use a 300ms / sampling_period sliding window

We now have our data loaded into tensorflow as a constant, which means that we can easily ‘run’ our tensorflow graph. For example, let’s examine stim and rho:

print("Spike-train time-series =", rho.eval(),
      "\nStimulus time-series     =", stim.eval())
Spike-train time-series = [0 0 0 ..., 0 0 0] 
Stimulus time-series    = [-111.94824219  -81.80664062 
    10.21972656 ...,  9.78515625 24.11132812 50.25390625]

rho is an binary array where a 1 indicates a spike. Let’s turn that into an array of indices of where the value is 1, but ignoring the first window_size elements.

Note: We can use the [] and + operations on a tensorflow variable, and it correctly adds those operations to the graph. This is equivalent to using the tf.slice and tf.add operations.

spike_times = tf.where(tf.not_equal(rho[window_size:-1], 0)) + window_size
print("Time indices where there is a spike:\n", spike_times.eval())
Time indices where there is a spike:
 [[   158]
 [   160]
 [   162]
def getStimWindow(index):
    i = tf.cast(index, tf.int32)
    return stim[i-window_size+1:i+1]
stim_windows = tf.map_fn(lambda x: getStimWindow(x[0]), spike_times, dtype=tf.float64)
spike_triggered_average = tf.reduce_mean(stim_windows, 0).eval()
print("Spike triggered averaged is:", spike_triggered_average[0:5], "(truncated)")
Spike triggered averaged is: [-0.33083048 -0.29083503 -0.23076012 -0.24636984 -0.10962767] (truncated)

Now let’s plot this!

time = (np.arange(-window_size, 0) + 1) * sampling_period
plt.plot(time, spike_triggered_average)
plt.xlabel('Time (ms)')
plt.title('Spike-Triggered Average')



It’s… beautiful!

What we are looking at here, is that we’ve discovered that our neuron is doing a leaky integration of the stimulus. And when that integration adds up to a certain value, it triggers.

Do see the github repo for full source: https://github.com/johnflux/Spike-Triggered-Average-in-TensorFlow

Update: I was curious how much noise there was. There’s a plot with 1 standard deviation in light blue:

mean, var = tf.nn.moments(stim_windows,axes=[0])
plt.errorbar(time, spike_triggered_average, yerr=tf.sqrt(var).eval(), ecolor="#0000ff33")


Yikes!  This is why the input signal MUST be Gaussian, and why we need lots of data to average over.  For this, we’re averaging over 53583 thousand windows.

Coursera Neural Networks for Machine Learning

I’ve completed another 4 month Coursera Machine Learning course. The first was the Andrew NG’s Machine Learning course, and the second is Geoffrey Hinton’s Neural Networks for Machine Learning course.  I also completed a 2 month Coursera course in Computation Neuroscience.


Course certificate

Both courses use Matlab, and this second one was significantly more work and buggier. Several of the exam questions had key bits of information accidentally left out and referred to images that didn’t exist. Only by referring to the user forums could the key bits of information be found out. However I was one of the first people to do this course, so bugs are to be expected.

Technical issues aside, I enjoyed the course a lot. It’s approximately 5 years behind the current state of the field, but the information is still very relevant.

Edit: Update on 2017/2/5 – I also did the 2 month Computational Neuroscience course:


I enjoyed the first half of the course, which was all new and interesting to me.  But the second half was just a basic version of the Andrew Ng course.


Dilation Animation

There is a neat git repository that generates animations of various different settings. But unfortunately it didn’t show dilation – which is one of my favourite Convolutional Neural Network techniques.

Edit: My changes are now merged

So I added support.

And here’s what my patch generates:  The bottom layer is the input layer, the top is the output layer.


No stride, no padding, dilation of 2, kernel size of 3

The idea with dilation is that at each layer you double the dilation size.  That way your receptive field grows exponentially with each layer, allowing you to deal with large images without pooling and thus have pixel sharpness.

Second order back propagation

This is a random idea that I’ve been thinking about. A reader messaged me to say that this look similar to online l-bgfs. To my inexperienced eyes, I can’t see this myself, but I’m still somewhat a beginner.


Yes, I chose a cat example simply so that I had an excuse to add cats to my blog.

Say we are training a neural network to take images of animals and classify the image as being an image of a cat or not a cat.

You would train the network to output, say, y_{cat} = 1 if the image is that of a cat.

To do so, we can gather some training data (images labelled by humans), and for each image we see what our network predicts (e.g. “I’m 40% sure it’s a cat”).  We compare that against what a human says (“I’m 100% sure it’s a cat”)  find the squared error (“We’re off by 0.6, so our squared error is 0.6^2”) and adjust each parameter, w_i, in the network so that it slightly decreases the error (\Delta w_i  = -\alpha \partial E/\partial w_i).  And then repeat.

It’s this adjustment of each parameter that I want to rethink.  The above procedure is Stochastic Gradient Descent (SGD) – we adjust w_i to reduce the error for our test set (I’m glossing over overfitting, minibatches, etc).

Key Idea

This means that we are also trying to look for a local minimum. i.e. that once trained, we want the property that if we varied any of the parameters w_i by a small amount then it should increase the expected squared error E

My idea is to encode this into the SGD update. To find a local minima for a particular test image we want:

\dfrac{\partial y_{cat}}{\partial w_i} = 0

\dfrac{\partial^2y_{cat}}{\partial w_i^2}  < 0 (or if it equals 0, we need to consider the third differential etc).

Let’s concentrate on just the first criteria for the moment. Since we’ve already used the letter E to mean the half squared error of y, we’ll use F to be the half squared error of \dfrac{\partial y_{cat}}{\partial w_i}.

So we want to minimize the half squared error F:

F = \dfrac{1}{2}\left(\dfrac{\partial y_{cat}}{\partial w_i}\right)^2

So to minimize we need the gradient of this error:

\dfrac{\partial F}{\partial w_i} = \dfrac{1}{2} \dfrac{\partial}{\partial w_i} \left(\dfrac{\partial y_{cat}}{\partial w_i}\right)^2 = 0

Applying the chain rule:

\dfrac{\partial F}{\partial w_i} = \dfrac{\partial y_{cat}}{\partial w_i} \dfrac{\partial^2 y_{cat}}{\partial w_i^2} = 0

SGD update rule

And so we can modify our SGD update rule to:

\Delta w_i = -\alpha \partial E/\partial w_i - \beta \dfrac{\partial y_{cat}}{\partial w_i} \dfrac{\partial^2 y_{cat}}{\partial w_i^2}

Where \alpha and \beta are learning rate hyperparameters.


We finished with a new SGD update rule. I have no idea if this actually will be any better, and the only way to find out is to actually test. This is left as an exercise for the reader 😀

Best-fitting to a Cumulative distribution function in python TensorFlow

I wanted to find a best fit curve for some data points when I know that the true curve that I’m predicting is a parameter free Cumulative Distribution Function.  I could just do a linear regression on the points, but the resulting function, y might not have the properties that we desire from a CDF, such as:

  • Monotonically increasing
  • y(x) \xrightarrow{x\to\infty} 1      i.e. That it tends to 1 as x approaches positive infinity
  • y(x) \xrightarrow{x\to-\infty} 0      i.e. That it tends to 0 as x approaches negative infinity

First, I take the second two points.  To deal with this, I use the sigmoid function to create a new parameter, x’:

x'(x) = \textrm{sigmoid}(x)

This has the nice property that x'(x) \xrightarrow{x\to\infty} 1 and x'(x) \xrightarrow{x\to-\infty} 0

So now we can find a best fit polynomial on x'.


y(x') = a + bx' + cx'^2 + dx'^3 + .. + zx'^n

But with the boundary conditions that:

y(0) = 0 and y(1) = 1

Which, solving for the boundary conditions, gives us:

y(0) = 0 = a
b = (1-c-d-e-...-z)

Which simplifies our function to:

y(x') = (1-c-d-e-..-z)x' + cx'^2 + dx'^3 + ex'^4 .. + zx'^n

Implementing this in tensorflow using stochastic gradient descent (code is below) we get:


20 data samples


100 data samples

(The graph title equation is wrong.  It should be   x' = sigmoid(x) then y(x') = ax' +...  I was too lazy to update the graphs sorry)

Unfortunately this curve doesn’t have one main property that we’d like – it’s not monotonically increasing – it goes above 1. I thought about it for a few hours and couldn’t think of a nice solution.

I’m sure all of this has been solved properly before, but with a quick google I couldn’t find anything.

Edit: I’m updating this 12 days later. I have been thinking in the background about how to enforce that I want my function to be monotonic. (i.e. always increasing/decreasing) I was certain that I was just being stupid and that there was a simple answer. But by complete coincidence, I watched the youtube video Breakthroughs in Machine Learning – Google I/O 2016 in which the last speaker mentioned this restriction.  It turns out that this a very difficult problem – with the first solutions having a time complexity of O(N^4):

I didn’t understand what exactly google’s solution for this is, but they reference their paper on it:  Monotonic Calibrated Interpolated Look-up Tables, JMLR 2016.  It seems that I have some light reading to do!

#!/usr/bin/env python

import tensorflow as tf

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from scipy.stats import norm

# Create some random data as a binned normal function
n_observations = 20
fig, ax = plt.subplots(1, 1)
xs = np.linspace(-3, 3, n_observations)

ys = norm.pdf(xs) * (1 + np.random.uniform(-1, 1, n_observations))

ys = np.cumsum(ys)

ax.scatter(xs, ys)

highest_order_polynomial = 3

# We have our data now, so on with the tensorflow
# Setup the model graph

# Our input is an arbitrary number of data points (That's what the 'None dimension means)
# and each input has just a single value which is a float

X = tf.placeholder(tf.float32, [None])
Y = tf.placeholder(tf.float32, [None])

# Now, we know data fits a CDF function, so we know that
# ys(-inf) = 0   and ys(+inf) = 1

# So let's set:

X2 = tf.sigmoid(X)

# So now X2 is between [0,1]

# Let's now fit a polynomial like:
#   Y = a + b*(X2) + c*(X2)^2 + d*(X2)^3 + .. + z(X2)^(highest_order_polynomial)
# to those points.  But we know that since it's a CDF:
#   Y(0) = 0 and y(1) = 1
# So solving for this:
#   Y(0) = 0 = a
#   b = (1-c-d-e-...-z)
# This simplifies our function to:
#   y = (1-c-d-e-..-z)x + cx^2 + dx^3 + ex^4 .. + zx^(highest_order_polynomial)
# i.e. we have highest_order_polynomial-2  number of weights
W = tf.Variable(tf.zeros([highest_order_polynomial-1]))

# Now set up our equation:
Y_pred = tf.Variable(0.0)
b = (1 - tf.reduce_sum(W))
Y_pred = tf.add(Y_pred, b * tf.sigmoid(X2))

for n in xrange(2, highest_order_polynomial+1):
    Y_pred = tf.add(Y_pred, tf.mul(tf.pow(X2, n), W[n-2]))

# Loss function measure the distance between our observations
# and predictions and average over them.
cost = tf.reduce_sum(tf.pow(Y_pred - Y, 2)) / (n_observations - 1)

# Regularization if we want it.  Only needed if we have a large value for the highest_order_polynomial and are worried about overfitting
# It's also not clear to me if regularization is going to be doing what we want here
#cost = tf.add(cost, regularization_value * (b + tf.reduce_sum(W)))

# Use stochastic gradient descent to optimize W
# using adaptive learning rate
optimizer = tf.train.AdagradOptimizer(100.0).minimize(cost)

n_epochs = 10000
with tf.Session() as sess:
    # Here we tell tensorflow that we want to initialize all
    # the variables in the graph so we can use them

    # Fit all training data
    prev_training_cost = 0.0
    for epoch_i in range(n_epochs):
        sess.run(optimizer, feed_dict={X: xs, Y: ys})

        training_cost = sess.run(
            cost, feed_dict={X: xs, Y: ys})

        if epoch_i % 100 == 0 or epoch_i == n_epochs-1:

        # Allow the training to quit if we've reached a minimum
        if np.abs(prev_training_cost - training_cost) &amp;lt; 0.0000001:
        prev_training_cost = training_cost
    print &quot;Training cost: &quot;, training_cost, &quot; after epoch:&quot;, epoch_i
    w = W.eval()
    b = 1 - np.sum(w)
    equation = &quot;x' = sigmoid(x), y = &quot;+str(b)+&quot;x' +&quot; + (&quot; + &quot;.join(&quot;{}x'^{}&quot;.format(val, idx) for idx, val in enumerate(w, start=2))) + &quot;)&quot;;
    print &quot;For function:&quot;, equation
    pred_values = Y_pred.eval(feed_dict={X: xs}, session=sess)
    ax.plot(xs, pred_values, 'r')

#ax.set_ylim([-3, 3])

Logistic Regression and Regularization

Tons has been written about regularization, but I wanted to see it for myself to try to get an intuitive feel for it.

I loaded a dataset from google into python (a set of images of letters) and implemented a double for-loop to run a logistic regression with different test data sizes, and different regularization parameters.  (The value shown in the graph is actually 1/regularization ).


def doLogisticRegression(trainsize, regularizer):
    fitmodel = linear_model.LogisticRegression(C=regularizer)
    train_datasubset = train_dataset[0:trainsize,:,:].reshape(trainsize, -1)
    x = fitmodel.fit(train_dataset[0:trainsize,:,:].reshape(trainsize, -1),
    return [fitmodel.score(train_datasubset, train_labels[0:trainsize]),
            fitmodel.score(valid_dataset.reshape(valid_dataset.shape[0], -1), valid_labels)
trainsizes = [50,200,300,400,500,600,700,800,900,1000,2000,3000,4000,5000, 10000, 50000, 100000, 200000];
color = 0
plots = []
for regularizer in [1, 0.1, 0.01, 0.001]:
    results = np.array([doLogisticRegression(x, regularizer) for x in trainsizes])
    dashedplot = plt.plot(trainsizes, results[:,1], '--', label=(&amp;quot;r:&amp;quot; + str(regularizer)))
    plt.plot(trainsizes, results[:,0], c=dashedplot[0].get_color(), label=(&amp;quot;r:&amp;quot; + str(regularizer)))
plt.legend(loc='best', handles=plots)

The result is very interesting. The solid line is the training set accuracy, and the dashed line is the validation set accuracy. The vertical axis is the accuracy rate (percentage of images recognized as the correct letter) and the horizontal axis is the number of training examples.

graph of accuracy against training set for logistic regression

Image to letter recognition accuracy against training size, for various values of r = 1/regularization_factor.  Solid line is training set accuracy, dotted line is validation set accuracy.

First, I find it fascinating that purely a logistic regression can produce an accuracy of recognizing letters at 82%. If you added in spell checking, and ran this over an image, you could probably get a pretty decent OCR system, from purely an logistical regression.

Second, it’s interesting to see the effect of the regularization term. At less than about 500 training examples, the regularization term only hurts the algorithm. (A value of 1 means no regularization). At about 500 training examples though, the strong regularization (really helps). As the number of training examples increases, regularization makes less and less of an impact, and everything converges at around 200,000 training samples.

It’s quite clear at this point, at 200,000 training samples, that we are unlikely to get more improvements with more training samples.

A good rule of thumb that I’ve read is that you need approximately 50 training samples per feature. Since we have 28×28 = 784 features, this would be at 40,000 training samples which is actually only a couple of percent from our peak performance at 200,000 training samples (which is 2000000/784=2551 training samples per feature).

At this point, we could state fairly confidently that we need to improve the model if we want to improve performance.

Stochastic Gradient Descent

I reran with the same data but with stochastic gradient descent (batch size 128) and no regularization.  The accuracy (after 9000 runs) on the validation set was about the same as the best case with the logistic regression (81%), but took only a fraction of the time to run.  It took just a few minutes to run, verses a few hours for the logistic regression.

Stochastic Gradient Descent with 1 hidden layer

I added a hidden layer (of size 1024) and reran. The accuracy was only marginally better (84%).  Doubling the number of runs increased this accuracy to 86%.



Exploiting internal neural network symmetries

Neural networks have many internal symmetries.  A symmetry is an operation that you can apply to the weights such that for all inputs the outputs are the same.

The trivial symmetry is the identity. If you multiply all the weights in a neural network by 1, then the output remains unchanged.

We can also move nodes around by swapping the weight in any hidden layer.  For example:


But what other symmetries are there?  How is this useful?

Well, by itself, it’s not.  But let’s get hand-wavy.  Let’s say that we are using a ReLU activation function.  What this means is that at our hidden node, if the final value is greater than 0, then the output at that node is the same value.  if it’s negative, then the output is 0.

Let’s hand wave like crazy, and consider only the case where the node is activated (i.e the total value going into node is >=0).  The output is equal to the input, and in this region it is linear.

So, ignoring the non-linear case, we can find an interesting symmetry.  We can project to the basis   [1,1]/\sqrt{2} and [1,-1]/\sqrt{2}.  Or in pictures:


So, ignoring the non-linearity at the hidden layer, we find that we can actually weights in a more interesting fashion.

Now, what can we do with this?

Honestly, I’m not sure.  One idea I had is if you’ve optimized a neural net and reached the peak that your training set can give you, you could apply this swap operation to random nodes and retrain and see if it helps.  The effect is that you’re leaving the weights the same, but changing when the node activates.  Hopefully I’ll have a better idea later.

Meta Neural Network

The idea is to create a large neural network which itself generates other neural networks.  The goal is to let the neural network find a way to just create new networks without any training needed.

I have two very different ideas on how to do this.

The first is not so practical to implement for any real world problems, but would be interesting to test on toy problems:

meta neural net idea

Now, this idea has many drawbacks.  One problem is that when training the “Author” network in the first place, you typically have a fixed size input and so the training input will need to have a fixed size.

Since the input needs to be a fixed size, this means that you are limited to fixed size input and output for all the “Baby” Neural Networks that are generated, and that you must always have the same number of training samples.

Likewise, since the output is a fixed size, the weights has to be a fixed size, and so the number of layers, size of the layers, architecture, and so on has to be fixed for all possible Baby networks generated for a given Author.

Long Short Term Memory

LSTM Neural Networks are interesting.  There’s plenty of literature on the web about them, so I thought I’d cut to the chase and show how to implement a toy game.

In this game, we want to have a mouse and a cat in a room.  The cat tries to eat the mouse, and the mouse tries to avoid being eaten.


Cat & Mouse start randomly.  Mouse is taught to run away from the stationary cat.

We will give the mouse and cat both an LSTM Neural Network brain, and let them fight it out.

This game will be implemented with a straight forward LSTM neural network.  This means that it is supervised which means that our mouse and cat brains can only learn by example.

This is really important – it means that we can’t just have our mouse and cat learn for themselves.  We could do that if we used genetic algorithms to develop the neural networks, but that’s not how we’re doing it here.

With all that said, let’s just dive straight in!

First, lets set up a sigmoid function and its derivative:


import copy, numpy as np
import math

# compute sigmoid nonlinearity
def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output*(1-output)
class LSTM_Brain:
    def __init__(self):
        self.alpha = 0.1         # Training rate
        self.input_dim = 4       # Number of parameters for the input.  We input the position of the cat and the mouse.
        self.hidden_dim = 16     # Size of a brain, so to speak.  Feel free to vary
        self.output_dim = 1      # Number of outputs.  We output just the preferred direction, so just one.

    def setupTraining(self):
        self.input_values = list()
        self.layer_2_deltas = list()
        self.layer_1_values = list()
        self.total_abs_error = 0  # Store the error and smoothed error just for debugging information

    def feedInputForGetOutput(self, inputData):
        # inputData has only been tested for values between 0 and 1.  Not sure how it works otherwise
        assert len(inputData) == self.input_dim
        X = np.array([inputData])
        # Feed the input to our 'brain', and return the output (e.g. which direction it thinks we should move in)
        # hidden layer (input ~+ prev_hidden)
        layer_1 = sigmoid(np.dot(X,self.synapse_0) + np.dot(self.layer_1_values[-1],self.synapse_h))
        # output layer (new action)
        self.layer_2 = sigmoid(np.dot(layer_1,self.synapse_1))
        # store hidden layer so we can use it in the next timestep
        return self.layer_2[0]

    def teachError(self, error):
        self.total_abs_error += sum(np.abs(error))  # We store this just for debugging
        self.layer_2_error = np.array([error]).T

    def brain_wipe(self):
        # initialize neural network weights.  Forget everything we've learned
        self.synapse_0 = 2*np.random.random((self.input_dim,self.hidden_dim)) - 1
        self.synapse_1 = 2*np.random.random((self.hidden_dim,self.output_dim)) - 1
        self.synapse_h = 2*np.random.random((self.hidden_dim,self.hidden_dim)) - 1
        self.total_smoothed_abs_error = None

    def learnFromGame(self):
        future_layer_1_delta = np.zeros(self.hidden_dim)
        synapse_0_update = np.zeros_like(self.synapse_0)
        synapse_1_update = np.zeros_like(self.synapse_1)
        synapse_h_update = np.zeros_like(self.synapse_h)

        # Now, learn from the game
        for time_tick in range(len(self.input_values)):
            X = self.input_values[-time_tick-1]

            layer_1 = self.layer_1_values[-time_tick-1]
            prev_layer_1 = self.layer_1_values[-time_tick-2]

            # error at output layer
            layer_2_delta = self.layer_2_deltas[-time_tick-1]
            # error at hidden layer
            layer_1_delta = (future_layer_1_delta.dot(self.synapse_h.T) + layer_2_delta.dot(self.synapse_1.T)) * sigmoid_output_to_derivative(layer_1)

            # let's update all our weights so we can try again
            synapse_1_update += np.atleast_2d(layer_1).T.dot(layer_2_delta)
            synapse_h_update += np.atleast_2d(prev_layer_1).T.dot(layer_1_delta)
            synapse_0_update += X.T.dot(layer_1_delta)

            future_layer_1_delta = layer_1_delta

        self.synapse_0 += synapse_0_update * self.alpha
        self.synapse_1 += synapse_1_update * self.alpha
        self.synapse_h += synapse_h_update * self.alpha

        if self.total_smoothed_abs_error is None:
            self.total_smoothed_abs_error = self.total_abs_error
            self.total_smoothed_abs_error = self.total_smoothed_abs_error * 0.999 + 0.001* self.total_abs_error  # Smooth it - for debugging only

At this point, we’ve got a basic LSTM.

Let’s now give it a world and put this brain in a mouse, and the mouse in a world.

class World:
    # For directions, 0,0 is in the bottom left.  Up is positive y.  Direction is between 0 and 1
    Up = 0.125
    Right = 0.375
    Down = 0.625
    Left = 0.875

    def __init__(self):
        self.map_width = 6
        self.map_height = 6

    def resetWorld(self, cat_x, cat_y, mouse_x, mouse_y):
        self.cat_x = cat_x
        self.cat_y = cat_y
        self.mouse_x = mouse_x
        self.mouse_y = mouse_y

    def distance_from_cat_x(self):
        return self.mouse_x - self.cat_x
    def distance_from_cat_y(self):
        return self.mouse_y - self.cat_y

    def moveMouse(self, time_tick, mouse_movement):
        if mouse_movement &lt; (self.Up + 0.125):   # 0 to 0.25 is up.  midpoint is 0.125
            self.mouse_y += 1
        elif mouse_movement &lt; (self.Right + 0.125):  # 0.25 to 0.5 is right.  midpoint is 0.375
            self.mouse_x += 1
        elif mouse_movement &lt; (self.Down + 0.125): # 0.5 to 0.75 is down.  midpoint is 0.625
            self.mouse_y -= 1
            self.mouse_x -= 1   # 0.75 to 1 is left.  midpoint is 0.875
        self.mouse_x = np.clip(self.mouse_x, 0, self.map_width-1)
        self.mouse_y = np.clip(self.mouse_y, 0, self.map_height-1)

class GameWithStupidComputerTeacher:
    def __init__(self):
        self.debug = False     # Whether to print out debug information
        self.game_length = 10  # Number of moves a game should last for.  We make this fixed, but we could make it variable
        self.world = World()   # The world to run in

    def runGame(self, mouse_brain):
        # Start a new game.  We have a cat and a mouse
        # at their starting position


        for time_tick in range(self.game_length):
            mouse_movement = mouse_brain.feedInputForGetOutput([(self.world.cat_x)/float(self.world.map_width-1),

            ideal_direction = self.stupidTeacherForMouseGetIdealDirection()
            mouse_brain.teachError([ideal_direction - mouse_movement])

            # This is the direction the mouse brain says it wants to move in.  We treat it as a clockwise compass reading
            # Move the mouse accordingly
            #self.world.moveMouse(time_tick, ideal_direction)
            self.world.moveMouse(time_tick, mouse_movement)

        # Game is completed.  Learn from what we've been taught.

    def stupidTeacherForMouseGetIdealDirection(self):
        # Play the role of a stupid teacher for the mouse, and direct the mouse to just run in the opposite direction from the cat
        ideal_direction = 0
        if self.world.distance_from_cat_x() &gt;= 0:
            # We could move right.  But check if moving up or down makes more sense
            if self.world.distance_from_cat_y() &gt; self.world.distance_from_cat_x() and self.world.mouse_y != self.world.map_height - 1:
                ideal_direction = self.world.Up
            elif -self.world.distance_from_cat_y() &gt; self.world.distance_from_cat_x() and self.world.mouse_y != 0:
                ideal_direction = self.world.Down
            elif self.world.distance_from_cat_y() &gt; 0 and self.world.mouse_x == self.world.map_width - 1:
                ideal_direction = self.world.Up
            elif self.world.distance_from_cat_y() &lt; 0 and self.world.mouse_x == self.world.map_width - 1:
                ideal_direction = self.world.Down
                ideal_direction = self.world.Right
            if self.world.distance_from_cat_y() &gt; -self.world.distance_from_cat_x() and self.world.mouse_y != self.world.map_height - 1:
                ideal_direction = self.world.Up # up is the best way!
            elif -self.world.distance_from_cat_y() &gt; -self.world.distance_from_cat_x() and self.world.mouse_y != 0:
                ideal_direction = self.world.Down # down is the best way!
            elif self.world.distance_from_cat_y() &gt;= 0 and self.world.mouse_x == 0:
                ideal_direction = self.world.Up # can't go left, so go up!
            elif self.world.distance_from_cat_y() &lt; 0 and self.world.mouse_x == 0:
                ideal_direction = self.world.Down # can't go left, so go down!
                ideal_direction = self.world.Left # left is the best way!
        if self.debug:
            print &quot;cat:&quot;, self.world.cat_x, self.world.cat_y, &quot;mouse:&quot;, self.world.mouse_x, self.world.mouse_y, &quot;distance:&quot;, self.world.distance_from_cat_x(), self.world.distance_from_cat_y(), &quot;ideal: &quot;, ideal_direction
        return ideal_direction

game = GameWithStupidComputerTeacher()
mouse_brain = LSTM_Brain()

print_csv = True
print_progress = not print_csv and True

if not print_csv:
    graphics.init(game.world.map_width, game.world.map_height)
finish = False

for j in range(10000001):
    game.debug = (j == 10000000) and not print_csv


    if mouse_brain.total_smoothed_abs_error*4 &lt; 1.2:
        finish = True # An average of making 1 error per game

    # print out progress
    if (print_progress and j % 1000 == 0) or finish:
        mouse_x = [int(x[0][2]*(game.world.map_width-1)) for x in mouse_brain.input_values]
        mouse_y = [int(y[0][3]*(game.world.map_height-1)) for y in mouse_brain.input_values]
        cat_x = [game.world.cat_x] * len(mouse_brain.input_values)
        cat_y = [game.world.cat_y] * len(mouse_brain.input_values)
        print &quot;Game: &quot;, j
        print &quot;cat x:  &quot;, cat_x
        print &quot;cat y:  &quot;, cat_y
        print &quot;mouse x:&quot;, mouse_x, &quot;     Errors:&quot;, '%.1f'%(mouse_brain.total_abs_error*4), &quot;Smoothed Errors:&quot;, '%.1f'%(mouse_brain.total_smoothed_abs_error*4)
        print &quot;mouse y:&quot;, mouse_y
        graphics.updateGraphics(cat_x, cat_y, mouse_x, mouse_y)

    if print_csv and j % 1000 == 0:
        print j, &quot;,&quot;, mouse_brain.total_smoothed_abs_error*4

    if finish:

At the end of this, we have a mouse that learns to run away from a stationary cat.
The teacher is teaching the mouse to:

1. Move away from the cat, in the direction that you are already furthest in.
2. Unless there you come to a wall. In which case move along the wall, away from the cat.

For a toy example, this is relatively challenging since the mouse neural net is being fed the absolute position of the mouse and the cat,
and so needs to learn to take the difference between the positions, judge which is largest, and modify the behaviour if near a wall.

It takes approximately 1 million games, with each game being 10 moves, for the mouse to learn to follow the teachers’ instruction with only 1.4 mistakes per game (averaged over 1000 games). Reducing this to 1 mistake per game took a further 2 million games and took 40 minutes CPU time on my laptop.

Here is an example game: (The GUI was done in pygame btw)


Cat & Mouse start randomly. Mouse is taught to run away from the stationary cat.

And here’s an example mistake that it made even after 2 million training games:


Example mistake, after 2 million training games.  The mouse takes a step towards the cat.

I played about with different training rate values (alpha in the code) but the learning rate didn’t seem dependent upon it.

I tested increasing the hidden net from 16 to 32, and that made a pretty big difference. To reach an accuracy of 1.2 mistakes per game took:

  • 16×16 hidden layer took 8m40s and 716342 games
  • 32×32 hidden layer took 3m29s and 239721 games
  • 64×64 hidden layer took 3m13s and 216513 games
  • 128×128 hidden layer took 5m45s and 279732 games

Interestingly, if you compare the first 100,000 games the neural net size makes hardly no difference at all.  They all get down to an error of about 2 at about the same rate.  It’s also cool to see that the large 64×64 neural net takes about 30,000 training games to catch up with the small neural networks, since it has a much larger matrix to tame. Yet the 128×128 is much quicker to train. I don’t know why.

Errors per game

I also wrote a small program to display the synapses_0 matrix, which converts the input to the hidden matrix size, and plotted its against time.  I also attempt to show how it is mapping the four inputs to the output by showing our four colours would be transformed by the matrix.  While it is pretty to watch, it’s hard to see anything useful from it.


The synapses_0 matrix values, plotted in grayscale, and how it combines with the inputs.  Each frame is 10,000 training games.

The initial and final state:


For the sake of completeness, this is the graphics.py

import pygame


# This sets the margin between each cell
WINDOW_SIZE = [255, 255]

def lightened(color, amount):
  h, s, l, a = color.hsla
  if l+amount &amp;gt; 100: l = 100
  elif l+amount &amp;lt; 0: l = 0
  else: l += amount
  color.hsla = (h, s, l, a)
  return color

def init(map_width, map_height):
    global WINDOW_SIZE, MAP_WIDTH, MAP_HEIGHT, screen, clock, cat_image, mouse_image, grid_width, grid_height
    MAP_WIDTH = map_width
    MAP_HEIGHT = map_height
    grid_width = (WINDOW_SIZE[0] - MARGIN) / MAP_WIDTH - MARGIN
    grid_height = (WINDOW_SIZE[1] - MARGIN) / MAP_HEIGHT - MARGIN

    # Round the window size, so that we don't have fractions
    WINDOW_SIZE = ((grid_width + MARGIN) * MAP_WIDTH + MARGIN, (grid_height + MARGIN) * MAP_HEIGHT + MARGIN)

    # Initialize pygame
    # Set the HEIGHT and WIDTH of the screen
    screen = pygame.display.set_mode(WINDOW_SIZE)
    # Set title of screen
    pygame.display.set_caption(&amp;quot;LSTM Neural Net Cat and Mouse&amp;quot;)

    # Used to manage how fast the screen updates
    clock = pygame.time.Clock()
    cat_image = pygame.image.load(&amp;quot;cat.png&amp;quot;)
    mouse_image = pygame.image.load(&amp;quot;mouse.png&amp;quot;)

    mouse_image = pygame.transform.smoothscale(mouse_image, (grid_width, grid_height))
    cat_image = pygame.transform.smoothscale(cat_image, (grid_width, grid_height))

def updateGraphics(cat_x, cat_y, mouse_x, mouse_y):
    for event in pygame.event.get():  # User did something
        if event.type == pygame.QUIT:  # If user clicked close
            done = True  # Flag that we are done so we exit this loop
            return False

    # Set the screen background

    # Draw the grid
    for row in range(MAP_HEIGHT):
        for column in range(MAP_WIDTH):
                             [(MARGIN + grid_width) * column + MARGIN,
                              (MARGIN + grid_height) * row + MARGIN,
    for i in range(len(mouse_x)):
        color = pygame.Color(&amp;quot;green&amp;quot;)
        color = lightened(color, -30*i/len(mouse_x))
                             [(MARGIN + grid_width) * mouse_x[i] + MARGIN*2,
                              (MARGIN + grid_height) * (MAP_HEIGHT - mouse_y[i] - 1) + MARGIN*2,
                              grid_width - MARGIN*2,
                              grid_height - MARGIN*2])
        if i != len(mouse_x) - 1:
                             [((MARGIN + grid_width) * mouse_x[i] + (MARGIN + grid_width)/2,
                              (MARGIN + grid_height) * (MAP_HEIGHT - mouse_y[i] - 1) + (MARGIN + grid_height)/2),
                              ((MARGIN + grid_width) * mouse_x[i+1] + (MARGIN + grid_width)/2,
                              (MARGIN + grid_height) * (MAP_HEIGHT - mouse_y[i+1] - 1) + (MARGIN + grid_height)/2)],
    for i in range(len(cat_x)):
        color = pygame.Color(&amp;quot;red&amp;quot;)
        color = lightened(color, -30*i/len(cat_x))
                             [(MARGIN + grid_width) * cat_x[i] + MARGIN*2,
                              (MARGIN + grid_height) * (MAP_HEIGHT - cat_y[i] - 1) + MARGIN*2,
                              grid_width - MARGIN*2,
                              grid_height - MARGIN*2])
        if i != len(cat_x) - 1:
                             [((MARGIN + grid_width) * cat_x[i] + (MARGIN + grid_width)/2,
                              (MARGIN + grid_height) * (MAP_HEIGHT - cat_y[i] - 1) + (MARGIN + grid_height)/2),
                              ((MARGIN + grid_width) * cat_x[i+1] + (MARGIN + grid_width)/2,
                              (MARGIN + grid_height) * (MAP_HEIGHT - cat_y[i+1] - 1) + (MARGIN + grid_height)/2)],
    screen.blit(mouse_image, ((MARGIN + grid_width) * mouse_x[-1] + MARGIN,
                              (MARGIN + grid_height) * (MAP_HEIGHT - mouse_y[i-1] - 1) + MARGIN))
    screen.blit(cat_image, ((MARGIN + grid_width) * cat_x[-1] + MARGIN,
                              (MARGIN + grid_height) * (MAP_HEIGHT - cat_y[i-1] - 1) + MARGIN))

    # Go ahead and update the screen with what we've drawn.

    # Limit to 60 frames per second
    return True