Learning Machine Learning

Heh, funny title.

I became interested in machine learning while working for Nokia.  I worked on Nokia’s Z Launcher application for Android.  You can scribble a letter (or several), and it will recognize it and search for it.  The app is available for download in the Play Store.


I worked on the Nokia Z Launcher’s handwriting recognition

Specifically I was tasked with optimizing the speed of the recognition.  I don’t know if I can state any specifics on how the character recognition was done, but I will say that I managed to increase the speed of the recognition a hundred fold.

This recognition was actually a relatively simple task compared to modern deep neural networks, but it really whetted my appetite to understand more.

When AlphaGo beat Lee Sedol, I knew that I simply must understand Deep Neural Networks.

Below is my journey in understanding, along with my reflective thoughts:

  1. I started off naively implementing an LSTM neural network without any training.
    I wanted to have a crude appreciation of the problems before I read about the solutions.  My results are documented in my previous post here, but looking back it’s quite embarrassing to read.  I don’t regret doing this at all however.
  2. Next I did the Andrew Ng Coursera Machine Learning course.
    This is an 11 week course in the fundamentals.  I completed it along with all the coursework, but by the end I felt that my knowledge was about 20 years out of date.  It was really nice to get the fundamentals, but none of the modern discoveries were discussed at all.  Nothing about LSTMs, Markov chains, dropout, etc.
    The exercises are also all done in Matlab/Octave, which has fallen out of favour, and they rely on a lot of support code.  I certainly didn’t feel comfortable implementing a neural network system from scratch after watching the course.


    Passed the Coursera Machine Learning course with a 97.6% score.

    The lecturer, Andrew Ng, was absolutely awesome.  My complaints really boil down to wishing that the course was twice as long so that I could learn more from him!  I now help out in a machine learning chat group and find that most of the questions people ask about TensorFlow, Theano, etc. are actually basics that are answered very well by Andrew Ng’s course.  I constantly direct people to it.

  3. Next, Reinforcement Learning.  I did a 4 week Udacity course UD820.
    This was taught by a pair of teachers who constantly joke with each other.  At first I thought it would annoy me, but I actually really enjoyed the course – they work really well together.  They are a lot more experienced and knowledgeable than they pretend to be.  (They take it in turns to be ignorant, to play the role of a student.)
  4. I really wanted to learn about TensorFlow, so I did a 4 week Udacity Course UD730
    Again I did all the coursework for it.  I thought the course was pretty good, but it really annoyed me that each video was about 2 minutes long, resulting in a 30 second pause every 2 minutes while the next video loaded.  Most frustrating.
  5. At this point, I started reading papers and signed up for the Visual Doom AI competition.
    I have come quite far in my own Visual Doom AI implementation, but the vast majority of the work is the ‘setup’ required.  For example, I had to fix bugs in their Doom engine port to get the built-in AI to work.  And it was a fair amount of work to get the game to run nicely with TensorFlow, with mini-batch training, testing and verification stages.
    I believe I understand how to properly implement a good AI for this, with the key coming from guided policy search, in a method similar to that pioneered by Google for robotic control.  (The link is to a keynote at the International Conference on Learning Representations 2016.)  The idea is to hack the engine to give me accurate internal data on the positions of enemies, walls, health, etc. that I can use to train a very simple ‘teacher’.  Then use that teacher to supervise a neural network that has only visual information, thus allowing us to train a deep neural network with back-propagation.  By alternating between teacher and student, we can converge upon perfect solutions.  There’s a minimal sketch of the student-imitates-teacher step just after this list, and I hope to write more about this in a proper blog post.
  6. The International Conference on Learning Representations (ICLR) 2016
    The videos were absolutely fascinating, and surprisingly easy to follow with the above preparation.
  7. I listened to the entire past two years of the Talking Machines podcast.  It highlighted many areas that I was completely unfamiliar with, and brought up many things that I knew about but just didn’t realise were important.
  8. I did the Hinton Coursera course on Neural Networks for Machine Learning, which perfectly complemented the Andrew Ng Coursera Course.  I recommend these two courses the most, for the foundations.  It is about 5 years out of date, but is all about the fundamentals.
  9. I did the Computational Neuroscience course.  The first half was interesting and was about, well, neuroscience.  But the math lectures were delivered in a slow monotone that really put me straight to sleep.  The second half of the course was just a copy of the Andrew Ng course (they even said so), so I just skipped all the lectures and did the exams with no problems.  I really liked that they let you do the homework in Python.  It is easier in Matlab, but I really wanted to improve my Python data science skills.  The exams took way, way longer than their predicted times.  They would have 20 questions, requiring you to download, run and modify 4 programs, and claim that you should be able to do it all in an hour!
  10. I completed the IBM Artificial Intelligence class.  This is the second most expensive AI class that I’ve done, at $1600 plus $50 for the book.  I was really not at all impressed by it.  Most of the videos are a minute long or less, which means that I spent half the time waiting for the next video to load.  The main lecturer got on my nerves – he wears a brightly colored Google Glass for no apparent reason other than to show it off, and speaks in the most patronizing way.  You get praised continually for signing up to the course at the start.  It’s overly specialized.  You use their specific libraries, which aren’t at all production ready, and you use toy data with a tiny number of training examples (e.g. trying to train the gesture ‘toy’ with a single training example!).  Contrast this against the Google Self Driving course:
  11. The Google Self Driving course, which is $2400.  This is much better than the IBM course.  The main difference is that they’ve made the theme self-driving cars, but you do it all in TensorFlow and you learn generic techniques that could be applied to any machine learning field.  You quickly produce code that could easily be made production ready.  You work with large, realistic data and large, realistic neural networks, and they teach you to use Amazon AWS servers to train on.  The result is code that can be (and literally is!) deployed to a real car.
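
Going back to the Visual Doom AI idea in item 5: here’s a minimal sketch of just the supervised ‘student imitates teacher’ step, in TensorFlow 1.x.  Everything in it is an assumption made up for illustration (the tensor shapes, the 8 discrete actions, and the names frames and teacher_action), and the full guided policy search method also alternates this with re-optimising the teacher, which isn’t shown here.

import tensorflow as tf

# Hypothetical inputs: the teacher policy is trained separately on privileged
# internal game state; the student only ever sees raw pixels.
frames = tf.placeholder(tf.float32, [None, 84, 84, 3])   # visual input only
teacher_action = tf.placeholder(tf.int32, [None])        # action chosen by the teacher

# Student network: a small conv net over the pixels.
conv1 = tf.layers.conv2d(frames, 16, 8, strides=4, activation=tf.nn.relu)
conv2 = tf.layers.conv2d(conv1, 32, 4, strides=2, activation=tf.nn.relu)
flat = tf.layers.flatten(conv2)
hidden = tf.layers.dense(flat, 256, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 8)                       # assume 8 discrete actions

# Plain supervised learning: train the student to imitate the teacher's actions.
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=teacher_action, logits=logits))
train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)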

TensorFlow for Neurobiologists

I couldn’t find anyone else who has done this, so I made this really quick guide.  It uses TensorFlow, which is complete overkill for this specific problem, but I figure that a simple example is much easier to follow.

Install and run the Python 3 notebook and TensorFlow.  On Linux, as a user, without using sudo:

$ pip3 install --upgrade --user ipython[all] tensorflow matplotlib
$ ipython3  notebook

Then in the notebook window, do New->Python 3

Here’s an example I made earlier. You can download the latest version on github here: https://github.com/johnflux/Spike-Triggered-Average-in-TensorFlow

Spike Triggered Average in TensorFlow

The data is an experimentally recorded set of spikes recorded from the famous H1 motion-sensitive neuron of the fly (Calliphora vicina) from the lab of Dr Robert de Ruyter van Steveninck.

This is a complete rewrite of non-TensorFlow code from the Coursera course Computational Neuroscience by the University of Washington. I am thoroughly enjoying this course!

Here we use TensorFlow to find out how the neuron is reacting to the data, to see what causes the neuron to trigger.

%matplotlib inline
import pickle
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()

FILENAME = 'data.pickle'

with open(FILENAME, 'rb') as f:
    data = pickle.load(f)

stim = tf.constant(data['stim'])
rho = tf.constant(data['rho'])
sampling_period = 2   # The data was sampled at 500 Hz, i.e. one sample every 2 ms
window_size = 150     # Use a 300 ms sliding window = 300 ms / sampling_period = 150 samples

We now have our data loaded into TensorFlow as constants, which means that we can easily ‘run’ our TensorFlow graph. For example, let’s examine stim and rho:

print("Spike-train time-series =", rho.eval(),
      "\nStimulus time-series     =", stim.eval())
Spike-train time-series = [0 0 0 ..., 0 0 0] 
Stimulus time-series    = [-111.94824219  -81.80664062 
    10.21972656 ...,  9.78515625 24.11132812 50.25390625]

rho is a binary array where a 1 indicates a spike. Let’s turn that into an array of indices where the value is 1, ignoring the first window_size elements.

Note: We can use the [] and + operations on a TensorFlow tensor, and it correctly adds those operations to the graph. This is equivalent to using the tf.slice and tf.add operations.

spike_times = tf.where(tf.not_equal(rho[window_size:-1], 0)) + window_size
print("Time indices where there is a spike:\n", spike_times.eval())
Time indices where there is a spike:
 [[   158]
 [   160]
 [   162]
 ...]

def getStimWindow(index):
    i = tf.cast(index, tf.int32)
    return stim[i-window_size+1:i+1]
stim_windows = tf.map_fn(lambda x: getStimWindow(x[0]), spike_times, dtype=tf.float64)
spike_triggered_average = tf.reduce_mean(stim_windows, 0).eval()
print("Spike-triggered average is:", spike_triggered_average[0:5], "(truncated)")
Spike-triggered average is: [-0.33083048 -0.29083503 -0.23076012 -0.24636984 -0.10962767] (truncated)

Now let’s plot this!

time = (np.arange(-window_size, 0) + 1) * sampling_period
plt.plot(time, spike_triggered_average)
plt.xlabel('Time (ms)')
plt.title('Spike-Triggered Average')



It’s… beautiful!

What we are looking at here is that we’ve discovered that our neuron is doing a leaky integration of the stimulus, and when that integration reaches a certain value, it triggers a spike.

See the GitHub repo for the full source: https://github.com/johnflux/Spike-Triggered-Average-in-TensorFlow
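
If you want to sanity-check the result without TensorFlow, here’s a minimal NumPy sketch of the same computation (assuming the same data.pickle file and window_size as above):

import pickle
import numpy as np

with open('data.pickle', 'rb') as f:
    data = pickle.load(f)

window_size = 150
stim_np = data['stim']
rho_np = data['rho']

# Indices of spikes, ignoring the first window_size samples (as above)
spike_idx = np.nonzero(rho_np[window_size:-1])[0] + window_size

# Stack the stimulus window preceding each spike, then average
windows = np.stack([stim_np[i - window_size + 1:i + 1] for i in spike_idx])
sta_np = windows.mean(axis=0)   # should closely match spike_triggered_average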

Update: I was curious how much noise there was. Here’s a plot with 1 standard deviation shown in light blue:

mean, var = tf.nn.moments(stim_windows,axes=[0])
plt.errorbar(time, spike_triggered_average, yerr=tf.sqrt(var).eval(), ecolor="#0000ff33")


Yikes!  This is why the input signal MUST be Gaussian, and why we need lots of data to average over.  Here we’re averaging over 53,583 windows.

Coursera Neural Networks for Machine Learning

I’ve completed another 4 month Coursera Machine Learning course. The first was Andrew Ng’s Machine Learning course, and the second is Geoffrey Hinton’s Neural Networks for Machine Learning course.  I also completed a 2 month Coursera course in Computational Neuroscience.


Course certificate

Both courses use Matlab, and this second one was significantly more work and significantly buggier. Several of the exam questions had key bits of information accidentally left out and referred to images that didn’t exist; only by referring to the user forums could the missing information be found. However, I was one of the first people to do this course, so bugs are to be expected.

Technical issues aside, I enjoyed the course a lot. It’s approximately 5 years behind the current state of the field, but the information is still very relevant.

Edit: Update on 2017/2/5 – I also did the 2 month Computational Neuroscience course:


I enjoyed the first half of the course, which was all new and interesting to me.  But the second half was just a basic version of the Andrew Ng course.



Dilation Animation

There is a neat git repository that generates animations of various convolution settings. But unfortunately it didn’t show dilation – which is one of my favourite Convolutional Neural Network techniques.

Edit: My changes are now merged

So I added support.

And here’s what my patch generates:  The bottom layer is the input layer, the top is the output layer.


No stride, no padding, dilation of 2, kernel size of 3

The idea with dilation is that at each layer you double the dilation size.  That way your receptive field grows exponentially with each layer, allowing you to deal with large images without pooling and thus keep pixel-level sharpness.
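
For concreteness, here’s a minimal sketch of a dilated (atrous) convolution in TensorFlow 1.x; the input and filter shapes are made up purely for illustration:

import tensorflow as tf

# A batch of single-channel images and a 3x3 kernel with 16 output channels
images = tf.placeholder(tf.float32, [None, 64, 64, 1])               # NHWC layout
kernel = tf.Variable(tf.truncated_normal([3, 3, 1, 16], stddev=0.1))

# rate=2 inserts one hole between kernel taps, so the 3x3 kernel covers a
# 5x5 region without adding parameters or pooling.
dilated = tf.nn.atrous_conv2d(images, kernel, rate=2, padding='SAME')

Stacking such layers with rates 1, 2, 4, 8, … is what gives the exponentially growing receptive field.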


Second order back propagation

This is a random idea that I’ve been thinking about. A reader messaged me to say that this looks similar to online L-BFGS. To my inexperienced eyes, I can’t see this myself, but I’m still somewhat of a beginner.


Yes, I chose a cat example simply so that I had an excuse to add cats to my blog.

Say we are training a neural network to take images of animals and classify the image as being an image of a cat or not a cat.

You would train the network to output, say, y_{cat} = 1 if the image is that of a cat.

To do so, we can gather some training data (images labelled by humans), and for each image we see what our network predicts (e.g. “I’m 40% sure it’s a cat”).  We compare that against what a human says (“I’m 100% sure it’s a cat”), find the squared error (“We’re off by 0.6, so our squared error is 0.6^2”), and adjust each parameter, w_i, in the network so that it slightly decreases the error (\Delta w_i  = -\alpha \partial E/\partial w_i).  And then repeat.

It’s this adjustment of each parameter that I want to rethink.  The above procedure is Stochastic Gradient Descent (SGD) – we adjust w_i to reduce the error on our training set (I’m glossing over overfitting, minibatches, etc).

Key Idea

This means that we are also looking for a local minimum, i.e. once trained, we want the property that if we vary any of the parameters w_i by a small amount, then the expected squared error E should increase.

My idea is to encode this into the SGD update. To find a local minimum for a particular test image we want:

\dfrac{\partial y_{cat}}{\partial w_i} = 0

\dfrac{\partial^2 y_{cat}}{\partial w_i^2}  < 0 (or if it equals 0, we need to consider the third derivative, etc.)

Let’s concentrate on just the first criterion for the moment. Since we’ve already used the letter E for the half squared error of y, we’ll use F for half the square of \dfrac{\partial y_{cat}}{\partial w_i}.

So we want to minimize F:

F = \dfrac{1}{2}\left(\dfrac{\partial y_{cat}}{\partial w_i}\right)^2

So to minimize we need the gradient of this error:

\dfrac{\partial F}{\partial w_i} = \dfrac{1}{2} \dfrac{\partial}{\partial w_i} \left(\dfrac{\partial y_{cat}}{\partial w_i}\right)^2 = 0

Applying the chain rule:

\dfrac{\partial F}{\partial w_i} = \dfrac{\partial y_{cat}}{\partial w_i} \dfrac{\partial^2 y_{cat}}{\partial w_i^2} = 0

SGD update rule

And so we can modify our SGD update rule to:

\Delta w_i = -\alpha \partial E/\partial w_i - \beta \dfrac{\partial y_{cat}}{\partial w_i} \dfrac{\partial^2 y_{cat}}{\partial w_i^2}

Where \alpha and \beta are learning rate hyperparameters.


We finished with a new SGD update rule. I have no idea if this actually will be any better, and the only way to find out is to actually test. This is left as an exercise for the reader 😀
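
As a starting point for testing it, here’s a minimal sketch of how the extra term could be computed with TensorFlow’s automatic differentiation.  One caveat: differentiating F summed over all weights, \frac{1}{2}\sum_j \left(\dfrac{\partial y_{cat}}{\partial w_j}\right)^2, gives \sum_j \dfrac{\partial y_{cat}}{\partial w_j} \dfrac{\partial^2 y_{cat}}{\partial w_j \partial w_i}, which includes cross terms rather than only the diagonal term used in the rule above.  The names y_cat, E and weights below are assumptions standing in for whatever your model defines.

import tensorflow as tf

# Assumed to already exist in the graph: y_cat (scalar network output for one
# image), E (the usual half squared error), and weights (a list of tf.Variables).
alpha, beta = 0.01, 0.001

grads_y = tf.gradients(y_cat, weights)      # dy_cat/dw for every weight tensor
F = 0.5 * tf.add_n([tf.reduce_sum(tf.square(g)) for g in grads_y])

grads_E = tf.gradients(E, weights)          # the usual SGD term
grads_F = tf.gradients(F, weights)          # gradient of the new penalty term

train_step = tf.group(*[w.assign_sub(alpha * gE + beta * gF)
                        for w, gE, gF in zip(weights, grads_E, grads_F)])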


Best-fitting to a Cumulative distribution function in python TensorFlow

I wanted to find a best fit curve for some data points when I know that the true curve that I’m predicting is a parameter-free Cumulative Distribution Function (CDF).  I could just do a linear regression on the points, but the resulting function y might not have the properties that we desire from a CDF, such as:

  • Monotonically increasing
  • y(x) \xrightarrow{x\to\infty} 1      i.e. That it tends to 1 as x approaches positive infinity
  • y(x) \xrightarrow{x\to-\infty} 0      i.e. That it tends to 0 as x approaches negative infinity

First, consider the last two properties.  To handle them, I use the sigmoid function to create a new variable, x’:

x'(x) = \textrm{sigmoid}(x) = \dfrac{1}{1+e^{-x}}

This has the nice property that x'(x) \xrightarrow{x\to\infty} 1 and x'(x) \xrightarrow{x\to-\infty} 0

So now we can find a best fit polynomial on x'.


y(x') = a + bx' + cx'^2 + dx'^3 + .. + zx'^n

But with the boundary conditions that:

y(0) = 0 and y(1) = 1

Which, solving for the boundary conditions, gives us:

y(0) = 0 = a
b = (1-c-d-e-...-z)

Which simplifies our function to:

y(x') = (1-c-d-e-..-z)x' + cx'^2 + dx'^3 + ex'^4 .. + zx'^n

Implementing this in tensorflow using stochastic gradient descent (code is below) we get:


20 data samples


100 data samples

(The equation in the graph title is wrong.  It should be x' = sigmoid(x), then y(x') = ax' + ...  I was too lazy to update the graphs, sorry.)

Unfortunately this curve doesn’t have one main property that we’d like – it’s not monotonically increasing: it overshoots above 1 and then has to come back down. I thought about it for a few hours and couldn’t think of a nice solution.

I’m sure all of this has been solved properly before, but with a quick google I couldn’t find anything.

Edit: I’m updating this 12 days later. I have been thinking in the background about how to enforce that my function must be monotonic (i.e. always increasing/decreasing). I was certain that I was just being stupid and that there was a simple answer. But by complete coincidence, I watched the YouTube video Breakthroughs in Machine Learning – Google I/O 2016, in which the last speaker mentioned this restriction.  It turns out that this is a very difficult problem – with the first solutions having a time complexity of O(N^4):

I didn’t understand what exactly Google’s solution for this is, but they reference their paper on it:  Monotonic Calibrated Interpolated Look-up Tables, JMLR 2016.  It seems that I have some light reading to do!

#!/usr/bin/env python3

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from scipy.stats import norm

# Create some random data as a binned normal function
n_observations = 20
fig, ax = plt.subplots(1, 1)
xs = np.linspace(-3, 3, n_observations)

ys = norm.pdf(xs) * (1 + np.random.uniform(-1, 1, n_observations))

ys = np.cumsum(ys)

ax.scatter(xs, ys)

highest_order_polynomial = 3

# We have our data now, so on with the tensorflow
# Setup the model graph

# Our input is an arbitrary number of data points (that's what the 'None' dimension means)
# and each input has just a single value, which is a float

X = tf.placeholder(tf.float32, [None])
Y = tf.placeholder(tf.float32, [None])

# Now, we know data fits a CDF function, so we know that
# ys(-inf) = 0   and ys(+inf) = 1

# So let's set:

X2 = tf.sigmoid(X)

# So now X2 is between [0,1]

# Let's now fit a polynomial like:
#   Y = a + b*(X2) + c*(X2)^2 + d*(X2)^3 + .. + z(X2)^(highest_order_polynomial)
# to those points.  But we know that since it's a CDF:
#   Y(0) = 0 and y(1) = 1
# So solving for this:
#   Y(0) = 0 = a
#   b = (1-c-d-e-...-z)
# This simplifies our function to:
#   y = (1-c-d-e-..-z)x + cx^2 + dx^3 + ex^4 .. + zx^(highest_order_polynomial)
# i.e. we have highest_order_polynomial-1 weights
W = tf.Variable(tf.zeros([highest_order_polynomial-1]))

# Now set up our equation.  Note that X2 is already sigmoid(X), so the
# linear term is simply b * X2:
b = (1 - tf.reduce_sum(W))
Y_pred = b * X2

for n in range(2, highest_order_polynomial+1):
    Y_pred = Y_pred + W[n-2] * tf.pow(X2, n)

# Loss function measure the distance between our observations
# and predictions and average over them.
cost = tf.reduce_sum(tf.pow(Y_pred - Y, 2)) / (n_observations - 1)

# Regularization if we want it.  Only needed if we have a large value for the highest_order_polynomial and are worried about overfitting
# It's also not clear to me if regularization is going to be doing what we want here
#cost = tf.add(cost, regularization_value * (b + tf.reduce_sum(W)))

# Use stochastic gradient descent to optimize W
# using adaptive learning rate
optimizer = tf.train.AdagradOptimizer(100.0).minimize(cost)

n_epochs = 10000
with tf.Session() as sess:
    # Initialize all the variables in the graph so we can use them
    sess.run(tf.global_variables_initializer())

    # Fit all training data
    prev_training_cost = 0.0
    for epoch_i in range(n_epochs):
        sess.run(optimizer, feed_dict={X: xs, Y: ys})

        training_cost = sess.run(
            cost, feed_dict={X: xs, Y: ys})

        if epoch_i % 100 == 0 or epoch_i == n_epochs-1:
            print("Training cost:", training_cost, "after epoch:", epoch_i)

        # Allow the training to quit if we've reached a minimum
        if np.abs(prev_training_cost - training_cost) < 0.0000001:
            break
        prev_training_cost = training_cost

    w = W.eval()
    b = 1 - np.sum(w)
    equation = "x' = sigmoid(x), y = " + str(b) + "x' + " + \
        " + ".join("{}x'^{}".format(val, idx) for idx, val in enumerate(w, start=2))
    print("For function:", equation)
    pred_values = Y_pred.eval(feed_dict={X: xs}, session=sess)
    ax.plot(xs, pred_values, 'r')

#ax.set_ylim([-3, 3])

Logistic Regression and Regularization

Tons has been written about regularization, but I wanted to see it for myself to try to get an intuitive feel for it.

I loaded a dataset from Google into Python (a set of images of letters) and implemented a double for-loop to run logistic regression with different training set sizes and different regularization parameters.  (The value shown in the graph is actually 1/regularization.)


from sklearn import linear_model
import numpy as np
import matplotlib.pyplot as plt

def doLogisticRegression(trainsize, regularizer):
    fitmodel = linear_model.LogisticRegression(C=regularizer)
    train_datasubset = train_dataset[0:trainsize,:,:].reshape(trainsize, -1)
    fitmodel.fit(train_datasubset, train_labels[0:trainsize])
    # Return the accuracy on the training subset and on the validation set
    return [fitmodel.score(train_datasubset, train_labels[0:trainsize]),
            fitmodel.score(valid_dataset.reshape(valid_dataset.shape[0], -1), valid_labels)]

trainsizes = [50, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000,
              4000, 5000, 10000, 50000, 100000, 200000]
for regularizer in [1, 0.1, 0.01, 0.001]:
    results = np.array([doLogisticRegression(x, regularizer) for x in trainsizes])
    dashedplot = plt.plot(trainsizes, results[:,1], '--', label=("r:" + str(regularizer)))
    plt.plot(trainsizes, results[:,0], c=dashedplot[0].get_color(), label=("r:" + str(regularizer)))
plt.legend(loc='best')

The result is very interesting. The solid line is the training set accuracy, and the dashed line is the validation set accuracy. The vertical axis is the accuracy rate (percentage of images recognized as the correct letter) and the horizontal axis is the number of training examples.


Image to letter recognition accuracy against training size, for various values of r = 1/regularization_factor.  Solid line is training set accuracy, dotted line is validation set accuracy.

First, I find it fascinating that a pure logistic regression can recognize letters with 82% accuracy. If you added in spell checking and ran this over an image, you could probably get a pretty decent OCR system from purely a logistic regression.

Second, it’s interesting to see the effect of the regularization term. At fewer than about 500 training examples, the regularization term only hurts the algorithm. (A value of 1 means no regularization.) At about 500 training examples though, strong regularization really helps. As the number of training examples increases, regularization makes less and less of an impact, and everything converges at around 200,000 training samples.

It’s quite clear at this point, at 200,000 training samples, that we are unlikely to get more improvements with more training samples.

A good rule of thumb that I’ve read is that you need approximately 50 training samples per feature. Since we have 28×28 = 784 features, that would be about 40,000 training samples, which is actually only a couple of percent below our peak performance at 200,000 training samples (which is 200,000/784 ≈ 255 training samples per feature).

At this point, we could state fairly confidently that we need to improve the model if we want to improve performance.

Stochastic Gradient Descent

I reran with the same data but with stochastic gradient descent (batch size 128) and no regularization.  The accuracy on the validation set (after 9000 runs) was about the same as the best case with the logistic regression (81%), but it took only a fraction of the time to run.  It took just a few minutes, versus a few hours for the logistic regression.

Stochastic Gradient Descent with 1 hidden layer

I added a hidden layer (of size 1024) and reran. The accuracy was only marginally better (84%).  Doubling the number of runs increased this accuracy to 86%.
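
For reference, here’s a minimal sketch of the kind of one-hidden-layer network described above (1024 ReLU units, minibatch SGD with batch size 128), in the same TensorFlow 1.x style as the earlier code.  The dataset variable names and the 10-class output are assumptions.

import numpy as np
import tensorflow as tf

# Assumed to exist: train_dataset (N x 28 x 28 floats) and train_labels
# (N x num_labels one-hot floats), as in the logistic regression experiment.
image_size, num_labels, hidden_size, batch_size = 28, 10, 1024, 128

x = tf.placeholder(tf.float32, [None, image_size * image_size])
y = tf.placeholder(tf.float32, [None, num_labels])

w1 = tf.Variable(tf.truncated_normal([image_size * image_size, hidden_size], stddev=0.1))
b1 = tf.Variable(tf.zeros([hidden_size]))
w2 = tf.Variable(tf.truncated_normal([hidden_size, num_labels], stddev=0.1))
b2 = tf.Variable(tf.zeros([num_labels]))

hidden = tf.nn.relu(tf.matmul(x, w1) + b1)
logits = tf.matmul(hidden, w2) + b2
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(9001):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_x = train_dataset[offset:offset + batch_size].reshape(batch_size, -1)
        batch_y = train_labels[offset:offset + batch_size]
        sess.run(train_step, feed_dict={x: batch_x, y: batch_y})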