Learning Machine Learning

Heh, funny title.

I became interested in machine learning while working for Nokia.  I worked on Nokia’s Z Launcher application for Android.  You can scribble a letter (or several), and it recognizes your writing and searches for it.  The app is available for download in the Play Store.


I worked on the Nokia Z Launcher’s handwriting recognition

Specifically, I was tasked with optimizing the speed of the recognition.  I don’t know if I can state any specifics on how the character recognition was done, but I will say that I managed to increase the speed of the recognition a hundredfold.

That recognition was actually a relatively simple task compared to modern-day deep neural networks, but it really whetted my appetite to understand more.

When AlphaGo beat Lee Sedol, I knew that I simply must understand Deep Neural Networks.

Below is my journey in understanding, along with my reflective thoughts:

  1. I started off naively implementing an LSTM neural network without any training.
    I wanted to have a crude appreciation of the problems before I read about the solutions.  My results are documented in my previous post here, but looking back it’s quite embarrassing to read.  I don’t regret doing this at all, however.
  2. Next I did the Andrew Ng Coursera Machine Learning course.
    This is an 11 week course in the fundamentals.  I completed it along with all the coursework, but by the end I felt that my knowledge was about 20 years out of date.  It was really nice to get the fundamentals, but none of the modern discoveries were discussed at all.  Nothing about LSTMs, Markov chains, dropout, etc.
    The exercises are also all done in Matlab/Octave, which has fallen out of favour and uses a lot of support code.  I certainly didn’t feel comfortable in implementing a neural network system from scratch after watching the course.


    Passed the Coursera Machine learning course with 97.6% score.

    The lecturer, Andrew Ng, was absolutely awesome.  My complaints really boil down to wishing the course were twice as long, so that I could learn more from him!  I now help out in a machine learning chat group and find that most of the questions that people ask about TensorFlow, Theano etc. are actually basics that are answered very well by Andrew Ng’s course.  I constantly direct people to the course.

  3. Next, Reinforcement Learning.  I did a 4 week Udacity course UD820.
    This was taught by a pair of teachers who constantly joke with each other.  At first I thought it would annoy me, but I actually really enjoyed the course – they work really well together.  They are a lot more experienced and knowledgeable than they pretend to be.  (They take it in turns to be ignorant, to play the role of a student.)
  4. I really wanted to learn about TensorFlow, so I did a 4 week Udacity course, UD730.
    Again I did all the coursework for it.  I thought the course was pretty good, but it really annoyed me that each video was only about 2 minutes long, resulting in a 30 second pause every 2 minutes while the next video loaded.  Most frustrating.
  5. At this point, I started reading papers and joined up for the Visual Doom AI competition.
    I have come quite far in my own Visual Doom AI implementation, but the vast majority of the work is the ‘setup’ required.  For example, I had to fix bugs in their Doom engine port to get the built-in AI to work.  And it was a fair amount of work to get the game to run nicely with TensorFlow, with mini-batch training, testing and verification stages.
    I believe I understand how to properly implement a good AI for this, with the key coming from guided policy search, in a method similar to that pioneered by Google for robotic control.  (The link is to a keynote at the International Conference on Learning Representations 2016.)  The idea is to hack the engine to give me accurate internal data about the positions of enemies, walls, health, etc. that I can use to train a very simple ‘teacher’.  Then use that teacher to supervise a neural network that has only visual information, thus allowing us to train a deep neural network with back-propagation.  By alternating between teacher and student, we can converge upon perfect solutions.  I hope to write more about this in a proper blog post; a rough sketch of the loop is shown just after this list.
  6. The International Conference on Learning Representations (ICLR) 2016
    The videos were absolutely fascinating, and were surprisingly easy to follow with the above preparation.
  7. I listened to the entire past two years of The Talking Machines podcast.  It highlighted many areas that I was completely unfamiliar with, and many things that I knew about but just didn’t realise were important.
  8. I did the Hinton Coursera course on Neural Networks for Machine Learning, which perfectly complemented the Andrew Ng Coursera Course.  I recommend these two courses the most, for the foundations.  It is about 5 years out of date, but is all about the fundamentals.
  9. I did the Computational Neuroscience course.  The first half was interesting and was about, well, neuroscience.  But the math lectures were delivered in a slow monotone that really put me straight to sleep.  The second half of the course was just a copy of the Andrew Ng course (they even said so), so I skipped all the lectures and did the exams with no problems.  I really liked that they let you do the homework in Python.  It is easier in Matlab, but I really wanted to improve my Python data science skills.  The exams took way, way longer than their predicted times: they would have 20 questions, requiring you to download, run and modify 4 programs, and say that you should be able to do it all in an hour!
  10. I completed the IBM Artificial Intelligence class.  This is the second most expensive AI class that I’ve done, at $1600 plus $50 for the book.  I was really not at all impressed by it.  Most of the videos are a minute long or less, which means I spent half the time waiting for the next video to load.  The main lecturer gets on my nerves – he wears a brightly colored Google Glass for no apparent reason other than to show it off, and speaks in the most patronizing way.  You get praised continually just for signing up to the course.  It’s also overly specialized: you use their specific libraries, which aren’t at all production ready, and you use toy data with a tiny number of training examples (e.g. trying to train the gesture ‘toy’ with a single training example!).  Contrast this against the Google Self Driving course:
  11. The Google Self Driving course, which is $2400.  This is much better than the IBM course.  The main difference is that although the theme is self-driving cars, you do everything in TensorFlow and learn generic techniques that could be applied to any machine learning field.  You quickly produce code that could easily be made production ready.  You work with large, realistic data and large, realistic neural networks, and they teach you to use Amazon AWS servers to train on.  The result is code that can be (and literally is!) deployed to a real car.
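
As mentioned in item 5, here is a rough sketch of the teacher/student alternation I have in mind for the Doom AI.  Everything here is illustrative only – these are not the real ViZDoom or engine APIs, just placeholders to show the structure:

import numpy as np

N_ACTIONS = 8

def get_internal_state(step):
    # Placeholder: privileged data hacked out of the engine
    # (enemy positions, walls, health, ...).
    return np.random.randn(16)

def get_screen_pixels(step):
    # Placeholder: the pixels that the student is allowed to see.
    return np.random.rand(120, 160, 3)

def teacher_policy(internal_state):
    # A very simple 'teacher' that only sees the privileged internal state.
    # Here it is just a fixed linear rule, purely for illustration.
    w = np.ones((16, N_ACTIONS))
    return int(np.argmax(internal_state @ w))

class StudentNet:
    # Stand-in for a deep convolutional network that sees only pixels.
    def train_batch(self, frames, actions):
        pass  # supervised learning: cross-entropy against the teacher's actions
    def act(self, frame):
        return np.random.randint(N_ACTIONS)

student = StudentNet()
for iteration in range(10):
    # 1. Roll out the current student in the game, recording what it sees,
    #    and label every frame with the action the teacher would have taken.
    frames, actions = [], []
    for step in range(1000):
        frame = get_screen_pixels(step)
        _ = student.act(frame)  # the student drives; the teacher only labels
        frames.append(frame)
        actions.append(teacher_policy(get_internal_state(step)))
    # 2. Supervised update of the student on the teacher's labels.
    student.train_batch(np.array(frames), np.array(actions))
    # 3. In full guided policy search you would also re-fit the teacher around
    #    the states the student actually visits, and repeat.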

Machine Translated renaming of files

I needed to rename a whole load of files that were in Japanese, so I wrote a Python program that translates the filenames using Google Translate.

It’s not at all fancy, just run it and pass the filenames to translate and rename.

E.g.:

$ ls
こんにちは世界.png
$ sudo pip3 install googletrans
$ translate_rename.py こんにちは世界.png
こんにちは世界.png -> Hello_World.png
$ ls
Hello_World.png
#!/usr/bin/python3

import sys, re, os
from googletrans import Translator

translator = Translator()

sourceLanguage = 'ja'
destLanguage = 'en'

# Set to false to actually rename the files
dryRun = True

def translate_and_rename(filename):
    filenameSplit = filename.rsplit('.',1)
    translated = translator.translate(filenameSplit[0], src=sourceLanguage, dest=destLanguage).text
    translated = re.sub( '[^a-zA-Z0-9.]+', '_', translated).strip().title()
    if len(filenameSplit) > 1:
        translated += '.' + filenameSplit[1]
    if filename == translated:
        print(filename, ' (unchanged)')
    else:
        print(filename, " -> ", translated)
        if not dryRun:
            os.rename(filename, translated)

def main(argv):
    if len(argv) == 1:
        print("Need to pass filenames to translate and rename")
        return

    for x in argv[1:]:
        translate_and_rename(x)
    if dryRun:
        print()
        print("  Dry run only - no actual changes made  ")
        print()
        print("Edit this file and set DryRun to True")

if __name__ == "__main__":
    main(sys.argv)

I worked on mitigating the CPU ‘Meltdown’ bug, 5 years before it was known about

Maybe a clickbait title, sorry, but I couldn’t think of a better one.

The CPU ‘Meltdown’ bug affects Intel CPUs, and from Wikipedia:

Since many operating systems map physical memory, kernel processes, and other running user space processes into the address space of every process, Meltdown effectively makes it possible for a rogue process to read any physical, kernel or other processes’ mapped memory—regardless of whether it should be able to do so. Defenses against Meltdown would require avoiding the use of memory mapping in a manner vulnerable to such exploits (i.e. a software-based solution) or avoidance of the underlying race condition (i.e. a modification to the CPUs’ microcode and/or execution path).

This separation of user and kernel memory space is exactly what I worked on from 2012 to 2014, on behalf of Deutsche Telekom, using the L4 hypervisor:

[Architecture diagram]

The idea was to give each service its own separate memory space, designed on the assumption that the main OS has been compromised and is not trustworthy (e.g. because of the Meltdown bug).  I personally worked on the graphics driver – splitting the kernel graphics driver into two parts: one side that the app talks to, which has to be considered compromised, and one side that actually talks to the hardware.

Here’s my work in action:

[Photos of the SiMKo 3 handset running Angry Birds]

Yes, I did actually use Angry Birds as my test. Fully hardware accelerated too 🙂

Unfortunately, the problem was that it took too long to port each phone across.  It took me a year to port the graphics driver changes across, and a similar time for my colleagues to do the other drivers.  And then another year for it to actually hit the market.  The result was that the phone was always over 2 years out of date by the time it hit the market, which is a long time in mobile phone terms.

Still, our software would be immune to this type of bug, and that’s kinda cool.  Even if it did fail commercially 😉

TypeScript + lodash map and filter

I love TypeScript.  I use it whenever I can.  That said, sometimes it can be…  interesting.  Today, out of the blue, I got this TypeScript error in code that used to work:

[06:53:30]  typescript: src/mycode.ts, line: 57 
            Property 'video' does not exist on type 'number | (<U>(callbackfn: (value: Page, index: number, 
            array: Page[]) => U, thisA...'. Property 'video' does not exist on type 'number'. 

The code looks like:

return _.chain(pages)
        .filter((s, sIdx) => s.video || s.videoEmbedded)
        .map((s, sIdx) => {
            if (s.video) { ... }

Can you spot the ‘error’?

The problem is that s.video || s.videoEmbedded isn’t returning a boolean.  It’s returning a truthy value, but not a boolean.  And the lodash TypeScript developers made a change 1 month ago that means filter() now only accepts booleans, not any truthy value.  And the lodash TypeScript developers are finding that fixing this becomes very complicated.  See the full conversation here:

https://github.com/DefinitelyTyped/DefinitelyTyped/issues/21485

(Open issue at time of writing. Please leave me feedback or message me if you see this bug get resolved)

The workaround/fix is to just make sure it’s a boolean, e.g. use !! or Boolean(..), or:

return _.chain(pages)
        .filter((s, sIdx) => s.video !== null || s.videoEmbedded !== null )
        .map((s, sIdx) => {
            if (s.video) { ... }

Box for cat

Just a light-hearted post today.  My daughter had a great idea to build a felt house for her teddy cat, so we did it together.

It was a fun project, and simple to make.  The wood is just children’s blocks glued together, and the felt is stapled on.  The strap was cut from an ecobag.  A piece of velcro was added to hold the door open.

I don’t like PID control

I’ve implemented PID control far too many times now, and I’ve never been that happy with the results.

It is simple to implement, but tricky to tune.  My thick control theory book says that 95% of commercial in-use PID systems are sub-optimally tuned.  And I can believe it.

I’ve recently been playing with other systems, and currently I really like Model Predictive Control.  It does require some significant compute power, which was a big problem when it was invented, but not so much any more.  (As a side point, I’d like to implement MPC in TensorFlow.  That would be cool, and could probably be done in a weekend.)

I wrote a PID control system for a self-driving car.  It is dumbly auto-tuned, so to speak, by a simple ‘twiddle’ method: it tests out a particular set of PID values for one minute, resets the track, then picks a new set of PID values based on the best values seen so far, changed slightly.
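
For the curious, the twiddle tuner is roughly the following – a sketch rather than my exact code, with run_track() standing in as a placeholder for “drive the track for a minute with these gains and return the accumulated error”:

# Rough sketch of the 'twiddle' (coordinate ascent) tuner described above.
def run_track(pid_gains):
    # Placeholder cost function; the real one drives the simulator for a minute.
    kp, ki, kd = pid_gains
    return (kp - 0.2)**2 + (ki - 0.004)**2 + (kd - 3.0)**2

def twiddle(tolerance=0.001):
    gains = [0.0, 0.0, 0.0]    # Kp, Ki, Kd
    deltas = [1.0, 1.0, 1.0]   # how much to nudge each gain
    best_error = run_track(gains)
    while sum(deltas) > tolerance:
        for i in range(len(gains)):
            gains[i] += deltas[i]
            error = run_track(gains)
            if error < best_error:            # nudging up helped: be bolder
                best_error = error
                deltas[i] *= 1.1
            else:
                gains[i] -= 2 * deltas[i]     # try nudging down instead
                error = run_track(gains)
                if error < best_error:
                    best_error = error
                    deltas[i] *= 1.1
                else:                         # neither helped: undo and shrink the step
                    gains[i] += deltas[i]
                    deltas[i] *= 0.9
    return gains

print(twiddle())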

The results are…  very underwhelming.

The same driving, but using Model Predictive Control, below, starts to look a lot nicer.  For reference, the PID controller took a total of two days to implement, and the MPC took a total of three days.  The MPC version also has an additional complication – there’s a simulated delay of 100ms.

But I’m very pleased with the results.  It’s a bit jerky here due to random latency variations, made worse by the screen recording software.  I’ve got some ideas on trying to solve that by timestamping, but I don’t think I’ll have time to work on it.
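
For anyone curious what the receding-horizon idea looks like in code, here’s a minimal sketch – using scipy rather than the TensorFlow version I’d like to write, and with a deliberately trivial 1D vehicle model and cost, just to show the structure:

# Minimal Model Predictive Control sketch: at every step, optimize a short
# sequence of future controls against a model, apply only the first control,
# then re-plan.  The 1D "vehicle" model and cost here are illustrative only.
import numpy as np
from scipy.optimize import minimize

DT, HORIZON = 0.1, 10   # 100ms steps, plan 1 second ahead

def simulate(state, controls):
    # Roll a trivial model forward: state = (position, velocity), control = acceleration.
    pos, vel = state
    trajectory = []
    for accel in controls:
        vel += accel * DT
        pos += vel * DT
        trajectory.append((pos, vel))
    return trajectory

def cost(controls, state, target_pos):
    trajectory = simulate(state, controls)
    # Penalize distance from the target, plus a little for large actuation.
    return sum((pos - target_pos)**2 for pos, vel in trajectory) + 0.1 * np.sum(np.square(controls))

def mpc_step(state, target_pos):
    guess = np.zeros(HORIZON)
    result = minimize(cost, guess, args=(state, target_pos),
                      method='SLSQP', bounds=[(-1.0, 1.0)] * HORIZON)
    return result.x[0]   # apply only the first planned control, then re-plan

state = [0.0, 0.0]
for step in range(50):
    accel = mpc_step(tuple(state), target_pos=10.0)
    state[1] += accel * DT
    state[0] += state[1] * DT
print(state)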

I would love to try applying this MPC system to a quadcopter.  I’ve written quite a bit previously about latency problems with PID in a quadcopter.  (In this video, the UFO-shaped quadcopter is attempting to move to, and maintain, a fixed position above the chair.  Ignore the blue man 🙂 )

Why Kalman Filters are broken, and how to fix them

Okay, so it’s a clickbait title, because I’m referring to the most common case: using a single Gaussian distribution to represent the current estimated position (for example) and the measurement.

Here’s my problem with that:

They don’t take into account that sometimes things go wrong.

Now you may say ‘duh’, but it can cause some really serious problems if something does go wrong.

Say you have a self driving car.  And the following happens:

  • Prior: You currently estimate the position of a person in front of you to be at position:  x = 2m ± 0.5m     aka   X ~ N(2,0.5)
  • Data:  Your instruments return a new measure:  the person is now measured to be at position: x = 10m ± 0.5m   aka    X ~ N(10,0.5)
  • Posterior: You apply your Kalman Filter and it concludes that the position of the person is now at:  x = 6m ± 0.25m  aka   X ~ N(6,0.25)

[Plot of the prior N(2,0.5), the measurement N(10,0.5), and the resulting posterior N(6,0.25)]

  • You are now extremely confident that the person is at x=6m, despite previously thinking that they were at x=2m and your measurement data saying that they are at x=10m.  You, the self driving car, now happily drive at x=10m.

How confident would you, as a human, really be that the person is at x=6m?  Barely at all, right?  So something has gone wrong here.

Here’s an example in a simulator.  A car is driving from right to left and we are trying to track its true position with an Extended Kalman Filter.  The red circles ◯ indicate data from the simulated LASER, the blue circles ◔ are RADAR, and the green triangles ▲ are the position estimated by the Kalman filter.  At the bottom of the arc, a single bad LASER data point is injected.


Credit to Udacity for the simulator.

The result is that the predicted position goes way off, and takes many time steps to recover.

The blame all lies on Sherlock Holmes:

[Sherlock Holmes: “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.”]

The product N(2,0.5) \cdot N(10,0.5) results in a ridiculously improbable scenario – we have to scale up the result by about 10^{14} .  i.e. we are treating the highly improbable as truth.  Perhaps what Sherlock should have said is that it is far more likely that you were simply mistaken in your elimination!  i.e. that your prior or your data was wrong!
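
You can sanity-check both the overconfident posterior and that 10^{14} factor with a few lines of Python (treating the ±0.5 values as variances, as in the code further down):

import numpy as np

mu1, variance1 = 2.0, 0.5    # prior
mu2, variance2 = 10.0, 0.5   # measurement

# The standard Kalman/Bayes combination of two Gaussians:
variance3 = 1 / (1/variance1 + 1/variance2)
mu3 = (mu1/variance1 + mu2/variance2) * variance3
print(mu3, variance3)        # -> 6.0, 0.25 : the overconfident posterior

# Normalization constant of the product N(2,0.5) * N(10,0.5):
S = np.exp(-0.5 * (mu1 - mu2)**2 / (variance1 + variance2)) \
    / np.sqrt(2 * np.pi * (variance1 + variance2))
print(S, 1/S)                # S is about 5e-15, i.e. a scale-up factor of roughly 2e14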

Fixing the basic Kalman Filter

Here’s a small change we could make.  We can add a small probability of ‘wtf’ to our system.

  • Prior: You currently estimate the position of a person in front of you to be at position:  x_{prior} = 2 \pm 0.5   plus a W_{prior} chance of ‘wtf’:   aka: X \sim (1-W_{prior}) \cdot N(2,0.5) + W_{prior} \cdot \textrm{WTF}
  • Data:  Your instruments return a new measure:  the person is now measured to be at position: x_{data} = 10 \pm 0.5   plus a W_{data} chance of ‘wtf’:   aka:
    X \sim (1-W_{data}) \cdot N(10,0.5) + W_{data} \cdot \textrm{WTF}

  • Posterior: You apply your Bayes rule.  What do we get now?  We need the product of these two random variables, and then normalize by setting C such that the total integral is 1:
    \displaystyle{ \left( (1-W_{prior}) \cdot N(2,0.5) + W_{prior} \cdot \textrm{WTF} \right) \cdot \left(  (1-W_{data}) \cdot N(10,0.5) + W_{data} \cdot \textrm{WTF} \right) / C}
    (note that N(10,0.5) \cdot N(2,0.5) is proportional to a new Gaussian N(6,0.25), scaled by a factor S)
    = \left( (1-W_{prior}) \cdot (1-W_{data}) \cdot N(6,0.25) \cdot S + (1-W_{prior}) \cdot W_{data} \cdot N(2,0.5) + W_{prior} \cdot (1-W_{data}) \cdot N(10,0.5) + W_{prior} \cdot W_{data} \cdot \textrm{WTF}^2 \right) / C
    We know that the area must equal 1 for a PDF, so we can determine the scale factor C.  See the Appendix below for the technical details, including the formulas for S and C.
    So we have a solution, but the solution is a Gaussian mixture model – a sum of a Gaussian for the combination, a Gaussian for the prior, and a Gaussian for the data.  So let’s simply pick the one with the highest probability, and roll the other two into the WTF weight.

Here’s the result as a gif – the posterior is drawn where the posterior would normally be, and the shaded blue area is our ‘wtf’ version of the posterior as a weighted mixture model:


Various different priors and data.  There’s no updating step here – just showing different mean values for the prior and data.  The posterior is kept centered to make it easier to see.

It looks pretty reasonable!  If the prior and data agree to within their uncertainties, then the mixture model just returns the posterior.  And when they disagree strongly, we return a mixture of them.

Let’s now approximate that mixture model by a single Gaussian, choosing the Gaussian that minimizes the KL divergence:

\mu = \sum_i w_i \mu_i \qquad \sigma^2 = \sum_i w_i \left( \sigma_i^2 + (\mu_i - \mu)^2 \right)
(i.e. moment matching: the single Gaussian takes on the overall mean and variance of the mixture)

Which results in the red line ‘Mixture Model Approximated’:


Different priors and data, with the red line showing this new method’s output.  As above, there’s no update step here – it’s not a time series.

Now this red line looks very reasonable!  When the prior and data disagree with each other, we have a prediction that is wide and covers both of them.  But when they agree with each other to within their uncertainties, we get a result that properly combines both and gives us a posterior result that is more certain than either – i.e. we are adding information.

In total, the code to just take a prior and data and return a posterior is this:
(Note: I’ve assumed there are no covariances – i.e. no correlation in the error terms)

import numpy as np

# prior input
mu1 = 5
variance1 = 0.5
# data input
mu2 = 6
variance2 = 0.5

# This is what you would normally do, in a standard Kalman Filter:
variance3 = 1/(1/variance1 + 1/variance2)
mu3 = (mu1/variance1 + mu2/variance2)*variance3

# But instead we will do:
# S is the normalization constant of the product of the prior and data Gaussians
S = 1/(np.sqrt(2*np.pi*variance1*variance2/variance3)) * np.exp(-0.5 * (mu1 - mu2)**2 * variance3/(variance1*variance2))
Wprior = 0.01  # 1% chance our prior is wrong
Wdata = 0.01   # 1% chance our data is wrong
C = (1-Wprior) * (1-Wdata) * S + (1-Wprior) * Wdata + Wprior * (1-Wdata)
weights = [
    Wprior * (1-Wdata) / C,          # weight of prior
    (1-Wprior) * Wdata / C,          # weight of data
    (1-Wprior) * (1-Wdata) * S / C]  # weight of the standard posterior
# Moment-match the three-component mixture down to a single Gaussian:
mu4 = weights[0]*mu1 + weights[1]*mu2 + weights[2]*mu3
variance4 = weights[0]*(variance1 + (mu1 - mu4)**2) + \
            weights[1]*(variance2 + (mu2 - mu4)**2) + \
            weights[2]*(variance3 + (mu3 - mu4)**2)

So mu3 and variance3 (the green curve in our graphs) are the usual approach, while mu4 and variance4 are my new approach and give a better estimate of the posterior.

The full IPython notebook code for everything produced here is in git, here:

https://github.com/johnflux/Improved-Kalman-Filter/blob/master/kalman.ipynb

Results: Testing out in Simulator

Unfortunately, in my simulator my Kalman filter has a velocity component, which I don’t actually take into account in my analysis yet.  My math completely ignores any dependent variables.  The problem is that when we increase the uncertainty of the position, we also need to increase the uncertainty of the velocity.

To salvage this, I used just a small part of the code:

double Scale = 1/(sqrt(2*M_PI*variance1*variance2/variance3)) * exp(-0.5 * (mu1 - mu2)*(mu1 - mu2) * variance3/(variance1*variance2));
if (Scale < 0.00000000001)
    return; /* Just completely ignore this data point */

The result is this:

[Simulator screenshot of the result]

Appendix:

Just some technical notes, for completeness.

  • We can set WTF to be a uniform distribution with density 1/(2A) anywhere between -A and A.  We can take A to be large, or even take the limit A \to \infty.
  • It is a probability density function since the area is 1:  \int^A_{-A} 1/(2A) dx  =  1 , and we can take the Cauchy principal value of the integral to be 1, to keep it a PDF in the limit.
  • \int^A_{-A} \textrm{WTF}^2(x) dx \to 0 since:  \int^A_{-A} 1/(2A)^2 dx  =  2A/(4A^2) = 1/(2A)   which tends to 0 as A  \to \infty.
  • The scale factor S is simple but tedious to calculate:
    S = \frac{1}{\sqrt{2\pi(\sigma_{prior}^2 + \sigma_{data}^2)}} \exp\left( -\frac{(\mu_{prior} - \mu_{data})^2}{2(\sigma_{prior}^2 + \sigma_{data}^2)} \right)
  • The scale factor C can then be calculated by just setting the integral to equal 1 since it’s a PDF:  C = (1-W_{prior}) \cdot (1-W_{data}) \cdot S + (1-W_{prior}) \cdot W_{data} + W_{prior} \cdot (1-W_{data})

Python heat map image

For my self-driving car, I detect cars in the image by scanning over the image with windows of varying sizes, and use the detection results to build up a ‘heat map’ of the most likely areas for where the cars are.
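
The heat map accumulation itself is nothing fancy – roughly the following sketch, with detect_car() standing in as a placeholder for the real classifier:

import numpy as np

def detect_car(window_pixels):
    # Placeholder for the real classifier.
    return False

def build_heat_map(image, window_sizes=(64, 96, 128), step=32):
    heat = np.zeros(image.shape[:2], dtype=np.float32)
    for size in window_sizes:
        for y in range(0, image.shape[0] - size, step):
            for x in range(0, image.shape[1] - size, step):
                if detect_car(image[y:y+size, x:x+size]):
                    heat[y:y+size, x:x+size] += 1   # every positive window adds heat
    return heat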

For visualization I came up with this idea:

[Heat map blended over the camera image]

In code:

heat_img = plt.cm.hot(heat/np.max(heat))   # colour-map the normalized heat (RGBA)
heat_img[heat<1, 3] = 0                    # make cold areas fully transparent
image2 = np.copy(image)
image2[heat>1, :] = image2[heat>1, :]/3 + (heat_img[heat>1, 0:3]*256*2/3).astype(np.uint8)
plt.imshow(image2)

My Self Driving Car on a difficult track at speed

Well, in a simulator 🙂

(Sorry about the bad filming – it was the easiest way to get even a semi-decent quality)

Code is here: https://github.com/johnflux/CarND-Behavioral-Cloning-P3

I basically follow the NVIDIA End to End Learning for Self-Driving Cars paper, with a few tweaks to add in dropout, slightly larger Dense layers to compensate, and an additional convolution layer to handle the different input image resolution.  It runs in real time on a GTX 980 Ti.
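
For reference, the network is roughly this shape – a sketch of an NVIDIA-style architecture with my tweaks, assuming Keras; the layer sizes and cropping here are illustrative, and the exact ones are in the repo above:

from keras.models import Sequential
from keras.layers import Lambda, Cropping2D, Conv2D, Dropout, Flatten, Dense

model = Sequential([
    Lambda(lambda x: x / 255.0 - 0.5, input_shape=(160, 320, 3)),  # normalize pixels
    Cropping2D(cropping=((50, 20), (0, 0))),                       # crop away sky and bonnet
    Conv2D(24, (5, 5), strides=(2, 2), activation='relu'),
    Conv2D(36, (5, 5), strides=(2, 2), activation='relu'),
    Conv2D(48, (5, 5), strides=(2, 2), activation='relu'),
    Conv2D(64, (3, 3), activation='relu'),
    Conv2D(64, (3, 3), activation='relu'),
    Conv2D(64, (3, 3), activation='relu'),   # the extra convolution layer
    Flatten(),
    Dropout(0.5),
    Dense(128, activation='relu'),           # slightly larger than the paper's Dense layers
    Dense(64, activation='relu'),
    Dense(16, activation='relu'),
    Dense(1),                                # steering angle
])
model.compile(optimizer='adam', loss='mse')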