Update: Release! ChatGPT clone: Open Assistant + RWKV

I’ve been working on an open source ChatGPT clone, as a small part of a large group:

There’s me – John Tapsell, ML engineer 🙂 Super proud to be part of that amazing team!

Open Assistant is, very roughly speaking, one possible front end and data-collection effort, and RWKV is one possible back end. The two projects can work together.

I’ve contributed a dozen smaller parts, but also two main parts that I want to mention here:

  • A React UI for comparing the outputs of different models. I call it: Open Assistant Model Comparer!
  • Different decoding schemes in JavaScript for RWKV-web – a way to run RWKV in the web browser, doing the actual model inference in the browser.

Open Assistant Model Comparer

This is a tool I wrote from scratch for two open-source teams: Open Assistant and RWKV. Behold its prettiness!

It’s hosted here: https://open-assistant.github.io/oasst-model-eval/ and github code here: https://github.com/Open-Assistant/oasst-model-eval

You pass it the URLs of JSON files in a specific format, each containing the inference output of a model for various prompts. It then collates these and presents them side by side so you can easily compare them. You can also drag-and-drop local files for comparison.

Update: Now with syntax highlighting, math support, markdown support, URL-linking and so much more:

And LaTeX:

And recipes:

Javascript RWKV inference

This is an especially cool project. RWKV is an RNN-based LLM. A team member, josephrocca, got it running in the browser – actually doing the inference in the browser – by running the Python code in Wasm (WebAssembly).

I worked on cleaning up the code, and making it a library suitable for other projects to use.

Project is here: https://github.com/josephrocca/rwkv-v4-web

Then I looked at inference decoding:

When we run inference on a model, at each step the model provides a confidence value for every possible token, and from those confidence values we pick a particular token before repeating for the next step.

We could pick the token with the highest confidence, aka greedy search. But this has various downsides – we will never pick tokens that are also valid (and perhaps rarer and more interesting), and in smaller LLMs it results in very repetitive output.

I implemented some alternatives:

  • Multinomial sampling: take the top few most confident outputs, apply a softmax to turn them into probabilities, and sample from that distribution. We can apply a temperature to make less-probable results more or less likely.
  • Repetition penalty: explicitly look back at the tokens we’ve generated so far, and penalize similar tokens to make them less likely to appear again. (A sketch of both schemes is below.)
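The actual implementation is in JavaScript inside rwkv-v4-web; the sketch below is my own Python/NumPy illustration of the same ideas. The exact penalty formula and the default parameters are illustrative, not the ones used in the library.

import numpy as np

def sample_token(logits, generated, temperature=1.0, top_p=0.8, repetition_penalty=1.2):
    """Pick the next token from raw model confidences (logits)."""
    logits = np.asarray(logits, dtype=np.float64).copy()

    # Repetition penalty (one simple variant): push down tokens we've already emitted.
    for token in set(generated):
        logits[token] -= np.log(repetition_penalty)

    # Temperature: < 1 sharpens the distribution, > 1 flattens it.
    logits = logits / temperature

    # Softmax to turn confidences into probabilities.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p (nucleus) filtering: keep the smallest set of tokens whose
    # cumulative probability exceeds top_p, then renormalize.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum()

    # Multinomial sampling from the filtered distribution.
    return int(np.random.choice(keep, p=kept_probs))

Greedy search is just the degenerate case np.argmax(logits); the defaults above (top_p=0.8, temperature=1) match the settings used in the comparison further down.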

The result:

The screenshots are from a pretty small model. Here’s a comparison from a larger RWKV (RNN LLM) model:

Greedy (no penalty):

"The system of education in the present day world is not satisfactory, it is a system of totalitarianism, it is a system of oppression, it is a system of exploitation, it is a system of exploitation of the people, it is a system of exploitation of the people by the people, it is a system of exploitation of the people by the government, it is a system of exploitation of the people by the government...(repeated text)"

Greedy+Repetition penalty:

"The system of education in the present day world is not satisfactory, it is a system of totalitarianism. The system of modern education is based on the theory of class struggle and the aim of this theory is to control and enslave people. The purpose of this system is to make people submissive and docile. It makes them accept all kinds of injustices, crimes and atrocities as normal conditions. It forces people to accept their own degradation as a result of their own acts. This kind of mentality makes it impossible for any person to think independently, for him to be critical about his own condition, his environment or his culture. "

Sampling top-p=0.8 temperature=1:

"and say something about schools and their role in today's world. I will try to show there are shortcomings in both the system and people. Some flaws of the system are as follows: There are many unruly and disobedient people, especially males, who cannot be controlled. It is impossible for those who lack common sense and insight to appreciate the significance of any decision made with a view to the overall good, such as the deployment of new technologies and machinery. Uneducated persons cannot make the right decisions. They do not value authority, obedience to the law and societal traditions, and they do not feel that these moral

Most people preferred the Greedy + Repetition penalty output.

Custom-trained GPT-3 / GPT-4 – Vector database with vector search

Say you have a bunch of company-specific documents/pages/pdfs that you want a user to be able to query with GPT-3.

Without actually pre-training GPT-3, you can get an effect that is pretty close by using embeddings.

Note: Code is available on github: https://github.com/johnflux/gpt3_search

Here’s how:

  1. Ahead of time, generate an embedding vector for each of your company-specific documents/pages/pdfs. You can even break up large documents and generate an embedding vector for each chunk individually.

    For example for a medical system:

    Gastroesophageal reflux disease (GERD) occurs when stomach acid repeatedly flows back into the tube connecting your mouth and stomach (esophagus). This backwash (acid reflux) can irritate the lining of your esophagus.
    Many people experience acid reflux from time to time. However, when acid reflux happens repeatedly over time, it can cause GERD.
    Most people are able to manage the discomfort of GERD with lifestyle changes and medications. And though it’s uncommon, some may need surgery to ease symptoms.

    ↦ [0.22, 0.43, 0.21, 0.54, 0.32……]
  2. When a user asks a question, generate an embedding vector for that too:
    “I’m getting a lot of heartburn.” or
    “I’m feeling a burning feeling behind my breast bone”
    or
    “I keep waking up with a bitter tasting liquid in my mouth”

    ↦ [0.25, 0.38, 0.24, 0.55, 0.31……]
  3. Compare the query vector against all the document vectors and find which document vector is the closest (cosine or Manhattan distance is fine). Show that to the user:

    User: I'm getting a lot of heartburn.
    Document: Gastroesophageal reflux disease (GERD) occurs when stomach acid repeatedly flows back into the tube connecting your mouth and stomach (esophagus). This backwash (acid reflux) can irritate the lining of your esophagus.
    Many people experience acid reflux from time to time. However, when acid reflux happens repeatedly over time, it can cause GERD.
    Most people are able to manage the discomfort of GERD with lifestyle changes and medications. And though it's uncommon, some may need surgery to ease symptoms.


    Pass that document and query to GPT-3, and ask it to reword it to fit the question:
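    Here’s a rough sketch of that final step. This is my own illustration in Python (the actual implementation below is in Node), and the prompt wording and model name are placeholders, not taken from the project:

import openai

openai.api_key = 'YOUR-API-KEY'

def answer_from_document(question, document):
    # Ask the model to answer the question using only the retrieved document.
    prompt = ("Answer the user's question using only the document below.\n\n"
              "Document:\n" + document + "\n\n"
              "Question: " + question + "\nAnswer:")
    response = openai.Completion.create(
        model='text-davinci-003',  # illustrative model choice, not necessarily the one used
        prompt=prompt,
        max_tokens=256,
        temperature=0.2,
    )
    return response['choices'][0]['text'].strip()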

Implementation code

Note: Code is available on github: https://github.com/johnflux/gpt3_search


Generate embeddings

Here’s an example in node javascript:

const { Configuration, OpenAIApi } = require('openai');
const fs = require('fs');

const configuration = new Configuration({
  apiKey: 'YOUR-API-KEY',
});
const openai = new OpenAIApi(configuration);

/** Generates embeddings for all guides in the database
 *  Then run:
 *   node generate_embeddings.js ./prod_backup_20230119.json > embeddings.json
 *   node embeddings_search.js ./embeddings.json
 */

var filenames = process.argv.slice(2);

if (filenames.length === 0) {
  console.log('Usage: node generate_embeddings_from_txt_files.js *.txt > embeddings.json');
  process.exit(1);
}

async function run() {
  const embeddings = [];
  for (const filename of filenames) {
    const input = fs.readFileSync(filename, 'utf8');

    try {
      const response = await openai.createEmbedding({ input, model: 'text-embedding-ada-002' });
      const output = { filename, embedding: response.data.data[0].embedding };
      embeddings.push(output);
      console.error('Success:', filename);
    } catch (e) {
      console.error('Failed:', filename);
    }
  }
  console.log(JSON.stringify(embeddings, null, 2));
}

run();

This outputs a file like:

[
{"filename": "gerd.txt", "embedding": [-0.021665672, 0.00097308296, 0.027932819, -0.027959095,....<snipped>]},
....
]

Do search

const { Configuration, OpenAIApi } = require('openai');

const configuration = new Configuration({
  apiKey: 'YOUR-API-KEY',
});
const openai = new OpenAIApi(configuration);

var jsonfilename = './embeddings.json';
const searchTerm = process.argv.slice(2).join(' ');

if (!searchTerm) {
  console.log('Usage: node embeddings_search.js diabetes referral guide');
  process.exit(1);
}

const data = require(jsonfilename);

async function run() {
  const response = await openai.createEmbedding({ input: searchTerm, model: 'text-embedding-ada-002' });
  const embedding = response.data.data[0].embedding;
  const results = getScores(data, embedding);

  console.log(results.map((a) => `${a.score.toFixed(2)}: ${a.doc.filename}`).join('\n'));
}
run();

function getScores(data, embedding) {
  const results = data
    .map((doc) => ({ score: cosinesim(doc.embedding, embedding), doc }))
    .sort((a, b) => b.score - a.score)
    .filter((doc, index) => index < 3 || (index < 10 && doc.score > 0.7));

  const filenames = results.map((r) => r.doc.filename);
  const results_uniq = results.filter((x, index) => filenames.indexOf(x.doc.filename) === index);
  return results_uniq;
}

function cosinesim(A, B) {
  var dotproduct = 0;
  var mA = 0;
  var mB = 0;
  for (let i = 0; i < A.length; i++) {
    dotproduct += A[i] * B[i];
    mA += A[i] * A[i];
    mB += B[i] * B[i];
  }
  mA = Math.sqrt(mA);
  mB = Math.sqrt(mB);
  var similarity = dotproduct / (mA * mB);
  return similarity;
}

You can now run like:

node embeddings_search.js I am getting a lot of heartburn.

And it will output the filename of the closest document. You can display this directly to the user, or pass it to GPT-3 to fine-tune the output.

Learning Machine Learning

Heh, funny title.

I became interested in machine learning while working for Nokia.  I worked on Nokia’s Z Launcher application for Android.  You could scribble a letter (or several), and it would recognize it and search for it.  The app is available for download in the Play Store.


I worked on the Nokia Z Launcher’s handwriting recognition

Specifically I was tasked with optimizing the speed of the recognition.  I don’t know if I can state any specifics on how the character recognition was done, but I will say that I managed to increase the speed of the recognition a hundred fold.

This recognition was actually a relatively simple task compared to modern-day deep neural networks, but it really whetted my appetite to understand more.

When AlphaGo beat Lee Sedol, I knew that I simply must understand Deep Neural Networks.

Below is my journey in understanding, along with my reflective thoughts:

  1. I started off naively implementing an LSTM neural network without any training.
    I wanted to have a crude appreciation of the problems before I read about the solutions.  My results are documented in my previous post here, but looking back it’s quite embarrassing to read.  I don’t regret doing this at all however.
  2. Next I did the Andrew Ng Coursera Machine Learning course.
    This is an 11 week course in the fundamentals.  I completed it along with all the coursework, but by the end I felt that my knowledge was about 20 years out of date.  It was really nice to get the fundamentals, but none of the modern discoveries were discussed at all.  Nothing about LSTM, or Markov Chains, or dropouts, etc.
    The exercises are also all done in Matlab/Octave, which has fallen out of favour and uses a lot of support code.  I certainly didn’t feel comfortable in implementing a neural network system from scratch after watching the course.


    Passed the Coursera Machine learning course with 97.6% score.

    The lecturer, Andrew Ng, was absolutely awesome.  My complaints, really, boil down to that I wish the course was twice as long and that I could learn more from him!  I now help out in a machine learning chat group and find that most of the questions that people ask about TensorFlow, Theano etc are actually basics that are answered very well by Andrew Ng’s course.  I constantly direct people to the course.

  3. Next, Reinforcement Learning.  I did a 4 week Udacity course UD820.
    This was done by a pair of teachers who constantly joke with each other.  At first I thought it would annoy me, but I actually really enjoyed the course – they work really well together.  They are a lot more experienced and knowledgeable than they pretend to be.  (They take it in turns to be ignorant, to play the role of a student.)
  4. I really wanted to learn about TensorFlow, so I did a 4 week Udacity Course UD730
    Again I did all the coursework for it.  I thought the course was pretty good, but it really annoyed me that each video was only about 2 minutes long, resulting in a 30 second pause every 2 minutes while it loaded the next video.  Most frustrating.
  5. At this point, I started reading papers and joined up for the Visual Doom AI competition.
    I have come quite far in my own Visual Doom AI implementation, but the vast majority of the work is the ‘setup’ required.  For example, I had to fix bugs in their doom engine port to get the built-in AI to work.  And it was a fair amount of work to get the game to run nicely with TensorFlow, with mini-batch training, testing and verification stages.
    I believe I understand how to properly implement a good AI for this, with the key coming from guided policy search, in a method similar to that pioneered by Google for robotic control.  (Link is to a keynote at the International Conference on Learning Representations 2016.)   The idea is to hack the engine to give me accurate internal data of the positions of enemies, walls, health, etc. that I can use to train a very simple ‘teacher’.  Then use that teacher to supervise a neural network that has only visual information, thus allowing us to train a deep neural network with back-propagation.  By alternating between teacher and student, we can converge upon perfect solutions.  I hope to write more about this in a proper blog post.
  6. The International Conference on Learning Representations (ICLR) 2016
    Videos were absolutely fascinating, and were surprisingly easy to follow with the above preparation.
  7. I listened to the entire past two years of The Talking Machines podcast.  It introduced many areas that I was completely unfamiliar with, and highlighted many things that I knew about but just didn’t realise were important.
  8. I did the Hinton Coursera course on Neural Networks for Machine Learning, which perfectly complemented the Andrew Ng Coursera Course.  I recommend these two courses the most, for the foundations.  It is about 5 years out of date, but is all about the fundamentals.
  9. I did the Computational Neuroscience course.  The first half was interesting and was about, well, neuroscience.  But the math lectures were delivered in a slow monotone that really put me straight to sleep.  The second half of the course was just a copy of the Andrew Ng course (they even said so), so I just skipped all the lectures and did the exams with no problems.  I really liked that they let you do the homework in Python.  It is easier in Matlab, but I really wanted to improve my Python data-science skills.  The exams took way, way longer than their predicted times.  They would have 20 questions, requiring you to download, run and modify 4 programs, and say that you should be able to do it in an hour!
  10. I completed the IBM Artificial Intelligence class.  This is the second most expensive AI class that I’ve done, at $1600 plus $50 for the book.  I was really not at all impressed by it.  Most of the videos are a minute long or less, which means I spent half the time waiting for the next video to load.  The main lecturer gets on my nerves – he wears a brightly colored Google Glass for no apparent reason other than to show it off, and speaks in the most patronizing way.  You get praised continually for signing up to the course at the start.  It’s overly specialized.  You use their specific libraries, which aren’t at all production ready, and you use toy data with a tiny number of training examples (e.g. trying to train the gesture ‘toy’ with a single training example!).  Contrast this against the Google Self-Driving course:
  11. The Google Self-Driving course, which is $2400.  This is much better than the IBM course.  The main difference is that they’ve made the theme self-driving cars, but you do it all in TensorFlow and you learn generic techniques that could be applied to any machine learning field.  You quickly produce code that could easily be made production ready.  You work with large realistic data and large realistic neural networks, and they teach you to use Amazon AWS servers to train with.  The result is code that can be (and literally is!) deployed to a real car.

Tensorflow for Neurobiologists

I couldn’t find anyone else who has done this, so I made this really quick guide.  This uses TensorFlow, which is complete overkill for this specific problem, but I figure a simple example is much easier to follow.

Install and run the Python 3 notebook and TensorFlow.  On Linux, as a user, without using sudo:

$ pip3 install --upgrade --user ipython[all] tensorflow matplotlib
$ ipython3  notebook

Then in the notebook window, do New->Python 3

Here’s an example I made earlier. You can download the latest version on github here: https://github.com/johnflux/Spike-Triggered-Average-in-TensorFlow

Spike Triggered Average in TensorFlow

The data is an experimentally recorded set of spikes recorded from the famous H1 motion-sensitive neuron of the fly (Calliphora vicina) from the lab of Dr Robert de Ruyter van Steveninck.

This is a complete rewrite of non-tensorflow code in the Coursera course Computational Neuroscience by University of Washington. I am thoroughly enjoying this course!

Here we use TensorFlow to find out how the neuron is reacting to the data, to see what causes the neuron to trigger.

%matplotlib inline
import pickle
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()

FILENAME = 'data.pickle'

with open(FILENAME, 'rb') as f:
    data = pickle.load(f)

stim = tf.constant(data['stim'])
rho = tf.constant(data['rho'])
sampling_period = 2 # The data was sampled at 500 Hz, i.e. one sample every 2 ms
window_size = 150 # Use a 300 ms sliding window = 300 / sampling_period samples

We now have our data loaded into tensorflow as a constant, which means that we can easily ‘run’ our tensorflow graph. For example, let’s examine stim and rho:

print("Spike-train time-series =", rho.eval(),
      "\nStimulus time-series     =", stim.eval())
Spike-train time-series = [0 0 0 ..., 0 0 0] 
Stimulus time-series    = [-111.94824219  -81.80664062 
    10.21972656 ...,  9.78515625 24.11132812 50.25390625]

rho is a binary array where a 1 indicates a spike. Let’s turn that into an array of indices of where the value is 1, but ignoring the first window_size elements.

Note: We can use the [] and + operations on a tensorflow variable, and it correctly adds those operations to the graph. This is equivalent to using the tf.slice and tf.add operations.

spike_times = tf.where(tf.not_equal(rho[window_size:-1], 0)) + window_size
print("Time indices where there is a spike:\n", spike_times.eval())
Time indices where there is a spike:
 [[   158]
 [   160]
 [   162]
 ..., 
 [599936]
 [599941]
 [599947]]
def getStimWindow(index):
    i = tf.cast(index, tf.int32)
    return stim[i-window_size+1:i+1]
stim_windows = tf.map_fn(lambda x: getStimWindow(x[0]), spike_times, dtype=tf.float64)
spike_triggered_average = tf.reduce_mean(stim_windows, 0).eval()
print("Spike triggered averaged is:", spike_triggered_average[0:5], "(truncated)")
Spike triggered averaged is: [-0.33083048 -0.29083503 -0.23076012 -0.24636984 -0.10962767] (truncated)

Now let’s plot this!

time = (np.arange(-window_size, 0) + 1) * sampling_period
plt.plot(time, spike_triggered_average)
plt.xlabel('Time (ms)')
plt.ylabel('Stimulus')
plt.title('Spike-Triggered Average')

plt.show()

[Plot: the spike-triggered average – stimulus vs. time (ms)]

It’s… beautiful!

What we are looking at here is that our neuron is doing a leaky integration of the stimulus, and when that integration reaches a certain value, it triggers.

Do see the github repo for full source: https://github.com/johnflux/Spike-Triggered-Average-in-TensorFlow

Update: I was curious how much noise there was. Here’s a plot with 1 standard deviation in light blue:

mean, var = tf.nn.moments(stim_windows,axes=[0])
plt.errorbar(time, spike_triggered_average, yerr=tf.sqrt(var).eval(), ecolor="#0000ff33")

[Plot: the spike-triggered average with a ±1 standard deviation band]

Yikes!  This is why the input signal MUST be Gaussian, and why we need lots of data to average over.  For this, we’re averaging over 53,583 windows.

Coursera Neural Networks for Machine Learning

I’ve completed another 4 month Coursera Machine Learning course. The first was Andrew Ng’s Machine Learning course, and the second is Geoffrey Hinton’s Neural Networks for Machine Learning course.  I also completed a 2 month Coursera course in Computational Neuroscience.


Course certificate

Both courses use Matlab, and this second one was significantly more work, and buggier. Several of the exam questions had key bits of information accidentally left out, and referred to images that didn’t exist. Only by reading the user forums could the missing information be found. However, I was one of the first people to do this course, so bugs are to be expected.

Technical issues aside, I enjoyed the course a lot. It’s approximately 5 years behind the current state of the field, but the information is still very relevant.

Edit: Update on 2017/2/5 – I also did the 2 month Computational Neuroscience course:


I enjoyed the first half of the course, which was all new and interesting to me.  But the second half was just a basic version of the Andrew Ng course.

 

Dilation Animation

There is a neat git repository that generates animations of convolutions with various different settings. But unfortunately it didn’t show dilation – which is one of my favourite Convolutional Neural Network techniques.

Edit: My changes are now merged

So I added support.

And here’s what my patch generates:  The bottom layer is the input layer, the top is the output layer.


No stride, no padding, dilation of 2, kernel size of 3

The idea with dilation is that at each layer you double the dilation size.  That way your receptive field grows exponentially with each layer, allowing you to deal with large images without pooling, and thus keep pixel-level sharpness.
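As a back-of-the-envelope illustration (my own sketch, not part of the animation repository), here is how the receptive field of a stack of 3×3, stride-1 convolutions grows when the dilation doubles at each layer:

def receptive_field(num_layers, kernel_size=3):
    # Each layer adds (kernel_size - 1) * dilation to the receptive field,
    # with dilation = 1, 2, 4, 8, ... at successive layers.
    rf = 1
    for layer in range(num_layers):
        rf += (kernel_size - 1) * 2 ** layer
    return rf

for n in range(1, 6):
    print(n, 'layers ->', receptive_field(n), 'pixels wide')
# 1 -> 3, 2 -> 7, 3 -> 15, 4 -> 31, 5 -> 63: exponential growth, with no pooling needed.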

Second order back propagation

This is a random idea that I’ve been thinking about. A reader messaged me to say that this looks similar to online L-BFGS. To my inexperienced eyes, I can’t see the similarity myself, but I’m still somewhat of a beginner.


Yes, I chose a cat example simply so that I had an excuse to add cats to my blog.

Say we are training a neural network to take images of animals and classify the image as being an image of a cat or not a cat.

You would train the network to output, say, y_{cat} = 1 if the image is that of a cat.

To do so, we can gather some training data (images labelled by humans), and for each image we see what our network predicts (e.g. “I’m 40% sure it’s a cat”).  We compare that against what a human says (“I’m 100% sure it’s a cat”), find the squared error (“We’re off by 0.6, so our squared error is 0.6^2”), and adjust each parameter, w_i, in the network so that it slightly decreases the error (\Delta w_i  = -\alpha \partial E/\partial w_i).  And then repeat.

It’s this adjustment of each parameter that I want to rethink.  The above procedure is Stochastic Gradient Descent (SGD) – we adjust w_i to reduce the error on our training set (I’m glossing over overfitting, minibatches, etc.).

Key Idea

This means that we are also looking for a local minimum, i.e. once trained, we want the property that if we vary any parameter w_i by a small amount, the expected squared error E should increase.

My idea is to encode this into the SGD update. To find a local minimum for a particular training image, we want:

\dfrac{\partial y_{cat}}{\partial w_i} = 0

\dfrac{\partial^2y_{cat}}{\partial w_i^2}  < 0 (or if it equals 0, we need to consider the third differential etc).

Let’s concentrate on just the first criteria for the moment. Since we’ve already used the letter E to mean the half squared error of y, we’ll use F to be the half squared error of \dfrac{\partial y_{cat}}{\partial w_i}.

So we want to minimize the half squared error F:

F = \dfrac{1}{2}\left(\dfrac{\partial y_{cat}}{\partial w_i}\right)^2

So to minimize we need the gradient of this error:

\dfrac{\partial F}{\partial w_i} = \dfrac{1}{2} \dfrac{\partial}{\partial w_i} \left(\dfrac{\partial y_{cat}}{\partial w_i}\right)^2 = 0

Applying the chain rule:

\dfrac{\partial F}{\partial w_i} = \dfrac{\partial y_{cat}}{\partial w_i} \dfrac{\partial^2 y_{cat}}{\partial w_i^2} = 0

SGD update rule

And so we can modify our SGD update rule to:

\Delta w_i = -\alpha \partial E/\partial w_i - \beta \dfrac{\partial y_{cat}}{\partial w_i} \dfrac{\partial^2 y_{cat}}{\partial w_i^2}

Where \alpha and \beta are learning rate hyperparameters.
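To make the rule concrete, here is a toy numerical sketch, entirely my own, for a one-parameter “network” y = sigmoid(w·x) where the derivatives can be written analytically. It only illustrates the update; it doesn’t test whether the extra term helps.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def derivatives(w, x, t=1.0):
    # Toy model: y_cat = sigmoid(w * x), target t = 1 ("it's a cat"), E = 0.5 * (t - y)^2
    y = sigmoid(w * x)
    dy_dw = x * y * (1 - y)                       # dy/dw
    d2y_dw2 = x * x * y * (1 - y) * (1 - 2 * y)   # d^2y/dw^2
    dE_dw = -(t - y) * dy_dw                      # dE/dw
    return y, dE_dw, dy_dw, d2y_dw2

w, x = 0.1, 2.0
alpha, beta = 0.5, 0.5   # learning-rate hyperparameters (arbitrary values for the demo)
for step in range(5):
    y, dE_dw, dy_dw, d2y_dw2 = derivatives(w, x)
    # Modified update: delta_w = -alpha * dE/dw - beta * (dy/dw) * (d^2y/dw^2)
    w += -alpha * dE_dw - beta * dy_dw * d2y_dw2
    print('step %d: y = %.3f, w = %.3f' % (step, y, w))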

Conclusion

We finished with a new SGD update rule. I have no idea if this actually will be any better, and the only way to find out is to actually test. This is left as an exercise for the reader 😀

Best-fitting to a Cumulative distribution function in python TensorFlow

I wanted to find a best-fit curve for some data points when I know that the true curve I’m predicting is a parameter-free Cumulative Distribution Function (CDF).  I could just do a linear regression on the points, but the resulting function, y, might not have the properties that we desire from a CDF, such as:

  • Monotonically increasing
  • y(x) \xrightarrow{x\to\infty} 1      i.e. That it tends to 1 as x approaches positive infinity
  • y(x) \xrightarrow{x\to-\infty} 0      i.e. That it tends to 0 as x approaches negative infinity

First, let’s deal with the last two properties.  To do this, I use the sigmoid function to create a new variable, x’:

x'(x) = \textrm{sigmoid}(x)

This has the nice property that x'(x) \xrightarrow{x\to\infty} 1 and x'(x) \xrightarrow{x\to-\infty} 0

So now we can find a best fit polynomial on x'.

So:

y(x') = a + bx' + cx'^2 + dx'^3 + .. + zx'^n

But with the boundary conditions that:

y(0) = 0 and y(1) = 1

Which, solving for the boundary conditions, gives us:

y(0) = 0 = a
b = (1-c-d-e-...-z)

Which simplifies our function to:

y(x') = (1-c-d-e-..-z)x' + cx'^2 + dx'^3 + ex'^4 .. + zx'^n

Implementing this in tensorflow using stochastic gradient descent (code is below) we get:


20 data samples


100 data samples

(The equation in the graph titles is wrong.  It should be x' = sigmoid(x), then y(x') = ax' + …  I was too lazy to regenerate the graphs, sorry.)

Unfortunately this curve doesn’t have one main property that we’d like – it’s not monotonically increasing – it goes above 1. I thought about it for a few hours and couldn’t think of a nice solution.

I’m sure all of this has been solved properly before, but with a quick google I couldn’t find anything.

Edit: I’m updating this 12 days later. I have been thinking in the background about how to enforce that my function be monotonic (i.e. always increasing/decreasing). I was certain that I was just being stupid and that there was a simple answer. But by complete coincidence, I watched the YouTube video Breakthroughs in Machine Learning – Google I/O 2016, in which the last speaker mentioned this restriction.  It turns out that this is a very difficult problem – with the first solutions having a time complexity of O(N^4):

I didn’t understand exactly what Google’s solution for this is, but they reference their paper on it:  Monotonic Calibrated Interpolated Look-up Tables, JMLR 2016.  It seems that I have some light reading to do!

#!/usr/bin/env python3

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from scipy.stats import norm

# Create some random data as a binned normal function
plt.ion()
n_observations = 20
fig, ax = plt.subplots(1, 1)
xs = np.linspace(-3, 3, n_observations)

ys = norm.pdf(xs) * (1 + np.random.uniform(-1, 1, n_observations))

ys = np.cumsum(ys)

ax.scatter(xs, ys)
fig.show()
plt.draw()

highest_order_polynomial = 3

# We have our data now, so on with the tensorflow
# Setup the model graph

# Our input is an arbitrary number of data points (that's what the 'None' dimension means)
# and each input has just a single value which is a float

X = tf.placeholder(tf.float32, [None])
Y = tf.placeholder(tf.float32, [None])

# Now, we know data fits a CDF function, so we know that
# ys(-inf) = 0   and ys(+inf) = 1

# So let's set:

X2 = tf.sigmoid(X)

# So now X2 is between [0,1]

# Let's now fit a polynomial like:
#
#   Y = a + b*(X2) + c*(X2)^2 + d*(X2)^3 + .. + z(X2)^(highest_order_polynomial)
#
# to those points.  But we know that since it's a CDF:
#   Y(0) = 0 and y(1) = 1
#
# So solving for this:
#   Y(0) = 0 = a
#   b = (1-c-d-e-...-z)
#
# This simplifies our function to:
#
#   y = (1-c-d-e-..-z)x + cx^2 + dx^3 + ex^4 .. + zx^(highest_order_polynomial)
#
# i.e. we have highest_order_polynomial-1 weights
W = tf.Variable(tf.zeros([highest_order_polynomial-1]))

# Now set up our equation:
Y_pred = tf.Variable(0.0)
b = (1 - tf.reduce_sum(W))
Y_pred = tf.add(Y_pred, b * X2)  # X2 is already sigmoid(X), so the linear term is b * X2

for n in range(2, highest_order_polynomial+1):
    Y_pred = tf.add(Y_pred, tf.multiply(tf.pow(X2, n), W[n-2]))

# Loss function measure the distance between our observations
# and predictions and average over them.
cost = tf.reduce_sum(tf.pow(Y_pred - Y, 2)) / (n_observations - 1)

# Regularization if we want it.  Only needed if we have a large value for the highest_order_polynomial and are worried about overfitting
# It's also not clear to me if regularization is going to be doing what we want here
#cost = tf.add(cost, regularization_value * (b + tf.reduce_sum(W)))

# Use stochastic gradient descent to optimize W
# using adaptive learning rate
optimizer = tf.train.AdagradOptimizer(100.0).minimize(cost)

n_epochs = 10000
with tf.Session() as sess:
    # Here we tell tensorflow that we want to initialize all
    # the variables in the graph so we can use them
    sess.run(tf.initialize_all_variables())

    # Fit all training data
    prev_training_cost = 0.0
    for epoch_i in range(n_epochs):
        sess.run(optimizer, feed_dict={X: xs, Y: ys})

        training_cost = sess.run(
            cost, feed_dict={X: xs, Y: ys})

        if epoch_i % 100 == 0 or epoch_i == n_epochs-1:
            print(training_cost)

        # Allow the training to quit if we've reached a minimum
        if np.abs(prev_training_cost - training_cost) < 0.0000001:
            break
        prev_training_cost = training_cost
    print("Training cost:", training_cost, " after epoch:", epoch_i)
    w = W.eval()
    b = 1 - np.sum(w)
    equation = "x' = sigmoid(x), y = " + str(b) + "x' + " + " + ".join("{}x'^{}".format(val, idx) for idx, val in enumerate(w, start=2))
    print("For function:", equation)
    pred_values = Y_pred.eval(feed_dict={X: xs}, session=sess)
    ax.plot(xs, pred_values, 'r')
    plt.title(equation)

fig.show()
plt.draw()
#ax.set_ylim([-3, 3])
plt.waitforbuttonpress()

Logistic Regression and Regularization

Tons has been written about regularization, but I wanted to see it for myself to try to get an intuitive feel for it.

I loaded a dataset from Google into Python (a set of images of letters) and implemented a double for-loop to run a logistic regression with different training-set sizes and different regularization parameters.  (The value shown in the graph is actually 1/regularization.)

 

from sklearn import linear_model
import matplotlib.pyplot as plt
import numpy as np

# Assumes train_dataset, train_labels, valid_dataset, valid_labels are already loaded.
def doLogisticRegression(trainsize, regularizer):
    fitmodel = linear_model.LogisticRegression(C=regularizer)
    train_datasubset = train_dataset[0:trainsize,:,:].reshape(trainsize, -1)
    fitmodel.fit(train_datasubset, train_labels[0:trainsize])
    return [fitmodel.score(train_datasubset, train_labels[0:trainsize]),
            fitmodel.score(valid_dataset.reshape(valid_dataset.shape[0], -1), valid_labels)
            ]
print(train_dataset.shape[0])
trainsizes = [50,200,300,400,500,600,700,800,900,1000,2000,3000,4000,5000, 10000, 50000, 100000, 200000];
plt.xscale('log')
color = 0
plots = []
for regularizer in [1, 0.1, 0.01, 0.001]:
    results = np.array([doLogisticRegression(x, regularizer) for x in trainsizes])
    dashedplot = plt.plot(trainsizes, results[:,1], '--', label=("r:" + str(regularizer)))
    plt.plot(trainsizes, results[:,0], c=dashedplot[0].get_color(), label=("r:" + str(regularizer)))
    plots.append(dashedplot[0])
plt.legend(loc='best', handles=plots)

The result is very interesting. The solid line is the training set accuracy, and the dashed line is the validation set accuracy. The vertical axis is the accuracy rate (percentage of images recognized as the correct letter) and the horizontal axis is the number of training examples.


Image to letter recognition accuracy against training size, for various values of r = 1/regularization_factor.  Solid line is training set accuracy, dotted line is validation set accuracy.

First, I find it fascinating that a pure logistic regression can recognize letters with 82% accuracy. If you added in spell checking, and ran this over an image, you could probably get a pretty decent OCR system, from purely a logistic regression.

Second, it’s interesting to see the effect of the regularization term. Below about 500 training examples, the regularization term only hurts the algorithm (a value of 1 being the least regularization tested). At around 500 training examples, though, strong regularization really helps. As the number of training examples increases, regularization makes less and less of an impact, and everything converges at around 200,000 training samples.

It’s quite clear at this point, at 200,000 training samples, that we are unlikely to get more improvements with more training samples.

A good rule of thumb that I’ve read is that you need approximately 50 training samples per feature. Since we have 28×28 = 784 features, this would be about 40,000 training samples, which is actually only a couple of percent from our peak performance at 200,000 training samples (which is 200,000/784 ≈ 255 training samples per feature).

At this point, we could state fairly confidently that we need to improve the model if we want to improve performance.

Stochastic Gradient Descent

I reran with the same data but with stochastic gradient descent (batch size 128) and no regularization.  The accuracy (after 9000 runs) on the validation set was about the same as the best case with the logistic regression (81%), but took only a fraction of the time to run.  It took just a few minutes, versus a few hours for the logistic regression.

Stochastic Gradient Descent with 1 hidden layer

I added a hidden layer (of size 1024) and reran. The accuracy was only marginally better (84%).  Doubling the number of runs increased this accuracy to 86%.
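For reference, here is a minimal sketch of that network, my own reconstruction in the same TensorFlow 1.x style as the code elsewhere in this post. The dataset variables are assumed to be already loaded (as in the logistic-regression snippet above), with labels as integer class indices; the 10-class count and the learning rate are assumptions.

import numpy as np
import tensorflow as tf

image_size, num_labels, hidden_size, batch_size = 28, 10, 1024, 128

X = tf.placeholder(tf.float32, [None, image_size * image_size])
Y = tf.placeholder(tf.int64, [None])

# One hidden layer of 1024 ReLU units, then a softmax output layer.
W1 = tf.Variable(tf.truncated_normal([image_size * image_size, hidden_size], stddev=0.1))
b1 = tf.Variable(tf.zeros([hidden_size]))
W2 = tf.Variable(tf.truncated_normal([hidden_size, num_labels], stddev=0.1))
b2 = tf.Variable(tf.zeros([num_labels]))

hidden = tf.nn.relu(tf.matmul(X, W1) + b1)
logits = tf.matmul(hidden, W2) + b2
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=Y, logits=logits))
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), Y), tf.float32))
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(9000):
        # Minibatch SGD with a batch size of 128.
        idx = np.random.randint(0, train_dataset.shape[0], batch_size)
        sess.run(optimizer, feed_dict={X: train_dataset[idx].reshape(batch_size, -1),
                                       Y: train_labels[idx]})
    print('Validation accuracy:', sess.run(accuracy, feed_dict={
        X: valid_dataset.reshape(valid_dataset.shape[0], -1), Y: valid_labels}))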

 

 

Exploiting internal neural network symmetries

Neural networks have many internal symmetries.  A symmetry is an operation that you can apply to the weights such that for all inputs the outputs are the same.

The trivial symmetry is the identity. If you multiply all the weights in a neural network by 1, then the output remains unchanged.

We can also move nodes around by swapping two nodes in any hidden layer, along with their weights.  For example:

[Diagram: two hidden nodes swapped, together with their incoming and outgoing weights]
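A quick numerical check of this swap symmetry (a sketch of my own, with a tiny random ReLU network): swapping two hidden units, together with their incoming and outgoing weights, leaves the output unchanged for every input.

import numpy as np

def forward(x, W1, b1, W2, b2):
    # One hidden ReLU layer followed by a linear output layer.
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 5, 3
W1 = rng.normal(size=(n_in, n_hidden)); b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=(n_hidden, n_out)); b2 = rng.normal(size=n_out)

# Swap hidden units 0 and 1: swap the corresponding columns of W1 and entries of b1,
# and the corresponding rows of W2.
perm = np.arange(n_hidden)
perm[[0, 1]] = perm[[1, 0]]
W1s, b1s, W2s = W1[:, perm], b1[perm], W2[perm, :]

x = rng.normal(size=(10, n_in))
print(np.allclose(forward(x, W1, b1, W2, b2), forward(x, W1s, b1s, W2s, b2)))  # True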

But what other symmetries are there?  How is this useful?

Well, by itself, it’s not.  But let’s get hand-wavy.  Let’s say that we are using a ReLU activation function.  What this means is that at our hidden node, if the total value coming in is greater than 0, then the output at that node is that same value.  If it’s negative, then the output is 0.

Let’s hand wave like crazy, and consider only the case where the node is activated (i.e. the total value going into the node is >= 0).  The output is equal to the input, and in this region it is linear.

So, ignoring the non-linear case, we can find an interesting symmetry.  We can project to the basis   [1,1]/\sqrt{2} and [1,-1]/\sqrt{2}.  Or in pictures:

[Diagram: the two nodes’ weights re-expressed in the basis [1,1]/√2 and [1,-1]/√2]

So, ignoring the non-linearity at the hidden layer, we find that we can actually mix the weights in a more interesting fashion.

Now, what can we do with this?

Honestly, I’m not sure.  One idea I had is if you’ve optimized a neural net and reached the peak that your training set can give you, you could apply this swap operation to random nodes and retrain and see if it helps.  The effect is that you’re leaving the weights the same, but changing when the node activates.  Hopefully I’ll have a better idea later.