This is a random idea that I’ve been thinking about. A reader messaged me to say that this looks similar to online L-BFGS. To my inexperienced eyes, I can’t see the connection myself, but I’m still somewhat of a beginner.

Yes, I chose a cat example simply so that I had an excuse to add cats to my blog.

Say we are training a neural network to take images of animals and classify the image as being an image of a cat or not a cat.

You would train the network to output, say, $y_{cat} = 1$ if the image is that of a cat.

To do so, we can gather some training data (images labelled by humans), and for each image we see what our network predicts (e.g. “I’m 40% sure it’s a cat”). We compare that against what a human says (“I’m 100% sure it’s a cat”), find the squared error (“We’re off by 0.6, so our squared error is $0.6^2$”), and adjust each parameter, $w_i$, in the network so that it slightly decreases the error ($\Delta w_i = -\alpha \partial E/\partial w_i$). And then repeat.
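As a minimal sketch of that loop, here is a toy version in plain Python. Everything in it is an assumption made for illustration: the "network" is a single sigmoid neuron, the names `predict` and `sgd_step` are made up, and the three feature values stand in for an image. It uses the half squared error $E = \frac{1}{2}(y_{cat} - 1)^2$ so that the gradient has no stray factor of 2.

```python
import math

def predict(weights, features):
    """Hypothetical one-neuron 'network': sigmoid of a weighted sum."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(weights, features, target, alpha=0.5):
    """One SGD step on the half squared error E = 0.5 * (y - target)**2."""
    y = predict(weights, features)
    # dE/dw_i = (y - target) * dy/dw_i, and for a sigmoid dy/dw_i = y * (1 - y) * x_i
    grads = [(y - target) * y * (1.0 - y) * x for x in features]
    # Delta w_i = -alpha * dE/dw_i
    return [w - alpha * g for w, g in zip(weights, grads)]

weights = [0.1, -0.2, 0.05]      # arbitrary starting parameters
cat_image = [1.0, 0.5, -1.5]     # made-up features standing in for a cat image

before = predict(weights, cat_image)
for _ in range(100):
    weights = sgd_step(weights, cat_image, target=1.0)
after = predict(weights, cat_image)

print("prediction before:", before)
print("prediction after: ", after)
```

Repeating the step on the same image should move the prediction towards 1. A real network would have many layers and many images, but the update has the same shape.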

It’s this adjustment of each parameter that I want to rethink. The above procedure is Stochastic Gradient Descent (SGD): we adjust $w_i$ to reduce the error on our training set (I’m glossing over overfitting, minibatches, etc.).

Key Idea

This means that we are also trying to find a local minimum, i.e. once trained, we want the property that if we varied any of the parameters $w_i$ by a small amount then it should increase the expected squared error $E$.

My idea is to encode this into the SGD update. To be at a local minimum of the error for a particular training image (here, an image of a cat) we want:

$\dfrac{\partial y_{cat}}{\partial w_i} = 0$

$\dfrac{\partial^2 y_{cat}}{\partial w_i^2} < 0$ (or if it equals 0, we need to consider the third derivative, etc.).
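To see why conditions on $y_{cat}$ give a minimum of the error, here is a short derivation. It assumes the image really is a cat (target 1), that the network currently predicts $y_{cat} < 1$, and that the error is the half squared error:

$E = \dfrac{1}{2}\left(y_{cat} - 1\right)^2$

$\dfrac{\partial E}{\partial w_i} = \left(y_{cat} - 1\right)\dfrac{\partial y_{cat}}{\partial w_i}$

$\dfrac{\partial^2 E}{\partial w_i^2} = \left(\dfrac{\partial y_{cat}}{\partial w_i}\right)^2 + \left(y_{cat} - 1\right)\dfrac{\partial^2 y_{cat}}{\partial w_i^2}$

When $\dfrac{\partial y_{cat}}{\partial w_i} = 0$ the first derivative of $E$ vanishes, and the second derivative reduces to $\left(y_{cat} - 1\right)\dfrac{\partial^2 y_{cat}}{\partial w_i^2}$. Since $y_{cat} - 1 < 0$, this is positive exactly when $\dfrac{\partial^2 y_{cat}}{\partial w_i^2} < 0$. In other words, a local maximum of $y_{cat}$ is a local minimum of $E$ for a cat image.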

Let’s concentrate on just the first criterion for the moment. Since we’ve already used the letter $E$ to mean the half squared error of $y$, we’ll use $F$ to be the half squared error of $\dfrac{\partial y_{cat}}{\partial w_i}$.

So we want to minimize the half squared error $F$ :

$F = \dfrac{1}{2}\left(\dfrac{\partial y_{cat}}{\partial w_i}\right)^2$

So to minimize we need the gradient of this error:

$\dfrac{\partial F}{\partial w_i} = \dfrac{1}{2} \dfrac{\partial}{\partial w_i} \left(\dfrac{\partial y_{cat}}{\partial w_i}\right)^2 = 0$

Applying the chain rule:

$\dfrac{\partial F}{\partial w_i} = \dfrac{\partial y_{cat}}{\partial w_i} \dfrac{\partial^2 y_{cat}}{\partial w_i^2} = 0$
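The chain-rule step can be sanity-checked numerically. The sketch below assumes the simplest possible case, a single sigmoid neuron with one weight, $y = \sigma(wx)$, for which both derivatives of $y$ have closed forms. The function names are made up for this example. It compares the chain-rule expression against a central finite difference of $F$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dy_dw(w, x):
    """First derivative of y = sigmoid(w * x) with respect to w."""
    y = sigmoid(w * x)
    return x * y * (1.0 - y)

def d2y_dw2(w, x):
    """Second derivative of y = sigmoid(w * x) with respect to w."""
    y = sigmoid(w * x)
    return x * x * y * (1.0 - y) * (1.0 - 2.0 * y)

def F(w, x):
    """Half squared error of dy/dw, as defined in the text."""
    return 0.5 * dy_dw(w, x) ** 2

def dF_dw_chain_rule(w, x):
    """dF/dw = (dy/dw) * (d^2 y / dw^2), the chain-rule result."""
    return dy_dw(w, x) * d2y_dw2(w, x)

def dF_dw_numeric(w, x, h=1e-5):
    """Central finite-difference estimate of dF/dw."""
    return (F(w + h, x) - F(w - h, x)) / (2.0 * h)

w, x = 0.7, 1.3   # arbitrary test point
analytic = dF_dw_chain_rule(w, x)
numeric = dF_dw_numeric(w, x)
print("chain rule:       ", analytic)
print("finite difference:", numeric)
```

The two numbers should agree to several decimal places, which is a cheap way to catch a slipped sign or a missing factor before wiring the term into an update rule.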

SGD update rule

And so we can modify our SGD update rule to:

$\Delta w_i = -\alpha \dfrac{\partial E}{\partial w_i} - \beta \dfrac{\partial y_{cat}}{\partial w_i} \dfrac{\partial^2 y_{cat}}{\partial w_i^2}$

where $\alpha$ and $\beta$ are learning rate hyperparameters.
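Here is a rough sketch of what one step of this update might look like, again under the assumption of a single sigmoid neuron so that $\dfrac{\partial y_{cat}}{\partial w_i}$ and $\dfrac{\partial^2 y_{cat}}{\partial w_i^2}$ can be written down by hand. The names `predict` and `modified_sgd_step`, the feature values, and the default values of `alpha` and `beta` are all made up for illustration. This is untested as a training method; it only shows the arithmetic of the rule.

```python
import math

def predict(weights, features):
    """Hypothetical one-neuron 'network': sigmoid of a weighted sum."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def modified_sgd_step(weights, features, target, alpha=0.5, beta=0.1):
    """One step of: Delta w_i = -alpha * dE/dw_i - beta * dy/dw_i * d2y/dw_i^2."""
    y = predict(weights, features)
    s = y * (1.0 - y)                        # derivative of the sigmoid at z
    new_weights = []
    for w, x in zip(weights, features):
        dy = s * x                           # dy/dw_i
        d2y = s * (1.0 - 2.0 * y) * x * x    # d^2 y / dw_i^2 (diagonal term only)
        dE = (y - target) * dy               # dE/dw_i for E = 0.5 * (y - target)**2
        new_weights.append(w - alpha * dE - beta * dy * d2y)
    return new_weights

weights = [0.1, -0.2, 0.05]      # arbitrary starting parameters
cat_image = [1.0, 0.5, -1.5]     # made-up features standing in for a cat image

before = predict(weights, cat_image)
for _ in range(200):
    weights = modified_sgd_step(weights, cat_image, target=1.0)
after = predict(weights, cat_image)

print("prediction before:", before)
print("prediction after: ", after)
```

Setting `beta=0.0` recovers the ordinary SGD step, which makes it easy to compare the two rules side by side on the same data. For a real multi-layer network the second derivatives would have to come from an automatic differentiation library rather than a hand-written formula.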

Conclusion

We finished with a new SGD update rule. I have no idea if this will actually be any better, and the only way to find out is to test it. This is left as an exercise for the reader 😀


John Tapsell

Director of Flux Programming Ltd – Available for hire! See 'About'!

Month: August 2016

Second order back propagation
