Aperiodic Monotiles

Everyone in the math community is excited about the recently discovered aperiodic monotiles.

I don’t have much to add, but I did recreate it in Solidworks and laser cut it out:

I created a hexagon, subdivided it with construction lines, and saved that as a block.

I then inserted 5 more of these blocks:

Note that I think the ‘center’ point doesn’t actually need to be in the center for it to still tile, which allows for more interesting shapes:

Then I drew the shape I wanted, shown in orange, and saved it as a block.

Then I created TWO new blocks. In the first block, I inserted that orange outline block and drew on the t-shirt front pattern. In the second block, I inserted the same outline block, but mirrored, and drew on the back of the t-shirt.

Then I could insert a whole bunch of the two blocks, and manually arrange them together like a jigsaw, snapping edges together.

Then I saved it as a DXF file and imported it into Inkscape, manually moving the cut lines and score lines to separate layers and giving them separate colors. I also had to manually delete overlapping lines. I’m not aware of a better approach.

Then cut on my laser cutter:


Wifi / bluetooth RSSI signal strength to distance

Imagine you have a bluetooth device somewhere in your house, and you want to try to locate it.  You have several other devices in the house that can see it, and so you want to triangulate its position.

This article is about the result I achieved, and the methodology.

I spent two solid weeks working on this, and this was the result:

RSSI Location of a single bluetooth device

Estimating the most probable position of a bluetooth device, based on 7 strength readings.

The map is overlaid in blue with the most likely position of a lost bluetooth device.

And a more advanced example, this time with many different devices, and a more advanced algorithm (discussed further below).

Note that some of the devices have very large uncertainties in their position.

Likelihood estimation of many devices

Estimating position of multiple bluetooth devices, based on RSSI strength.

And to tie it all together, a mobile app:

Find It App Screen

Methodology

There are quite a few articles, videos, and books on estimating the distance to a bluetooth device (or wifi hub) from its RSSI signal strength readings.

But I couldn’t find any that gave the uncertainty of that estimation.

So here I derive it myself, using the log-distance path loss model:

rssi = -10n \log_{10}(d) + A + Noise

In this model:

  •  rssi is the signal strength, measured in dB
  • n is a ‘path loss exponent’:
    [Table of path loss exponents in various environments]
  • A is the rssi value at 1 meter away
  • Noise is Normally distributed, with mean 0 and variance σ²
  • d is our distance (above) or estimated distance (below)
Rearranging to get an estimated distance, we get:
d = 10^{\dfrac{A - rssi + Noise}{10n}}
Now Noise is sampled from a Normal distribution, with mean = 0 and variance = σ², so let’s write our estimated d as a random variable:
d \sim 10^{\dfrac{A - rssi + N(0,\sigma^2)}{10n}}
Important note: Note that random variable d is distributed as the probability of the rssi given the distance.  Not the probability of the distance given the rssi.  This is important, and it means that we need to at least renormalize the probabilities over all possible distances to make sure that they add up to 1.  See section at end for more details.
Adding a constant to a normal distribution just shifts the mean:
d \sim 10^{\dfrac{N(A - rssi,\sigma^2)}{10n}}
Now let’s have a bit of fun, by switching it to base e.  This isn’t actually necessary, but it makes it straightforward to match up with Wikipedia’s formulas later, so:
d \sim e^{\dfrac{N(A - rssi,\sigma^2) \cdot \ln(10)}{10n}}
d \sim e^{N(A - rssi,\sigma^2)\cdot \ln(10)/(10n)}
d \sim \mathrm{Lognormal}(A - rssi,\sigma^2)^{\frac{\ln(10)}{10n}}
And we thus get our final result:
\boxed{d \sim \mathrm{Lognormal}((A - rssi)\cdot\ln(10)/(10n),\sigma^2\cdot\ln(10)^2/(10n)^2)}

[Plot of the resulting distance distribution]

Distance in meters against probability density, for an rssi value of -80, A=-30, n=3, sigma^2=40
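To make this concrete, here is a minimal Python sketch (assuming numpy and scipy; the parameters match the plot above) that evaluates this distribution:

# A quick sketch of the boxed result above, using the plot's example
# parameters: rssi=-80, A=-30, n=3, sigma^2=40.
import numpy as np
from scipy.stats import lognorm

A, rssi, n, sigma2 = -30.0, -80.0, 3.0, 40.0

k = np.log(10.0) / (10.0 * n)        # ln(10)/(10n)
mu = (A - rssi) * k                  # mean of the underlying normal
sigma = np.sqrt(sigma2) * k          # std-dev of the underlying normal

dist = lognorm(s=sigma, scale=np.exp(mu))

d = np.linspace(0.1, 100.0, 1000)
pdf = dist.pdf(d)                    # density of the rssi given distance d
print("Most likely distance: %.1f m" % d[np.argmax(pdf)])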

Bayes Theorem

I mentioned earlier:

Important note: Note that random variable d is distributed as the probability of the rssi given the distance.  Not the probability of the distance given the rssi.  This is important, and it means that we need to at least renormalize the probabilities over all possible distances to make sure that they add up to 1.  See section at end for more details.

So using the above graph as an example, say that our measured RSSI was -80 and the probability density for d = 40 meters is 2%.

This means:

P(measured_rssi = -80|distance=40m) = 2%

But what we actually want to know is:

P(distance=40m | measured_rssi=-80)

So we need to apply Bayes theorem:

P(\text{distance} \mid \text{rssi}) = \dfrac{P(\text{rssi} \mid \text{distance}) \cdot P(\text{distance})}{P(\text{rssi})}

So we have an update rule of:

P(distance=40m | measured_rssi=-80) \propto P(measured_rssi = -80|distance=40m) \cdot P(distance=40m)

Where we initially assume that all positions within an area are equally likely, i.e. P(distance=40m) = 1/area.

And we renormalize the probability to ensure that sum of probabilities over all distances = 1.
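Here is a sketch of that update for a single reading on a toy 10m × 10m grid (the receiver position, grid size and model parameters are all made up for illustration):

# Likelihood of each grid cell given one rssi reading, times a uniform
# prior, renormalized so the probabilities sum to 1.
import numpy as np
from scipy.stats import lognorm

def posterior_grid(receiver_xy, rssi, A=-30.0, n=3.0, sigma2=40.0, size=50):
    k = np.log(10.0) / (10.0 * n)
    dist = lognorm(s=np.sqrt(sigma2) * k, scale=np.exp((A - rssi) * k))

    xs, ys = np.meshgrid(np.linspace(0, 10, size), np.linspace(0, 10, size))
    d = np.hypot(xs - receiver_xy[0], ys - receiver_xy[1])

    prior = np.ones_like(d) / d.size          # uniform: 1/area
    post = dist.pdf(d) * prior                # P(rssi|d) * P(d)
    return post / post.sum()                  # renormalize to sum to 1

p = posterior_grid(receiver_xy=(5.0, 5.0), rssi=-80.0)

With several receivers, you would multiply in one such likelihood per reading before renormalizing.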

 

Laplace Transform – visualized

The Laplace Transform is a widely used tool in mathematics, science, and engineering, and there are many books and web pages about it.

My animations are now on Wikipedia: https://en.wikipedia.org/wiki/Laplace_transform

And yet I cannot find a single decent visualization of it!  Not a single person that I can find appears to have actually tried to visualize what it is doing.  There are plenty of animations of the Fourier Transform, like:

[Fourier Transform animation]

But nothing for Laplace Transform that I can find.

So, I will attempt to fill that gap.

What is the Laplace Transform?

It’s a way to represent a function that is 0 for time < 0 (typically) as a sum of many waves that look more like:

[Graph of e^t cos(10t)]

Note that what I just said isn’t entirely true, because there’s an imaginary component here too, and we’re actually integrating.  So take this as a crude idea just to get started, and let’s move onto the math to get a better idea:

Math

The goal of this is to visualize how the Laplace Transform works:

\displaystyle\mathscr{L}\{f(t)\}=F(s)

To do this, we need to look at the definition of the inverse Laplace Transform: (Using Mellin’s inverse formula: https://en.wikipedia.org/wiki/Inverse_Laplace_transform )

\displaystyle f(t) = \mathscr{L}^{-1}\{F(s)\}=\frac{1}{2\pi j}\int^{c+j\infty}_{c-j\infty} F(s)e^{st}\mathrm{d}s

While pretty, it’s not so nice to work with, so let’s make the substitution:

\displaystyle s := c+jr

so that our new limits are just -\infty to \infty, and \mathrm{d}s/\mathrm{d}r = j, giving:

\displaystyle f(t) = \mathscr{L}^{-1}\{F(s)\}=\frac{1}{2\pi j}\int^{\infty}_{-\infty} F(c+jr)e^{(c+jr)t}j\mathrm{d}r

\displaystyle = \frac{1}{2\pi}\int^{\infty}_{-\infty} F(c+jr)e^{(c+jr)t}\mathrm{d}r

Which we will now approximate as:

\displaystyle \approx \frac{1}{2\pi}\sum^{n}_{i=-n} F(c+jr_i)e^{(c+jr_i)t}\Delta r_i
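As a quick illustration of this sum (a crude sketch, not the notebook code linked below), here it is in Python for F(s) = 1/s^2, whose inverse transform is f(t) = t:

# Crude Riemann-sum approximation of the inverse Laplace integral above,
# for F(s) = 1/s^2, i.e. f(t) = t.
import numpy as np

def inverse_laplace(F, t, c=1.0, r_max=100.0, n=2000):
    r = np.linspace(-r_max, r_max, n)
    dr = r[1] - r[0]
    s = c + 1j * r
    # The imaginary parts cancel by symmetry, so keep only the real part
    return np.real(np.sum(F(s) * np.exp(s * t)) * dr / (2.0 * np.pi))

for t in [0.5, 1.0, 2.0]:
    print(t, inverse_laplace(lambda s: 1.0 / s**2, t))   # approximately t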

Code

The code turned out to be a bit too large for a blog post, so I’ve put it here:

https://github.com/johnflux/laplace-transform-visualized/blob/master/Laplace%20Transform.ipynb

Results

Note: The graphs say “Next frequency to add: … where s = c+rj”, but really it should be “Next two frequencies to add: … where s = c\pm rj” since we are adding two frequencies at a time, in such a way that their imaginary parts cancel out, allowing us to keep everything real in the plots.  I fixed this comment in the code, but didn’t want to rerender all the videos.

A cubic polynomial:

A cosine wave:

Now a square wave.  This has infinities going to infinity, so it’s not technically possible to plot.  But I tried anyway, and it seems to visually work:

Gibbs Phenomenon

Note the overshoot ‘ringing’ at the corners in the square wave. This is the Gibbs phenomenon and occurs in Fourier Transforms as well. See that link for more information.

Now one that it absolutely can’t handle: \delta(t).  (A function that is 0 everywhere, except a sharp peak at exactly time = 0.)  In the S domain, this is a constant, meaning that we never converge.  But visually it’s still cool.

Note that this never ‘settles down’ (converges) because the frequency is constantly increasing while the magnitude remains constant.

There is visual ‘aliasing’ (like how a wheel can appear to go backwards as its speed increases). This is not “real” – it is an artifact of trying to render high frequency waves. If we rendered (and played back) the video at a higher resolution, the effect would disappear.

At the very end, it appears as if the wave is just about to converge. This is not a coincidence, and it isn’t real. It happens because the frequency of the waves becomes so high that we just don’t see them, making the line appear smooth, when in reality the waves are just too close together to see.

The code automatically calculates this point and sets our time step such that it only breaks down at the very end of the video. If we made the timestep smaller, this effect would disappear.

And a simple step function:

A sawtooth:

Erasing background from an image

I have two opaque images –  one with an object and a background, and another with just the background.  Like:

I want to subtract the background from the image so that the alpha blended result is visually identical, but the foreground is as transparent as possible.

E.g:

[Lion image with the background removed]

Desired output (All images under Reuse With Modification license)

I’m sure that this must have been solved before, but I couldn’t find a single correct way of doing this!

I asked a developer from the GIMP image editor team, and they replied that the standard way is to create an alpha mask on the front image from the difference between the two images.  i.e. for each pixel in both layers, subtract the rgb values, average that difference across the three channels, and then use that as an alpha.

But this is clearly not correct.  Imagine the foreground has a green piece of semi-transparent glass against a red background.  Just using an alpha mask is clearly not going to subtract the background because you need to actually modify the rgb values in the top layer image to remove all the red.

So what is the correct solution?  Let’s do the calculations.

For a solid background with a semi-transparent foreground layer alpha blended on top, the final visual color is:

out_{rgb} = src_{rgb} * src_{alpha} + dst_{rgb} \cdot (1-src_{alpha})

We want the visual result to be the same, so we know the value of out_{rgb} – that’s our original foreground+background image.  And we know dst_{rgb} – that’s our background image.  We want to now create a new foreground image, src_{rgb}, with the smallest possible value of src_{alpha} (i.e. as transparent as possible).

So to restate this again – I want to know how to change the top layer src so that I can have the lowest possible alpha without changing the final visual image at all.  I.e. remove as much of the background as possible from our foreground+background image.

Rearranging the blend equation gives src_{rgb} = (out_{rgb} - dst_{rgb} \cdot (1-src_{alpha}))/src_{alpha}.  Note that we also have the constraint that for each color channel, src_{rgb} \le 1, since each rgb pixel value is between 0 and 1.  So:

src_{alpha} \ge (out_{rgb} - dst_{rgb})/(1-dst_{rgb})

(The constraint src_{rgb} \ge 0 similarly gives src_{alpha} \ge (dst_{rgb} - out_{rgb})/dst_{rgb}.)  Taking the smallest alpha that satisfies every channel:

src_{alpha} = Max((out_r - dst_r)/(1-dst_r), (out_g - dst_g)/(1-dst_g), (out_b - dst_b)/(1-dst_b))\\ src_{rgb} = (out_{rgb} - dst_{rgb} \cdot (1-src_{alpha}))/src_{alpha}
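As a sanity check, here is a minimal numpy sketch of this unblending (assuming float rgb images with values in [0,1]; the eps guard just avoids division by zero):

# Given composite image `out` and background `dst` (arrays of shape
# (..., 3), values in [0,1]), recover the most-transparent foreground.
import numpy as np

def unblend(out, dst, eps=1e-6):
    # Per-channel lower bounds on alpha, from src <= 1 and src >= 0:
    bound_hi = (out - dst) / np.maximum(1.0 - dst, eps)
    bound_lo = (dst - out) / np.maximum(dst, eps)
    a = np.max(np.maximum(bound_hi, bound_lo), axis=-1, keepdims=True)
    a = np.clip(a, 0.0, 1.0)
    src = (out - dst * (1.0 - a)) / np.maximum(a, eps)
    return np.clip(src, 0.0, 1.0), a

Compositing src over dst with this alpha reproduces out exactly wherever the alpha is nonzero.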

Proposal

Add an option for the GIMP eraser tool to ‘remove layers underneath’, which grabs the rgb value of the layer underneath and applies the formula using the alpha in the brush as a normal erasure would, but bounding the alpha to be no less than the equation above, and modifying the rgb values accordingly.

Result

I showed this to the GIMP team, and they found a way to do this with the latest version in git.  Open the two images as layers.  For the top layer do: Layer->Transparency->Add Alpha Channel.  Select the Clone tool.  On the background layer, Ctrl-click anywhere to set the Clone source.  In the Clone tool options, choose Default and Color erase, and set Alignment to Registered.  Make the brush size large, select the top layer again, and click on it to erase everything.

Result is:

[Screenshot of the result]

When the background is a very different color, it works great – the sky was very nicely erased.  But when the colors are too similar, it goes completely wrong.

Overall..  a failure.  But interesting.

Second order back propagation

This is a random idea that I’ve been thinking about. A reader messaged me to say that this looks similar to online L-BFGS. To my inexperienced eyes, I can’t see the similarity myself, but I’m still somewhat of a beginner.

[Photo of a cat]

Yes, I chose a cat example simply so that I had an excuse to add cats to my blog.

Say we are training a neural network to take images of animals and classify the image as being an image of a cat or not a cat.

You would train the network to output, say, y_{cat} = 1 if the image is that of a cat.

To do so, we can gather some training data (images labelled by humans), and for each image we see what our network predicts (e.g. “I’m 40% sure it’s a cat”).  We compare that against what a human says (“I’m 100% sure it’s a cat”), find the squared error (“We’re off by 0.6, so our squared error is 0.6^2”), and adjust each parameter, w_i, in the network so that it slightly decreases the error (\Delta w_i  = -\alpha \partial E/\partial w_i).  And then repeat.

It’s this adjustment of each parameter that I want to rethink.  The above procedure is Stochastic Gradient Descent (SGD) – we adjust w_i to reduce the error for our test set (I’m glossing over overfitting, minibatches, etc).

Key Idea

This means that we are also looking for a local minimum, i.e. that once trained, we want the property that if we varied any of the parameters w_i by a small amount, then it should increase the expected squared error E.

My idea is to encode this into the SGD update. To find a local minimum for a particular test image we want:

\dfrac{\partial y_{cat}}{\partial w_i} = 0

\dfrac{\partial^2y_{cat}}{\partial w_i^2}  < 0 (or if it equals 0, we need to consider the third derivative, etc.)

Let’s concentrate on just the first criterion for the moment. Since we’ve already used the letter E to mean the half squared error of y, we’ll use F to be the half squared error of \dfrac{\partial y_{cat}}{\partial w_i}.

So we want to minimize the half squared error F:

F = \dfrac{1}{2}\left(\dfrac{\partial y_{cat}}{\partial w_i}\right)^2

So to minimize we need the gradient of this error:

\dfrac{\partial F}{\partial w_i} = \dfrac{1}{2} \dfrac{\partial}{\partial w_i} \left(\dfrac{\partial y_{cat}}{\partial w_i}\right)^2 = 0

Applying the chain rule:

\dfrac{\partial F}{\partial w_i} = \dfrac{\partial y_{cat}}{\partial w_i} \dfrac{\partial^2 y_{cat}}{\partial w_i^2} = 0

SGD update rule

And so we can modify our SGD update rule to:

\Delta w_i = -\alpha \partial E/\partial w_i - \beta \dfrac{\partial y_{cat}}{\partial w_i} \dfrac{\partial^2 y_{cat}}{\partial w_i^2}

Where \alpha and \beta are learning rate hyperparameters.
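As a toy illustration of this rule (a made-up one-parameter model y = tanh(wx) with the derivatives written out by hand, not a real network):

# Toy sketch of the modified update rule, for y = tanh(w*x).
import numpy as np

def update_step(w, x, target, alpha=0.1, beta=0.01):
    y = np.tanh(w * x)
    dy_dw = x * (1.0 - y**2)                  # dy/dw
    d2y_dw2 = -2.0 * x**2 * y * (1.0 - y**2)  # d^2y/dw^2
    dE_dw = (y - target) * dy_dw              # gradient of half squared error E
    return w - alpha * dE_dw - beta * dy_dw * d2y_dw2

w = 0.5
for _ in range(200):
    w = update_step(w, x=1.0, target=1.0)
print(w)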

Conclusion

We finished with a new SGD update rule. I have no idea if this actually will be any better, and the only way to find out is to actually test. This is left as an exercise for the reader 😀

Best-fitting to a Cumulative distribution function in python TensorFlow

I wanted to find a best-fit curve for some data points when I know that the true curve that I’m predicting is a parameter-free Cumulative Distribution Function (CDF).  I could just do a linear regression on the points, but the resulting function, y, might not have the properties that we desire from a CDF, such as:

  • Monotonically increasing
  • y(x) \xrightarrow{x\to\infty} 1      i.e. That it tends to 1 as x approaches positive infinity
  • y(x) \xrightarrow{x\to-\infty} 0      i.e. That it tends to 0 as x approaches negative infinity

First, consider the last two properties.  To deal with them, I use the sigmoid function to create a new variable, x’:

x'(x) = \textrm{sigmoid}(x)

This has the nice property that x'(x) \xrightarrow{x\to\infty} 1 and x'(x) \xrightarrow{x\to-\infty} 0

So now we can find a best fit polynomial on x'.

So:

y(x') = a + bx' + cx'^2 + dx'^3 + .. + zx'^n

But with the boundary conditions that:

y(0) = 0 and y(1) = 1

Which, solving for the boundary conditions, gives us:

y(0) = 0 = a
b = (1-c-d-e-...-z)

Which simplifies our function to:

y(x') = (1-c-d-e-..-z)x' + cx'^2 + dx'^3 + ex'^4 .. + zx'^n
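A quick numeric check (with arbitrary made-up coefficients) that this simplified form satisfies the boundary conditions automatically:

# Verify y(0) = 0 and y(1) = 1 for arbitrary coefficients c, d, e, ...
import numpy as np

coeffs = np.array([0.3, -0.2, 0.7])   # c, d, e (arbitrary)
b = 1.0 - coeffs.sum()

def y(xp):
    return b * xp + sum(c * xp**(i + 2) for i, c in enumerate(coeffs))

print(y(0.0), y(1.0))   # -> 0.0 1.0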

Implementing this in tensorflow using stochastic gradient descent (code is below) we get:

figure_1-2

20 data samples

figure_1-1

100 data samples

(The graph title equation is wrong.  It should be   x' = sigmoid(x) then y(x') = ax' +...  I was too lazy to update the graphs sorry)

Unfortunately this curve doesn’t have one main property that we’d like – it’s not monotonically increasing – it goes above 1. I thought about it for a few hours and couldn’t think of a nice solution.

I’m sure all of this has been solved properly before, but with a quick google I couldn’t find anything.

Edit: I’m updating this 12 days later. I have been thinking in the background about how to enforce that my function is monotonic (i.e. always increasing/decreasing). I was certain that I was just being stupid and that there was a simple answer. But by complete coincidence, I watched the youtube video Breakthroughs in Machine Learning – Google I/O 2016, in which the last speaker mentioned this restriction.  It turns out that this is a very difficult problem – with the first solutions having a time complexity of O(N^4):

I didn’t understand what exactly Google’s solution for this is, but they reference their paper on it:  Monotonic Calibrated Interpolated Look-up Tables, JMLR 2016.  It seems that I have some light reading to do!

#!/usr/bin/env python

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from scipy.stats import norm

# Create some random data as a binned normal function
plt.ion()
n_observations = 20
fig, ax = plt.subplots(1, 1)
xs = np.linspace(-3, 3, n_observations)

ys = norm.pdf(xs) * (1 + np.random.uniform(-1, 1, n_observations))

ys = np.cumsum(ys)

ax.scatter(xs, ys)
fig.show()
plt.draw()

highest_order_polynomial = 3

# We have our data now, so on with the tensorflow
# Setup the model graph

# Our input is an arbitrary number of data points (that's what the 'None' dimension means)
# and each input has just a single value which is a float

X = tf.placeholder(tf.float32, [None])
Y = tf.placeholder(tf.float32, [None])

# Now, we know data fits a CDF function, so we know that
# ys(-inf) = 0   and ys(+inf) = 1

# So let's set:

X2 = tf.sigmoid(X)

# So now X2 is between [0,1]

# Let's now fit a polynomial like:
#
#   Y = a + b*(X2) + c*(X2)^2 + d*(X2)^3 + .. + z(X2)^(highest_order_polynomial)
#
# to those points.  But we know that since it's a CDF:
#   Y(0) = 0 and y(1) = 1
#
# So solving for this:
#   Y(0) = 0 = a
#   b = (1-c-d-e-...-z)
#
# This simplifies our function to:
#
#   y = (1-c-d-e-..-z)x + cx^2 + dx^3 + ex^4 .. + zx^(highest_order_polynomial)
#
# i.e. we have highest_order_polynomial-1 weights (for powers 2..highest_order_polynomial)
W = tf.Variable(tf.zeros([highest_order_polynomial-1]))

# Now set up our equation:
Y_pred = tf.Variable(0.0)
b = (1 - tf.reduce_sum(W))
# Note: X2 is already sigmoid(X), so the linear term is just b * X2
Y_pred = tf.add(Y_pred, b * X2)

for n in xrange(2, highest_order_polynomial+1):
    Y_pred = tf.add(Y_pred, tf.mul(tf.pow(X2, n), W[n-2]))

# Loss function measure the distance between our observations
# and predictions and average over them.
cost = tf.reduce_sum(tf.pow(Y_pred - Y, 2)) / (n_observations - 1)

# Regularization if we want it.  Only needed if we have a large value for the highest_order_polynomial and are worried about overfitting
# It's also not clear to me if regularization is going to be doing what we want here
#cost = tf.add(cost, regularization_value * (b + tf.reduce_sum(W)))

# Use stochastic gradient descent to optimize W
# using adaptive learning rate
optimizer = tf.train.AdagradOptimizer(100.0).minimize(cost)

n_epochs = 10000
with tf.Session() as sess:
    # Here we tell tensorflow that we want to initialize all
    # the variables in the graph so we can use them
    sess.run(tf.initialize_all_variables())

    # Fit all training data
    prev_training_cost = 0.0
    for epoch_i in range(n_epochs):
        sess.run(optimizer, feed_dict={X: xs, Y: ys})

        training_cost = sess.run(
            cost, feed_dict={X: xs, Y: ys})

        if epoch_i % 100 == 0 or epoch_i == n_epochs-1:
            print(training_cost)

        # Allow the training to quit if we've reached a minimum
        if np.abs(prev_training_cost - training_cost) < 0.0000001:
            break
        prev_training_cost = training_cost
    print "Training cost: ", training_cost, " after epoch:", epoch_i
    w = W.eval()
    b = 1 - np.sum(w)
    equation = "x' = sigmoid(x), y = " + str(b) + "x' + " + " + ".join("{}x'^{}".format(val, idx) for idx, val in enumerate(w, start=2))
    print "For function:", equation
    pred_values = Y_pred.eval(feed_dict={X: xs}, session=sess)
    ax.plot(xs, pred_values, 'r')
    plt.title(equation)

fig.show()
plt.draw()
#ax.set_ylim([-3, 3])
plt.waitforbuttonpress()

Four point gradient as a GLSL shader

For a client I needed a way to draw a gradient in realtime generated from four points, on a mobile phone.  To produce an image like this, where the four gradient points can be moved in realtime:

gradient_with_smoothstep

I wanted to implement this using a shader, but after googling for a while I couldn’t find an existing implementation.

 

The problem is that I have the desired colours at four points, and need to calculate the colour at any other arbitrary point.

 

[Diagram of the four points P0–P3 and the interpolated points A, B, C]

So we have three simultaneous equations (vectors in bold):

\begin{array} {ccccc} \boldsymbol{A} &=& \boldsymbol{P_0} (1-u) &+& \boldsymbol{P_1} u \\ \boldsymbol{B} &=& \boldsymbol{P_2} (1-v) &+& \boldsymbol{P_3} v \\ \boldsymbol{C} &=& \boldsymbol{A} (1-t) &+& \boldsymbol{B} t \end{array}

The key to solving this is to reduce the number of degrees of freedom by setting u = v.

\begin{array} {ccccc} \boldsymbol{A} &=& \boldsymbol{P_0} (1-u) &+& \boldsymbol{P_1} u \\ \boldsymbol{B} &=& \boldsymbol{P_2} (1-u) &+& \boldsymbol{P_3} u \\ \boldsymbol{C} &=& \boldsymbol{A} (1-t) &+& \boldsymbol{B} t \end{array}

Solving for t gives us:

(\boldsymbol{A-B})t = (\boldsymbol{A-C})

And solving for \boldsymbol{A} and \boldsymbol{B}:

\begin{array} {ccl} \boldsymbol{A} &=& \boldsymbol{P_0} (1-u) + \boldsymbol{P_1}(u)\\ &=& \boldsymbol{P_0} + u(\boldsymbol{P_1} - \boldsymbol{P_0})\\ \boldsymbol{B} &=& \boldsymbol{P_2} (1-u) + \boldsymbol{P_3}(u)\\ &=& \boldsymbol{P_2} + u(\boldsymbol{P_3} - \boldsymbol{P_2}) \end{array}

so:

\boldsymbol{A}-\boldsymbol{B} = \boldsymbol{P_0} - \boldsymbol{P_2} + u(\boldsymbol{P_1} - \boldsymbol{P_0} + \boldsymbol{P_2} - \boldsymbol{P_3}) \Rightarrow \\\\ (\boldsymbol{P_0} - \boldsymbol{P_2} + u(\boldsymbol{P_1} - \boldsymbol{P_0} + \boldsymbol{P_2} - \boldsymbol{P_3}))t = \boldsymbol{P_0} + u(\boldsymbol{P_1} - \boldsymbol{P_0}) - \boldsymbol{C}\\

Defining:

\begin{array} {ccl} \boldsymbol{Q} &=& \boldsymbol{P_0} - \boldsymbol{P_2}\\ \boldsymbol{R} &=& \boldsymbol{P_1} - \boldsymbol{P_0}\\ \boldsymbol{S} &=& \boldsymbol{R} + \boldsymbol{P_2} - \boldsymbol{P_3}\\ \boldsymbol{T} &=& \boldsymbol{P_0} - \boldsymbol{C} \end{array}

Lets us now rewrite (\boldsymbol{A-B})t = (\boldsymbol{A-C}) as:

(\boldsymbol{Q} + u\boldsymbol{S})t = \boldsymbol{T} + u\boldsymbol{R}

Which written out for x and y is:

(Q.x + u S.x)t = T.x + uR.x\\ (Q.y + u S.y)t = T.y + uR.y

Now we have to consider the various different cases. If Q.x == 0\ \&\&\ S.x == 0 then we have:

u = -T.x/R.x\\ t = (T.y + uR.y) / (Q.y + u S.y)

Likewise if Q.y == 0\ \&\&\ S.y == 0 then we have:

u = -T.y/R.y\\ t = (T.x + uR.x) / (Q.x + u S.x)

Otherwise we are safe to divide by (Q.y + u S.y) and so:

\\ (Q.x + u S.x)/(Q.y + u S.y) = (T.x + uR.x)/(T.y + uR.y)\\ \\ (Q.x + u S.x)(T.y + uR.y) = (T.x + uR.x)(Q.y + u S.y)\\ \\ (Q.x + u S.x)T.y +(Q.x + u S.x) uR.y = (T.x + uR.x)Q.y + (T.x + uR.x)u S.y\\ \\ Q.x T.y + u S.x T.y +Q.x uR.y + u S.x uR.y = T.x Q.y + uR.x Q.y + T.x u S.y + uR.x u S.y\\ \\ Q.x T.y - T.x Q.y + u (S.x T.y - T.x S.y + Q.x R.y - R.x Q.y) + u^2 (S.x R.y - R.x S.y) = 0\\ \\ A := S.x R.y - R.x S.y\\ B := S.x T.y - T.x S.y + Q.x R.y - R.x Q.y\\ C := Q.x T.y - T.x Q.y\\ \\ A u^2 + B u + C = 0\\

Which can be solved with the quadratic formula when A is not zero (note that only the positive root makes physical sense for us):
u = \dfrac{-B+\sqrt{B^2 - 4AC}}{2A}

and when A is zero (or, for computational rounding purposes, close to zero), the equation reduces to Bu + C = 0, giving:

u = \dfrac{-C}{B}

Math Summary

We can determine u and t with:

\begin{array} {ccl} \boldsymbol{Q} &=& \boldsymbol{P_0} - \boldsymbol{P_2}\\ \boldsymbol{R} &=& \boldsymbol{P_1} - \boldsymbol{P_0}\\ \boldsymbol{S} &=& \boldsymbol{R} + \boldsymbol{P_2} - \boldsymbol{P_3}\\ \boldsymbol{T} &=& \boldsymbol{P_0} - \boldsymbol{C}\\ A &:=& S.x R.y - R.x S.y\\ B &:=& S.x T.y - T.x S.y + Q.x R.y - R.x Q.y\\ C &:=& Q.x T.y - T.x Q.y\\ u &=& \begin{cases} -T.x/R.x, & \mbox{if } \mbox{ Q.x == 0.0\ \&\&\ S.x == 0.0} \\ -T.y/R.y, & \mbox{if } \mbox{ Q.y == 0.0\ \&\&\ S.y == 0.0} \\ \dfrac{-C}{B}, & \mbox{if A == 0} \\ \dfrac{-B+\sqrt{B^2 - 4AC}}{2A}, & \mbox{otherwise} \\ \end{cases} \\\\ t &=& \begin{cases} \dfrac{T.x + u R.x}{Q.x + u S.x}, & \mbox{if } \mbox{ Q.y == 0.0\ \&\&\ S.y == 0.0} \\ \dfrac{T.y + u R.y}{Q.y + u S.y}, & \mbox{otherwise} \\ \end{cases} \end{array}

Implementation as a shader:

This can be implemented as a shader as follows (the following can be run directly on https://www.shadertoy.com, for example):

void mainImage( out vec4 fragColor, in vec2 fragCoord )
{

    //Example colors.  In practice these would be passed in
    // as parameters
    vec4 color0 = vec4(0.918, 0.824, 0.573, 1.0); // EAD292
    vec4 color1 = vec4(0.494, 0.694, 0.659, 1.0); // 7EB1A8
    vec4 color2 = vec4(0.992, 0.671, 0.537, 1.0); // FDAB89
    vec4 color3 = vec4(0.859, 0.047, 0.212, 1.0); // DB0C36

    vec2 uv = fragCoord.xy / iResolution.xy;

    //Example coordinates.  In practice these would be passed in
    // as parameters
    vec2 P0 = vec2(0.31,0.3);
    vec2 P1 = vec2(0.7,0.32);
    vec2 P2 = vec2(0.28,0.71);
    vec2 P3 = vec2(0.72,0.75);

    vec2 Q = P0 - P2;
    vec2 R = P1 - P0;
    vec2 S = R + P2 - P3;
    vec2 T = P0 - uv;

    float u;
    float t;

    if(Q.x == 0.0 && S.x == 0.0) {
        u = -T.x/R.x;
        t = (T.y + u*R.y) / (Q.y + u*S.y);
    } else if(Q.y == 0.0 && S.y == 0.0) {
        u = -T.y/R.y;
        t = (T.x + u*R.x) / (Q.x + u*S.x);
    } else {
        float A = S.x * R.y - R.x * S.y;
        float B = S.x * T.y - T.x * S.y + Q.x*R.y - R.x*Q.y;
        float C = Q.x * T.y - T.x * Q.y;
        // Solve Au^2 + Bu + C = 0
        if(abs(A) < 0.0001)
            u = -C/B;
        else
            u = (-B+sqrt(B*B-4.0*A*C))/(2.0*A);
        t = (T.y + u*R.y) / (Q.y + u*S.y);
    }
    u = clamp(u,0.0,1.0);
    t = clamp(t,0.0,1.0);

    // These two lines smooth out t and u to avoid visual 'lines' at the boundaries.  They can be removed to improve performance at the cost of graphics quality.
    t = smoothstep(0.0, 1.0, t);
    u = smoothstep(0.0, 1.0, u);

    vec4 colorA = mix(color0,color1,u);
    vec4 colorB = mix(color2,color3,u);
    fragColor = mix(colorA, colorB, t);
}

And the same code again, but formatted as a more typical shader and with parameters:

          uniform lowp vec4 color0;
          uniform lowp vec4 color1;
          uniform lowp vec4 color2;
          uniform lowp vec4 color3;
          uniform lowp vec2 p0;
          uniform lowp vec2 p1;
          uniform lowp vec2 p2;
          uniform lowp vec2 p3;
          varying lowp vec2 coord;

          void main() {
            lowp vec2 Q = p0 - p2;
            lowp vec2 R = p1 - p0;
            lowp vec2 S = R + p2 - p3;
            lowp vec2 T = p0 - coord;

            lowp float u;
            lowp float t;

            if(Q.x == 0.0 && S.x == 0.0) {
                u = -T.x/R.x;
                t = (T.y + u*R.y) / (Q.y + u*S.y);
            } else if(Q.y == 0.0 && S.y == 0.0) {
                u = -T.y/R.y;
                t = (T.x + u*R.x) / (Q.x + u*S.x);
            } else {
                float A = S.x * R.y - R.x * S.y;
                float B = S.x * T.y - T.x * S.y + Q.x*R.y - R.x*Q.y;
                float C = Q.x * T.y - T.x * Q.y;
                // Solve Au^2 + Bu + C = 0
                if(abs(A) < 0.0001)
                    u = -C/B;
                else
                    u = (-B+sqrt(B*B-4.0*A*C))/(2.0*A);
                t = (T.y + u*R.y) / (Q.y + u*S.y);

            }
            u = clamp(u,0.0,1.0);
            t = clamp(t,0.0,1.0);

            // These two lines smooth out t and u to avoid visual 'lines' at the boundaries.  They can be removed to improve performance at the cost of graphics quality.
            t = smoothstep(0.0, 1.0, t);
            u = smoothstep(0.0, 1.0, u);

            lowp vec4 colorA = mix(color0,color1,u);
            lowp vec4 colorB = mix(color2,color3,u);
            gl_FragColor = mix(colorA, colorB, t);
          }

Result

On the left is the result using smoothstep to smooth t and u, and on the right is the result without it.  Although the non-linear smoothstep is computationally expensive and a hack, I wasn’t happy with the visual lines that resulted from not using it.