This blog post breaks down the meat and potatoes of machine learning algorithms: gradient descent. It builds on the ideas presented in my post on linear regression, so I recommend checking out that post first. All that aside, let's get into it.
What is gradient descent trying to do? At its core, the purpose of gradient descent is to figure out a function that draws a line fitting a given data set. More specifically, it adjusts the weights and biases (or in math terms, the slope and intercept) in order to find the best line. How does it do this?
To start, let us imagine we have three data points on a graph with the following coordinates:
(0, 2)
(2, 5)
(6, 13)
Now, with plain linear regression, we would have to solve for the loss of every possible line, plot those losses on a graph, and then find through plug and chug which line produces the least loss. To calculate loss, we use a test function like f(x) = 4x + 1. Plugging in the above values for x, our test function predicts y values of 1, 9, and 25 respectively, compared to the actual values of 2, 5, and 13. The residuals (actual minus predicted) are 1, -4, and -12 respectively. Using the sum of squared residuals as our loss function, we find the total loss to be 1 + 16 + 144 = 161.
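If you want to sanity-check that arithmetic, here is a tiny Python sketch (the names are just mine for illustration) that computes the sum of squared residuals for the test function f(x) = 4x + 1:

```python
# Our three data points and the test function f(x) = 4x + 1
points = [(0, 2), (2, 5), (6, 13)]

def predict(x):
    return 4 * x + 1

# Sum of squared residuals: residual = actual - predicted
ssr = sum((y - predict(x)) ** 2 for x, y in points)
print(ssr)  # 161
```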
Our loss function can then be represented as shown below:
Sum of Squared Residuals = (actual(data1) - predicted(data1))^2 + (actual(data2) - predicted(data2))^2 + (actual(data3) - predicted(data3))^2
If we start plugging in numbers from our example we get:
Sum of Squared Residuals = (2 - (x*0 + 1))^2 + (5 - (x*2 + 1))^2 + (13 - (x*6 + 1))^2
Having the loss written as a proper function allows us to solve for the variable term. Note: to start, we are using a constant value for the intercept to keep the math simple. In practice we solve for both the slope (x) and the intercept (for now it is fixed at 1).
To find the minimum of this loss function, which is where its derivative is equal to 0, we need to take the derivative of the loss with respect to the slope. Using the chain rule, we find that derivative to be:
d/d slope Sum of Squared Residuals = -2*0*(2 - (x*0 + 1)) + -2*2*(5 - (x*2 + 1)) + -2*6*(13 - (x*6 + 1))
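Translated into code, this derivative is just a function of the slope. Here is a small helper with a name (d_ssr_d_slope) that I made up for illustration, keeping the intercept fixed at 1:

```python
points = [(0, 2), (2, 5), (6, 13)]

def d_ssr_d_slope(slope, intercept=1):
    # Chain rule applied to each (actual - predicted)^2 term
    return sum(-2 * x * (y - (slope * x + intercept)) for x, y in points)

# For example, d_ssr_d_slope(1.0) evaluates to -80.0
```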
Now we can pick a random starting value for x and plug it into that formula to find the derivative at that point. If we set x = 1, we find the derivative (the slope of the loss curve at that point) to be -80. Knowing this derivative, we could apply it directly to the current slope value (x) and calculate a new derivative. However, -80 is most likely a very big step, so we want to multiply it by some small term to prevent us from taking a step so large that we overshoot our target. That multiplier is called the "learning rate"; let's choose 0.01.
-80 * 0.01 = -0.8. This is what we call the "step size." We then subtract the step size from our previous guess for the slope to get a new slope to check the derivative of.
1 - (step size) = 1 - (-0.8) = 1.8
Now we can plug our new slope into the derivative function, and we find the derivative to be -16. Multiplying by our learning rate gives a step size of -0.16, and subtracting that from our slope gives a new slope of 1.96. We repeat this process over and over until either we have taken a "max steps" number of steps (normally somewhere around 1000) or the magnitude of the derivative falls below some minimum value (let's choose 0.001). The corresponding slopes and derivatives for the rest of the calculations are listed below.
slope=1.0000, derivative=-80.0000
slope=1.8000, derivative=-16.0000
slope=1.9600, derivative=-3.2000
slope=1.9920, derivative=-0.6400
slope=1.9984, derivative=-0.1280
slope=1.9997, derivative=-0.0256
slope=1.9999, derivative=-0.0051
slope=2.0000, derivative=-0.0010
Slope = 1.99999744
This means that, given an intercept of 1, our line of best fit for the data would be f(x) = 1.99999744x + 1, or approximately f(x) = 2x + 1.
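The whole loop above fits comfortably in a few lines of Python. Here is a minimal sketch of the single-parameter version (the variable names are mine), using the same starting guess, learning rate of 0.01, max steps of 1000, and minimum derivative of 0.001:

```python
points = [(0, 2), (2, 5), (6, 13)]

def d_ssr_d_slope(slope, intercept=1):
    # Derivative of the sum of squared residuals with respect to the slope
    return sum(-2 * x * (y - (slope * x + intercept)) for x, y in points)

slope = 1.0           # random starting guess
learning_rate = 0.01
max_steps = 1000
min_derivative = 0.001

for step in range(max_steps):
    derivative = d_ssr_d_slope(slope)
    print(f"slope={slope:.4f}, derivative={derivative:.4f}")
    if abs(derivative) < min_derivative:
        break
    slope -= learning_rate * derivative   # subtract the step size

# slope ends up at roughly 1.99999744
```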
Now we can apply this same logic to figure out not just the slope, but the slope and the intercept together. We do this by taking the derivative of the loss with respect to the slope and the derivative with respect to the intercept, then adjusting both numbers according to their individual step sizes. The collection of derivatives with respect to multiple variables is called the gradient, and since we use the gradient to descend toward the lowest value of the loss, this is where the term gradient descent comes from. The two derivative formulas are below:
d/d intercept Sum of Squared Residuals = -2*(2 - (x*0 + y)) + -2*(5 - (x*2 + y)) + -2*(13 - (x*6 + y))
d/d slope Sum of Squared Residuals = -2*0*(2 - (x*0 + y)) + -2*2*(5 - (x*2 + y)) + -2*6*(13 - (x*6 + y))
Here x represents the slope and y represents the intercept. Using the same update procedure as before, but applied to both derivatives with the current values of x and y (slope and intercept), we can home in on the best slope and the best intercept for the line. The resulting slopes and intercepts are shown below:
slope=1.800, intercept=1.052
slope=1.952, intercept=1.077
slope=1.978, intercept=1.096
slope=1.980, intercept=1.113
slope=1.978, intercept=1.130
slope=1.975, intercept=1.146
slope=1.972, intercept=1.162
slope=1.968, intercept=1.177
slope=1.965, intercept=1.192
slope=1.962, intercept=1.206
slope=1.959, intercept=1.221
slope=1.957, intercept=1.234
slope=1.954, intercept=1.248
slope=1.951, intercept=1.261
slope=1.949, intercept=1.273
slope=1.946, intercept=1.285
slope=1.944, intercept=1.297
slope=1.941, intercept=1.309
slope=1.939, intercept=1.320
slope=1.937, intercept=1.331
slope=1.934, intercept=1.342
slope=1.932, intercept=1.352
slope=1.930, intercept=1.362
slope=1.928, intercept=1.372
slope=1.926, intercept=1.381
slope=1.924, intercept=1.391
slope=1.922, intercept=1.400
slope=1.921, intercept=1.408
slope=1.919, intercept=1.417
slope=1.917, intercept=1.425
slope=1.915, intercept=1.433
slope=1.914, intercept=1.441
slope=1.912, intercept=1.449
slope=1.911, intercept=1.456
slope=1.909, intercept=1.463
slope=1.908, intercept=1.470
slope=1.906, intercept=1.477
slope=1.905, intercept=1.483
slope=1.904, intercept=1.490
slope=1.902, intercept=1.496
slope=1.901, intercept=1.502
slope=1.900, intercept=1.508
slope=1.899, intercept=1.514
slope=1.898, intercept=1.519
slope=1.896, intercept=1.525
slope=1.895, intercept=1.530
slope=1.894, intercept=1.535
slope=1.893, intercept=1.540
slope=1.892, intercept=1.545
slope=1.891, intercept=1.550
slope=1.890, intercept=1.554
slope=1.889, intercept=1.559
slope=1.888, intercept=1.563
slope=1.888, intercept=1.567
slope=1.887, intercept=1.571
slope=1.886, intercept=1.575
slope=1.885, intercept=1.579
slope=1.884, intercept=1.583
slope=1.884, intercept=1.587
slope=1.883, intercept=1.590
slope=1.882, intercept=1.594
slope=1.881, intercept=1.597
... 200 steps later ...
slope=1.857, intercept=1.714
The final slope and intercept above give us the line of best fit:
f(x) = 1.857x + 1.714
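For completeness, here is a sketch of the two-parameter version in Python. In this sketch I update the slope and the intercept at the same time from the same residuals; the intermediate values you get can differ a bit from the table above depending on the exact update order, but either way it settles on roughly the same line:

```python
points = [(0, 2), (2, 5), (6, 13)]

def gradients(slope, intercept):
    # Partial derivatives of the sum of squared residuals
    d_slope = sum(-2 * x * (y - (slope * x + intercept)) for x, y in points)
    d_intercept = sum(-2 * (y - (slope * x + intercept)) for x, y in points)
    return d_slope, d_intercept

slope, intercept = 1.0, 1.0   # starting guesses
learning_rate = 0.01
max_steps = 1000
min_derivative = 0.001

for step in range(max_steps):
    d_slope, d_intercept = gradients(slope, intercept)
    if max(abs(d_slope), abs(d_intercept)) < min_derivative:
        break
    slope -= learning_rate * d_slope
    intercept -= learning_rate * d_intercept

print(f"slope={slope:.3f}, intercept={intercept:.3f}")  # roughly slope=1.857, intercept=1.714
```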
This is gradient descent, and it can be used not just on linear equations but on polynomial ones too. One thing to note: doing all of this math for every data point can become cumbersome even for computers. With just three data points it wasn't too bad, but when you have millions of data points it can become close to impossible. The way we work around this is with a method known as Stochastic Gradient Descent, which just means that we take a random subset of the data and calculate the loss (and its gradient) based solely on that subset. Overall, this algorithm reaches a solution far faster than plug and chug, and it is the foundation on which neural networks are trained.
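To make the stochastic idea concrete, here is a rough sketch of what a single update step might look like if we only sample a random subset of the data each time (the function name and the batch_size parameter are hypothetical, not from any particular library):

```python
import random

def sgd_step(points, slope, intercept, learning_rate=0.01, batch_size=32):
    # Estimate the gradient from a random subset of the data instead of all of it
    batch = random.sample(points, min(batch_size, len(points)))
    d_slope = sum(-2 * x * (y - (slope * x + intercept)) for x, y in batch)
    d_intercept = sum(-2 * (y - (slope * x + intercept)) for x, y in batch)
    slope -= learning_rate * d_slope
    intercept -= learning_rate * d_intercept
    return slope, intercept
```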