Part 1: Ugly math, ugly code
As I embarked on my journey to learn the math side of machine learning, all of the blog posts seemed to point to linear algebra as the starting point. The problem was, nothing I read made it immediately clear how linear algebra played a role in machine learning.
Worse yet, the whole discussion of vectors, matrices, linear combinations, and linear transformations seemed completely disconnected from my layman’s understanding of machine learning.
The Khan Academy videos on linear algebra are quite tedious and I didn’t feel I was getting anywhere. 3blue1brown’s Linear Algebra series is much more engaging and lucid, but I was still getting bogged down in the theory without seeing the application.
Needless to say, it was pretty slow going.
My linear algebra a-ha moment
It wasn’t until I switched gears and decided, on a whim, to tackle linear regression that linear algebra really started to click for me. On a practical level, code that uses linear-algebraic methods simplifies work for the computer by orders of magnitude, making it possible to process massive datasets—and process them rapidly. This is obviously a critical requirement in the age of Big Data.
And on a conceptual level, it gave me my first mathematical a-ha moment:
By manipulating matrices and vectors, you can achieve the same outcome as a loop without explicitly looping—a loopless loop.
That is why linear algebra is the cornerstone of machine learning.
The only thing is, no single resource I found on the internet seemed to really clarify the mathematical and programmatic aspects at the same time without getting too abstract on the math side of things.
As a self-taught and formerly math-phobic coder, I needed a guide that progressed from inelegant code (which I could understand) and inelegant math to (mind-blowingly) elegant math, which would then lay the groundwork for writing extremely elegant—and performant—code (which is incomprehensible without understanding the math).
This is that guide, created from my notes from Week 1 of my machine learning journey. I’ve split it up into multiple posts, since it’s quite long.
How to make a computer explode
I always thought that for-loops were simply a fact of life.
While I understood the basic principle behind stochastic gradient descent, I had never implemented it myself before. If I had tried my hand at it before learning the math involved, I probably would have come up with this monster:
# DON'T TRY THIS
samples = [ [ ... ] ]  # a list of lists: one row of feature values per sample
actual = [ ... ]

def train_lr_model(samples, actual, epochs, learning_rate):
    weights = [0 for feature in samples[0]] # 0 to start
    for _ in range(epochs):
        for j, weight in enumerate(weights):
            summation_part_of_cost = 0
            for i, sample in enumerate(samples):
                sum_of_sq_errs = (hypothesis(sample, weights) - actual[i]) ** 2
                summation_part_of_cost += sum_of_sq_errs * sample[j]
            partial_deriv = (1 / (2 * len(samples))) * summation_part_of_cost
            weights[j] -= learning_rate * partial_deriv
    return weights
# I REALLY HOPE YOU DIDN'T TRY THIS
Okay, maybe not. (It would be pretty hard to write this without understanding the math involved.)
Count those loops—three, to be exact! Terrifying. Now, being handy with Python, I could probably have calculated partial_deriv in one line with an even more terrifying list comprehension, just to show off:
partial_deriv = (1 / (2 * len(samples))) * sum([sample[j] * (hypothesis(sample, weights) - actual[i]) ** 2 for i, sample in enumerate(samples)])
…which shaves off four lines, at the expense of all readability. But bigger problems remain:
- The time complexity of this program is off the charts. In Big-O terms, the nested loops over epochs, weights, and samples put it at least at \(O(n^3)\), which is cubic time, and horrendously slow once the dataset gets big.
- It doesn’t even work. Even with a dataset of unremarkable size, you’re bound to get an overflow error (RuntimeWarning: overflow encountered in square) that can’t be avoided even if you set the learning_rate to an impractically small value like 0.00001. Trust me, I’ve tried.
Is there any way beyond this impasse?
Meeting a Zen master on the road
In Zen Buddhism, there is a famous book called The Gateless Gate, which is a collection of koans. A Zen koan is a riddle that cannot be approached with the rational mind. For example:
Goso said: “When you meet a Zen master on the road, you cannot talk to him, but neither can you face him with silence. What are you going to do?”
To solve it, you have to transcend the duality of this and not-this.
Give him an uppercut
And you will be called one who understands Zen.
—The Gateless Gate, Koan #36
As you might imagine, linear algebra—the loopless loop—is the Zen uppercut of coding. What, then, is the target of our uppercut?
Linear regression: A basic overview
Basically, linear regression is used to make predictions about a thing based on its characteristics, assuming
- that there is some correlation between those characteristics and the value you want to predict, and
- that you have plenty of data about other, similar things.
The intuition here can be explained with middle school algebra. Imagine you know the square footage and price of 200 houses and you want to estimate the price of a house with a given square footage.
Obviously, this is an oversimplified correlation for the sake of example. (And for some reason, everyone seems to explain this concept using houses, so why reinvent the wheel?)
If you were to make a scatter plot of that data, with the area along the x-axis and the price along the y-axis, the pattern might roughly look like it follows a line—not perfectly linear, but linear enough to predict the price of a house with \(x\) square footage. You can use linear regression to work backward from the data to determine this line, the line of best fit.
In middle school algebra, lines are written in the form

$$ y = \textcolor{magenta}{m}\textcolor{teal}{x} + \textcolor{orange}{b} $$
where \(\textcolor{teal}{x}\) is the input, \(\textcolor{magenta}{m}\) is the slope of the line, \(\textcolor{orange}{b}\) moves the line up and down on the graph, and \(y\) is the height of the line at point \(\textcolor{teal}{x}\).
Our house problem can also be framed as a line, where \(\textcolor{teal}{x}\) is the square footage, which influences the price by some value \(\textcolor{magenta}{m}\), to which we add some kind of base price to bring us to the final price, \(y\).
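To make this concrete, here is a minimal Python sketch. The slope and base price below are made-up numbers chosen purely for illustration, not values fitted to any real data:

# Hypothetical line of best fit: price = m * square_footage + b
m = 150     # assumption: every square foot adds $150 to the price
b = 20000   # assumption: a base price of $20,000

def predict_price(square_footage):
    return m * square_footage + b

predict_price(1200)  # 150 * 1200 + 20000 = 200000 dollars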
Well, that was easy enough, right?
Multiple linear regression: Because more is better (or something)
In the real world, of course, area is not the only factor that decides the price of a house. There are many others. Can we still adapt our middle school equation to this problem if each house has 3 features—say, area, nearby property values, and age of the building?
We can, but it’s messy. (It’s also no longer a line, but let’s ignore that for now.) First, let’s rewrite that “base price” as \(b \cdot 1\).

$$ y = \textcolor{magenta}{m_1}\textcolor{teal}{x_1} + \textcolor{magenta}{m_2}\textcolor{teal}{x_2} + \textcolor{magenta}{m_3}\textcolor{teal}{x_3} + \textcolor{orange}{b} \cdot 1 $$
This gives us a nice symmetry: Notice that all of the teal variables are features, which are multiplied by their degree of influence (called a coefficient in statistics, or a weight in machine learning). When you add all these together, you get the price of the house.
This is called multiple linear regression. Most people wouldn’t skip directly to multiple LR after introducing single LR, but single LR is pretty easy to digest if you can understand high school calculus (derivatives), so it didn’t level up my math knowledge.
Now, let’s code our equation, putting all the feature values into a list features and the weights into another list weights. In Python, in increasing order of elegance, we can write the following:
# Using completely random numbers just to show the code
features = [1, 3, 4, 5] # 3 features; first value is the “dummy feature”
weights = [0.6, 0.2, 0.1, 0.75]
# Brute-force addition
price = features[0] * weights[0] + features[1] * weights[1] + features[2] * weights[2] + features[3] * weights[3]
If we use \(x\)s to denote our features, \(θ\)s to denote our weights, and subscript numbers to denote the position of each item in a list (series), then we can rewrite our equation in a slightly more organized way:

$$ h_θ(x) = \textcolor{magenta}{θ_0}\textcolor{lightgray}{x_0} + \textcolor{magenta}{θ_1}\textcolor{teal}{x_1} + \cdots + \textcolor{magenta}{θ_n}\textcolor{teal}{x_n} $$
which just happens to be the generalized form of linear regression—general in the sense that it can accommodate any number of features, whether that’s 1 or 1,000.
Here, \(x\) is the collection of all feature values \(\textcolor{teal}{x_1}\) through \(\textcolor{teal}{x_n}\), where \(n\) is the number of features. (And \(x_{1000}\) actually isn’t too crazy in terms of real-world datasets!) It also includes the dummy feature \(\textcolor{lightgray}{x_0} = 1\). Likewise, \(θ\) is the collection of all weights, including \(\textcolor{magenta}{θ_0}\), the “base price” in our example. In machine learning, this is called the bias value.
Finally, the function notation \(h_θ(x)\) indicates that this is the hypothesis for item (house) \(x\) given the collection of weights \(θ\).
How to Python in math (Lesson 1)
If you know anything about programming, you know that the last line of code above is no way to write a program. Accommodating 100 features would be a chore, and accommodating a variable number of features would be impossible. Naturally, we would use the magic of iteration:
# Super-basic iteration over the lists
price = 0
for j in range(len(features)):
    price += features[j] * weights[j]
No one actually writes the linear regression formula like this, but if you wanted to, you could express the above code in math using a summation:

$$ h_θ(x) = \sum_{j=0}^{n} θ_j x_j $$
That Greek letter \(\sum\) (sigma) means summation. Basically, run a for-loop that adds up the result of the following expression for each feature, starting at \(j=0\) and ending at \(j=n\).
If you know Python better than you know math (as I did), then you might try further refactoring the code:
# Functional programming version
price = 0
for f, w in zip(features, weights):
    price += f * w

# More Pythonic version of the above
price = sum(f * w for f, w in zip(features, weights))
I didn’t realize this when I wrote the first draft of this post, but even expressing the simple Python function zip(x, y) in math requires linear algebra. I’ll get back to the non-fancy version of linear regression in just a minute, but for the sake of thoroughness, if \(x\) and \(θ\) are both vectors, then

$$ h_θ(\vec x) = \vec θ \cdot \vec x = \vec θ^T \vec x $$
Similarly, we can rewrite our equation as a function in Python, too. Let’s leave vectors aside for now, but keep the zip and encapsulate it into a function, because it’s clean and easy to understand.
def hypothesis(features, weights):
    return sum(f * w for f, w in zip(features, weights))
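As a quick preview of where all of this is headed: if you happen to have NumPy installed, the zip-multiply-and-sum pattern inside hypothesis is exactly what a dot product computes. This is just a sanity check, not something we need yet:

import numpy as np

# The hand-rolled hypothesis and NumPy's dot product agree
hypothesis(features, weights)                  # 0.6*1 + 0.2*3 + 0.1*4 + 0.75*5 = 5.35
np.dot(np.array(features), np.array(weights))  # also 5.35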
Einstein notation: Making math less confusing by making it more confusing
This is all great if there’s only one \(x\) (house). But we will need tons of houses to make a decent prediction. Our list houses needs to be changed into a list of lists. For the sake of example, if we had three houses and three features, houses would look like this (remember the dummy feature):
houses = [[1, 7, 4, 3],
          [1, 6, 7, 3],
          [1, 2, 9, 6]] # random values
This also allows us to refer to specific features of specific houses using two indexes, houses[i][j]. How do we do this in math, though? Enter Einstein notation.

$$ x^{\textcolor{red}{2}}_{\textcolor{teal}{2}} = \textrm{feature } 2 \textrm{ of house } 2 $$
The superscript numbers here aren’t exponents. You would think Einstein, of all people, could come up with something less confusing, but that is the convention, so it’s important to become familiar with it.
Just remember that in linear regression, there are conventional choices for the variable names. \(i\) denotes the \(i\)th house, and \(j\) the \(j\)th feature.
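One thing that kept tripping me up: the math convention counts houses starting from 1, while Python lists count from 0, so the sample index is off by one between the two worlds (the features are 0-indexed in both). Using the houses list above as a purely illustrative check:

# Einstein-style x with superscript 2 and subscript 2 means "feature 2 of house 2".
# Houses are 1-indexed in the math but 0-indexed in Python, so that's:
houses[1][2]  # -> 7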
So if we were to start describing each hypothesis for our dataset individually, we would write something like this:

hypothesis(houses[0], weights)
hypothesis(houses[1], weights)
hypothesis(houses[2], weights)

Or, more compactly:

hyps = [hypothesis(house, weights) for house in houses]
Cost function: How accurate are our predictions?
Seeing as the goal of linear regression is to come up with a line that best fits the data, we need some way to evaluate a line’s goodness of fit to the data. One measure of that is the mean squared error.
Error
Error is how far the prediction for one sample (house) is from its actual value.

$$ \textrm{error}_i = \hat Y_i - Y_i $$
Note: The subscript \(i\) here is not Einstein notation, because these are just lists of values, not a “spreadsheet” of rows and columns. The Einstein notation in this discussion of linear regression only applies to \(x\).
\(\hat Y\) is read “Y-hat,” which is just a statistical convention. It can be substituted with our function \(h_θ(x^i)\).
actual_values = [3, 4, 5] # 3 houses; random values
weights = [0.6, 0.2, 0.1, 0.75]

def error(i):
    return hypothesis(houses[i], weights) - actual_values[i]
Squared error
We square it so that (1) all values are positive, preventing underestimates and overestimates from canceling each other out; and (2) larger errors are considered proportionally “more erroneous” than smaller errors.
def se(i):
    return error(i) ** 2
Mean squared error
The mean value of the squared error for all samples can give us an idea about our line’s goodness of fit.
m = len(houses)

total_se = 0
for i in range(m):
    total_se += se(i)
mse = total_se / m

# More Pythonic and more similar to the actual math notation
mse = sum(se(i) for i in range(m)) / m
Note: MSE gives us an indication of goodness of fit, but it’s difficult to tie that value directly to the data. You can use the RMSE (root mean squared error), which is just the square root of the MSE, to reframe the average error in terms of the data. In this case, the RMSE would tell us how much (in dollars) our prediction line was off by.
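For completeness, the RMSE is a one-liner on top of the mse we just computed:

rmse = mse ** 0.5  # roughly the average prediction error, back in the original units (dollars)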
Let’s use the mean squared error to rewrite this equation as a function of the collection of weights. Every time we change the weights, we will obtain a different line with a different goodness of fit (MSE), and this relationship can be illustrated by a function called the cost function.
def cost(weights):
    return sum(se(i) for i in range(m)) / m
Well, almost. This function is conventionally named \(J\):

$$ J(θ) = \frac{1}{2m}\sum_{i=1}^m{\Big[h_θ(x^i) - Y_i\Big]}^2 $$
Where did that \(\textcolor{magenta}{2}\) come from? Again, this is just a matter of convention. The \(\textcolor{magenta}{2}\) will cancel out in the next step.
def cost(weights):
    return (1 / (2 * len(houses))) * sum(se(i) for i in range(len(houses)))
Keep in mind that \(θ\) and \(Y\) are lists and \(x\) is a list of lists. What this means is that in a situation with two features (plus the dummy feature),

$$ J(θ) = \frac{1}{2m}\Big[\big(θ_0 x^{\textcolor{red}{1}}_{\textcolor{teal}{0}} + θ_1 x^{\textcolor{red}{1}}_{\textcolor{teal}{1}} + θ_2 x^{\textcolor{red}{1}}_{\textcolor{teal}{2}} - Y_1\big)^2 + \cdots + \big(θ_0 x^{\textcolor{red}{m}}_{\textcolor{teal}{0}} + θ_1 x^{\textcolor{red}{m}}_{\textcolor{teal}{1}} + θ_2 x^{\textcolor{red}{m}}_{\textcolor{teal}{2}} - Y_m\big)^2\Big] $$
Numbers in teal represent feature numbers; numbers in red represent sample numbers.
Multivariable calculus: How much effect does each weight have?
Now, imagine each feature as knobs on a radio. Increasing or decreasing the weight of each feature is like turning up or down the knob for that feature. We want to “tune” our line to be as close to the data as possible by “dialing” the features up and down. In order to do this, we need to determine the effect that a given combination of knob settings has on the final output.
In math terms, this is akin to asking “How much does \(J\) change when \(θ\) changes?” Sounds like derivatives from high school calculus.
Our function \(J\) is actually one function inside of another, so the chain rule applies. Bonus points if you remember that from high school—I didn’t.
This is where I got stuck. \(θ\) is a collection of values, not just a single value. Each knob on our radio affects the output individually, and we have to determine the individual effect of each knob.
It helps to start by breaking down what the chain rule is actually saying.

$$ \frac{\partial J}{\partial θ} = \underbrace{\frac{\partial J}{\partial h}}_{d_{\textcolor{purple}{\textrm{outer}}}} \cdot \underbrace{\frac{\partial h}{\partial θ}}_{d_{\textcolor{orange}{\textrm{inner}}}} $$
This means our “outer derivative” \(d_\textcolor{purple}{\textrm{outer}}\) tells us how much our cost function \(J(θ)\) changes in response to a given change in our hypothesis \(h(x)\). We now need to find the “inner derivative” \(d_{\textcolor{orange}{\textrm{inner}}}\), which tells us how much our hypothesis \(h(x)\) changes in response to a given change in our weights \(θ\).
But since \(θ\) is a collection of values, there isn’t a single derivative, but rather several partial derivatives, which indicate how much our hypothesis \(h(x^i)\) for a specific sample (house) \(x^i\) changes in response to a given change in each of the weights \(θ_j\).
Note: Another labeling convention—just as \(i\) is used to refer to, or “index,” samples from \(1\) through \(m\), the total number of samples, lowercase \(j\) is used to index features from \(0\) through \(n\), the total number of features. To return to our Einstein notation,
$$ x^{1 \leq \space i^\textrm{th} \textrm{ sample} \space \leq \space m \textrm{ samples}}_{0 \space \leq \space j^\textrm{th} \textrm{ feature} \space \leq \space n \textrm{ features}} $$
In math notation, this is written with a funny “d” called a “del,” \(\partial\), like this:

$$ \frac{\partial h_θ(x^i)}{\partial θ_j} $$
This looks crazy, but the process of finding these partial derivatives is really the same as finding a normal derivative, except you treat all the other variables as constant, effectively ignoring them. So, for \(\frac{\partial h_θ(x^i)}{\partial θ_j}\), we now only have to concern ourselves with the single term that contains \(θ_j\), namely \(θ_j x^i_j\).
The derivative of a variable times something else is just the something else. (For a line \(y = 2x\), \(\frac{dy}{dx}\) is just \(2\), since its slope will be 2 at every point along the line.) Thus,

$$ \frac{\partial}{\partial θ_j}\big(θ_j x^i_j\big) = x^i_j $$
Or just

$$ \frac{\partial h_θ(x^i)}{\partial θ_j} = x^i_j $$
That’s \(d_{\textcolor{orange}{\textrm{inner}}}\).
Multiplying \(d_{\textcolor{purple}{\textrm{outer}}}\) and \(d_{\textcolor{orange}{\textrm{inner}}}\) together gives us the partial derivative of the cost function with respect to each weight (and this is where the \(\textcolor{magenta}{2}\) from earlier cancels out):

$$ \frac{\partial J(θ)}{\partial θ_j} = \frac{1}{m}\sum_{i=1}^m\Big[h_θ(x^i) - Y_i\Big]x_j^i $$

Phew! That was a lot of abstract math. Finally, we have something that can be translated into code.
def effect_of_weight(which_weight):
    return (1 / m) * sum(error(i) * houses[i][which_weight] for i in range(m))

effect_of_all_weights = [effect_of_weight(j) for j in range(len(weights))]
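If, like me, you don’t entirely trust calculus done by hand, you can spot-check these partial derivatives numerically: nudge one weight by a tiny amount, measure how much the cost changes, and compare. This is a throwaway sanity check built on the functions above; the function name and the 1e-6 nudge are my own arbitrary choices:

def numerical_effect_of_weight(j, nudge=1e-6):
    # error() reads the global weights list, so nudge it in place,
    # measure the change in cost, then put it back
    cost_before = cost(weights)
    weights[j] += nudge
    cost_after = cost(weights)
    weights[j] -= nudge
    return (cost_after - cost_before) / nudge

numerical_effect_of_weight(1)  # should roughly match effect_of_weight(1)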
Stochastic gradient descent: Making regression lines fit again
Strictly speaking, \(\frac{1}{m} \sum_{i=1}^m (h_θ(x^i) - Y_i) \cdot x_j^i\) is not a derivative but a gradient: a collection of partial derivatives, one for each weight \(θ_j\). In high school calculus, the derivative at a given point is visualized as the line that is tangent to the graph’s curve at that point. In multivariable calculus, the gradient at a given point is visualized as the plane that is tangent to the graph’s surface at that point.
In more concrete terms, imagine running a small piece of cardboard around the sides of a coffee mug so that the cardboard follows the curvature of the mug. Every point on the surface of the mug corresponds to some combination of weights, and the closer we are to the top of the mug, the greater the value of our cost function is, and so the more inaccurate our prediction is. We want to find the bottom of the mug, where the piece of cardboard is parallel to the ground, because that is where the value of the cost function is as low as possible.
When that value is zero, the line would fit our data perfectly. However, that’s not possible for real-world data, so we will settle for the lowest value—that is, we want to minimize the cost function.
In the language of math (and neural networks), this is called stochastic gradient descent.
- The gradient tells us, for the current weights, how the cost function changes; the cost function is the thing we’re actually trying to minimize.
- This process is stochastic (random) in the sense that we start from some arbitrary set of weights (here, all zeros), which puts us at an arbitrary point on the mug.
- It is a descent because we want to move down to a progressively flatter region of the mug with each attempt (combination of weights).
The descent occurs in “steps.” Imagine, for a moment, a basic parabola \(f(x) = x^2\) instead of a mug. The derivative at any point \(x\) is \(2x\). Positive \(x\) values give us positive derivatives and negative \(x\) values give us negative derivatives. If we started at some point to the right of 0 and wanted to follow the parabola to its trough, we could do that by subtracting something from \(x\). Likewise, if we started at some point to the left of 0, we’d want to add something to \(x\)—or rather, subtract a negative value.
This means that if we start at any point \(x\) and subtract a small fraction of \(\frac{dy}{dx}\), we will tend toward the trough. We don’t know exactly what our new \(x\) value will be, but subtracting a fraction of the new \(\frac{dy}{dx}\) takes us closer still, just by a smaller amount, because the derivative shrinks as we approach the trough. Each step brings us closer in progressively smaller increments. At some point, we reach convergence: a point that is close enough to the minimum.
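Here is that idea as a tiny standalone sketch for the parabola. The starting point and the step size are arbitrary choices of mine; the step size is the small multiplier that keeps the steps from overshooting, and it gets a proper name (the learning rate) a little later:

x = 4.0                  # arbitrary starting point to the right of the trough
step_size = 0.1          # arbitrary small fraction of the derivative to subtract
for _ in range(50):
    derivative = 2 * x   # dy/dx of f(x) = x ** 2
    x -= step_size * derivative
print(x)                 # ends up very close to 0, the bottom of the parabola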
The same applies to gradients. The gradient for any set of weights \(θ\) points in the direction that increases the cost, so we move in the opposite direction to find the bottom of the mug. That means that if we start with some initial collection of weights \(θ\) and keep subtracting a fraction of the gradient, which is notated \(\nabla J\), we should eventually arrive at the bottom.
But try translating this into Python.
def gradient_of_cost(my_weights):
    return [effect_of_weight(j) for j in range(len(my_weights))]

my_weights = [0, 0, 0, 0]
last_weights = [0, 0, 0, 0]
while True:
    my_weights -= gradient_of_cost(last_weights) # ???
    last_weights = my_weights
    if (convergence): # ???
        break
minimum = my_weights # ???
In doing so, a couple of questions arise (besides the suspicion that there are way too many loops and functions):
- How do you compare weights so that you can know which of two collections is “lesser” and which is “greater”?
- How do you know when you’ve reached convergence?
It’s clear that we have more or less come as far as we can with the level of math and coding that we have used so far.
Part 2: Crazy math, beautiful code
When you think about it, it almost seems a little backwards to call linear algebra the starting point of machine learning. After all, we’ve come this far without it. Multivariable calculus strikes me as more fundamental, although I suppose that it might be hard to imagine the output of a two-variable function as a bowl- or mug-shaped object without the concept of vectors.
In any case, it’s time for that Zen uppercut.
I won’t go into the mechanics of vector and matrix operations here; they are too tedious to write about, and I’m certainly not the best person to explain them. What I’m more interested in is the concept of vectorization: the “translation” (pun intended) of the algebra and calculus above into linear algebra and multivariable calculus, as well as what that looks like in Python (using NumPy).
Vectorize all the things!!!
import numpy as np
Ninja mode activated! (That was easy, eh?) First of all, let’s convert all lists to vectors and all lists of lists to matrices.
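For instance, the plain Python lists from Part 1 become NumPy objects with a single call. The values below are the same made-up numbers from earlier, shown only to make the idea concrete:

weights_vector = np.array([0.6, 0.2, 0.1, 0.75])  # a list becomes a vector
houses_matrix = np.array([[1, 7, 4, 3],
                          [1, 6, 7, 3],
                          [1, 2, 9, 6]])           # a list of lists becomes a matrix
houses_matrix @ weights_vector                     # all three hypotheses at once (@ is matrix multiplication)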
Let’s use scikit-learn to generate a suitable dataset for us. Since we won’t have to do any of the computations by hand, let’s go wild with the number of features and samples.
import sklearn.datasets

X, y = sklearn.datasets.make_regression(n_samples=500, n_features=5)
One important thing to note is that the dummy feature \(x_0\) needs to be added to the data. Also, scikit-learn generates y as a flat one-dimensional array rather than a proper column vector, so we need to reshape it. Once X has its extra column, we can create our vector of initial weights (all zeros), with as many rows as X has columns.
X = np.hstack((np.ones((X.shape[0], 1)), X)) # padding for bias column
y = np.vstack(y) # reshape sklearn's flat output into a column vector
weights = np.zeros((X.shape[1], 1)) # as many rows as X has columns, and 1 column
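It’s worth double-checking the shapes at this point, because if the dimensions don’t line up, every matrix multiplication downstream will fail. The numbers below assume the 500-sample, 5-feature dataset generated above:

X.shape        # (500, 6) -- 5 features plus the dummy column of ones
y.shape        # (500, 1) -- one actual value per sample, as a column vector
weights.shape  # (6, 1)   -- one weight per column of X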
The hypothesis function \(h_θ(x^i)\) can now be written succinctly as the product of the “houses” matrix \(\textbf X\) and the weights vector \(\vec θ\).
Here is the generalized form of linear regression before and after vectorization, followed by the vectorized NumPy version.

$$ h_θ(x^i) = θ_0x^i_0 + θ_1x^i_1 + \cdots + θ_nx^i_n $$

$$ h_θ(\textbf X) = \textbf X\vec θ $$
- Notice how the non-vectorized version inherently refers to only one sample at a time, with the superscript \(i\). This implies the equation is true for each \(i\), but the vectorized version automatically includes every sample at the same time without us even having to know how many there are. So let’s use the plural hypotheses to name the result.
- Look at how simple the math expression and the code become! (Of course, it does rely on a sufficient understanding of matrix multiplication.)
hypotheses = X @ weights # @ is short for matrix multiplication
Vectorizing the cost function: So fresh, so clean
This makes it easy to express the error as the difference between the hypothesis and the actual value:

$$ \vec e = h_θ(\textbf X) - \vec y = \textbf X\vec θ - \vec y $$
errors = hypotheses - y
Recall that the cost function involves the sum of squared errors. In linear algebra, summation can be expressed as the product of a transposed vector of ones and a vector with the values to be summed, which struck me as a very clever manipulation.
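If that trick sounds too good to be true, here is a three-number demonstration (the values are arbitrary):

values = np.array([[2.0], [3.0], [4.0]])  # a column vector of arbitrary values
o_demo = np.ones((3, 1))
o_demo.T @ values                         # [[9.]] -- a 1x1 matrix holding the sum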
The cost function then becomes:

$$ J(θ) = \frac{1}{2m}\vec o^T\vec e^2 $$
num_samples = X.shape[0]
o = np.ones((X.shape[0], 1))
cost = (o.T @ np.square(errors)) / (2 * num_samples)
Note that the NumPy function np.square(z) performs on par with z ** 2, even on a big array:
In [13]: rando = np.random.rand(500000,500) * 100 # just a random big dataset for testing
In [14]: %timeit np.square(rando)
1.2 s ± 7.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [15]: %timeit rando ** 2
1.19 s ± 7.01 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and o.T @ np.square(z) (that is, \(\vec o^T\textbf Z^2\)) blows the pure-Python sum(z ** 2) (that is, \(\sum_{i=1}^m (z_i)^2\)) out of the water, while matching NumPy’s own np.sum:
In [17]: %timeit o.T @ np.square(rando)
1.36 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [18]: %timeit np.sum(np.square(rando))
1.37 s ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: %timeit sum([sum(row) for row in rando]) # because rando has more than 1 column
25.1 s ± 603 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
At this point, it’s helpful to turn our cost function into a Python function that takes \(\textbf X, \vec y, \vec θ\) as its inputs. This will make it easier to evaluate our model later.
def cost(X, y, weights):
    num_samples = X.shape[0]
    o = np.ones((num_samples, 1))
    errors = X @ weights - y
    cost_array = (o.T @ np.square(errors)) / (2 * num_samples) # 1x1 matrix
    return cost_array[0][0] # plain number
Vectorizing the gradient: Ninja mode on overdrive
On to the gradient. This is where linear algebra really kicks this thing into high gear.
(This is also where I get to show off my LaTeX chops.) Written out term by term, the gradient is one partial derivative per weight:

$$ \nabla J = \begin{bmatrix} \frac{\partial J}{\partial θ_0} \\ \frac{\partial J}{\partial θ_1} \\ \vdots \\ \frac{\partial J}{\partial θ_n} \end{bmatrix} = \begin{bmatrix} \textcolor{teal}{\frac{1}{m}\sum_{i=1}^m e_i\, x^i_0} \\ \textcolor{teal}{\frac{1}{m}\sum_{i=1}^m e_i\, x^i_1} \\ \vdots \\ \textcolor{teal}{\frac{1}{m}\sum_{i=1}^m e_i\, x^i_n} \end{bmatrix} $$
Transposing \(\textbf X\) and multiplying it by \(\vec e\) gives us:

$$ \textbf X^T\vec e = \begin{bmatrix} x^1_0 & x^2_0 & \cdots & x^m_0 \\ x^1_1 & x^2_1 & \cdots & x^m_1 \\ \vdots & \vdots & \ddots & \vdots \\ x^1_n & x^2_n & \cdots & x^m_n \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_m \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^m e_i\, x^i_0 \\ \sum_{i=1}^m e_i\, x^i_1 \\ \vdots \\ \sum_{i=1}^m e_i\, x^i_n \end{bmatrix} $$
Notice how multiplying this result by \(\frac{1}{m}\) gives us a vector containing the same values highlighted above in teal.
Astonishingly, that gigantic mess can be expressed as

$$ \nabla J = \frac{1}{m}\textbf X^T\vec e $$

or, in NumPy:
num_samples = X.shape[0]
gradient = (X.T @ errors) / num_samples
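To convince myself that the matrix product really does hide the summation loop, I like to check it against a literal (and deliberately slow) Python translation of the formula. A quick sketch, assuming errors has already been computed as above; the function name is my own:

# One partial derivative at a time, the slow way
def slow_partial_derivative(j):
    total = 0
    for i in range(num_samples):
        total += errors[i][0] * X[i][j]
    return total / num_samples

slow_gradient = [slow_partial_derivative(j) for j in range(X.shape[1])]
# Each entry should match the corresponding row of the vectorized gradient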
Finally, we can work out the last function, the gradient descent function:

$$ \vec θ := \vec θ - α\nabla J $$
Hey, where’d that \(α\) come from? That’s the learning rate, a small number that adjusts the size of each training step. Too large and you jump right over the minimum; too small and you never reach the minimum.
For now, let’s choose an arbitrary value for \(α\) and disregard the whole bit about convergence. If we wanted to perform stochastic gradient descent with 100 steps, this is how we’d do it:
learning_rate = 0.01
num_samples = X.shape[0]
for i in range(100):
    errors = X @ weights - y
    step_distance = (X.T @ errors) / num_samples
    weights -= learning_rate * step_distance
Let’s refactor this as a function train_lr_model that takes \(\textbf X, \vec y\), and the number of steps (training epochs) as its inputs, and outputs the weights \(\vec θ\). Along the way, let’s have it tell us the cost. If all goes well, we should see that number approach zero as training progresses.
def train_lr_model(X, y, epochs, learning_rate=0.01):
    weights = np.zeros((X.shape[1], 1))
    num_samples = X.shape[0]
    for i in range(epochs):
        errors = X @ weights - y
        step_distance = (X.T @ errors) / num_samples
        weights -= learning_rate * step_distance
        print(cost(X, y, weights)) # cost() already unwraps NumPy's 1x1 matrix into a plain number
    return weights
weights_300_epochs = train_lr_model(X, y, 300)
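Out of curiosity, you can also watch the learning-rate warning from above play out. This is a throwaway experiment, not part of the model; with a deliberately huge learning rate the printed cost should blow up instead of shrinking:

train_lr_model(X, y, 10, learning_rate=5)     # cost grows explosively: we keep overshooting the minimum
train_lr_model(X, y, 10, learning_rate=0.01)  # cost shrinks steadily toward zero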
Now, we can predict the output for a random set of feature values.
def predict(X, weights):
    return X @ weights

mystery_input = np.random.rand(1, X.shape[1] - 1)
mystery_input = np.hstack((np.ones((1, 1)), mystery_input)) # dummy feature
predict(mystery_input, weights_300_epochs)
I like to geek out on math notation and wrote this function to generate LaTeX for the equation of the model:
def expand_model_latex(weights):
    terms = [f"{round(w[0], 2)}x_{{{i}}}" for i, w in enumerate(weights)]
    print(f"$$ h_θ(x) \\approx {' + '.join(terms)} $$")
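Calling it on the trained weights prints a line you can paste straight into a LaTeX renderer (your coefficients will differ, since make_regression produces random data on every run):

expand_model_latex(weights_300_epochs)
# prints a "$$ h_θ(x) \approx ... $$" line with one term per weight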
We’re not done with linear regression, but let’s recap so that this post can end.
Summary so far
| | Broke | Woke |
|---|---|---|
| Linear regression model | $$h_θ(x) = θ_0x_0 + θ_1x_1 + \cdots + θ_nx_n$$ | $$h_θ(\textbf X) = \textbf X\vec θ$$ |
| Cost function | $$J(θ) = \frac{1}{2m}\sum_{i=1}^m{\Big[h_θ(x^i) - y_i\Big]}^2$$ | $$\begin{gathered} \vec e = h_θ(\textbf X) - \vec y \\ J(θ) = \frac{1}{2m}\vec o^T\vec e^2 \end{gathered}$$ |
| Gradient of cost function | $$\frac{\partial J(θ)}{\partial θ_j} = \frac{1}{m}\sum_{i=1}^m\Big[h_θ(x^i) - y_i\Big]x_j^i$$ | $$\nabla J = \frac{1}{m}\textbf X^T\vec e$$ |
| Gradient descent function | $$θ_j := θ_j - α\frac{\partial J(θ)}{\partial θ_j}$$ | $$\vec θ := \vec θ - α\nabla J$$ |
All of that complex math boils down to a linear regression model trained in some 20 lines of Python!
import numpy as np
import sklearn.datasets

X, y = sklearn.datasets.make_regression(n_samples=500, n_features=5)
X = np.hstack((np.ones((X.shape[0], 1)), X)) # padding for bias column
y = np.vstack(y) # this just fixes a quirk of sklearn's output

def cost(X, y, weights):
    num_samples = X.shape[0]
    o = np.ones((num_samples, 1))
    errors = X @ weights - y
    cost_array = (o.T @ np.square(errors)) / (2 * num_samples)
    return cost_array[0][0]

def train_model(X, y, epochs, learning_rate=0.01):
    weights = np.zeros((X.shape[1], 1))
    num_samples = X.shape[0]
    for i in range(epochs):
        errors = X @ weights - y
        step_distance = (X.T @ errors) / num_samples
        weights -= learning_rate * step_distance
    print(f"Completed {epochs} epochs of training.")
    print(f"Final cost: {cost(X, y, weights)}")
    return weights

def predict(X, weights):
    return X @ weights
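And putting the recap to work end to end, under the same assumptions as above:

weights = train_model(X, y, 300)
predict(X[:3], weights)  # predictions for the first three samples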
Comparison of performance
This is a contrived comparison, but it helps to illustrate very clearly the point of jumping through all of these mathematical hoops.
On the 500-sample, 5-feature dataset we’ve been using, the vectorized gradient descent function runs more than 6,000 times faster than the terrible, monstrous, don’t-say-I-didn’t-warn-you procedural version from the beginning of this post.
In [14]: %timeit train_lr_model_procedurally(X, y, 200, 0.00001)
14.2 s ± 609 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: %timeit train_lr_model_vectorizedly(X, y, 200, 0.00001)
2.28 ms ± 64.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Remaining questions for future posts (links will be included upon publication)
- Are there any other simple techniques to speed this up? (Mean normalization)
- Are there any complex techniques to speed this up? (Automatic differentiation)
- How do we know when we’ve reached convergence? (Epsilon)
- How can linear (as opposed to logistic) regression be applied to language tasks? (General backpropagation of neural networks)
References
- The Essence of Linear Algebra (video series), Grant Sanderson
- Linear Regression using Python, Animesh Agarwal
- Linear Regression with Multiple Variables, Ritchie Ng
- Understanding and Calculating the Cost Function for Linear Regression, Lachlan Miller
- The Linear Regression Cost Function in Matrix Form, Anwar Ruff
- Multivariable chain rule, simple version, Khan Academy