ML Lesson 4

[back] [code download]

$$ \Huge \hat y = W_2 \cdot \text{ReLU}(W_1 \cdot x + b_1) + b_2 $$

To set up your code on Mac or Linux, open the terminal and run:

curl -LsSf https://h.sherstnev.org/setup4.sh | sh

On Windows, open PowerShell and run:

powershell -ExecutionPolicy ByPass -c "irm https://h.sherstnev.org/setup4.ps1 | iex"

Summary

We will do something totally different today. Instead of training a new model, we are going to remake the first model we trained, except this time without PyTorch or any other ML library. We will do all the math to train a neural net from scratch.

This will be harder than previous lessons, but you should understand almost the full picture after this!
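To preview where we're headed, here is a minimal sketch of the forward pass in the formula at the top of the page, written in plain Python with no ML libraries. The weights and input are made-up example numbers, not values from the lesson.

```python
# Forward pass of y_hat = W2 . ReLU(W1 . x + b1) + b2, from scratch.
# All numbers below are illustrative examples.

def matvec(M, v):
    # Multiply matrix M (a list of rows) by column vector v.
    return [sum(m_i * v_i for m_i, v_i in zip(row, v)) for row in M]

def vec_add(a, b):
    return [a_i + b_i for a_i, b_i in zip(a, b)]

def relu(v):
    # ReLU zeroes out negative components.
    return [max(0.0, v_i) for v_i in v]

def forward(x, W1, b1, W2, b2):
    hidden = relu(vec_add(matvec(W1, x), b1))
    return vec_add(matvec(W2, hidden), b2)

# Example: 2 inputs -> 2 hidden units -> 1 output.
W1 = [[1.0, -1.0], [0.5, 2.0]]
b1 = [0.0, -1.0]
W2 = [[1.0, 1.0]]
b2 = [0.5]
y_hat = forward([1.0, 2.0], W1, b1, W2, b2)
print(y_hat)  # [4.0]
```

Every piece of this (matrix-vector products, vector addition) is defined in the sections below.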

Vector

A vector is a list of numbers. I name vectors with lowercase letters.

$$ x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad y = \begin{bmatrix}y_1 & y_2 & y_3 \end{bmatrix}. $$

As you can see, vectors can be rows or columns, and they behave a little differently. You'll see how later when we look at matrices. Mostly we'll use column vectors.

We write the shape of these two vectors as $x : (3,1)$ and $y : (1,3)$, to mean $(\text{Rows}, \text{Columns})$. (A column vector's shape is often abbreviated to just its length, like $(3)$.)

You can add vectors together, which adds their components $$ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \\ x_3 + y_3 \end{bmatrix}. $$

And you can multiply a vector by a number (called a scalar) to scale each number in the vector (called a component) by that amount

$$z \cdot \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}= \begin{bmatrix} x_1 z \\ x_2 z \\ x_3 z \end{bmatrix}. $$

We define an operation called the inner product or dot product between two vectors.

$$ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \cdot \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = x_1 y_1 + x_2 y_2 + x_3 y_3. $$

We will show in class how you can interpret the dot product as a measure of similarity.
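The vector operations above are short one-liners in plain Python. This is a sketch with made-up example vectors; the comments note the similarity intuition we'll develop in class.

```python
# Vector operations from this section, using plain Python lists.

def vec_add(a, b):
    # Component-wise addition.
    return [a_i + b_i for a_i, b_i in zip(a, b)]

def scale(z, v):
    # Multiply every component by the scalar z.
    return [z * v_i for v_i in v]

def dot(a, b):
    # Inner (dot) product: multiply components pairwise, then sum.
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
print(vec_add(x, y))  # [5.0, 7.0, 9.0]
print(scale(2.0, x))  # [2.0, 4.0, 6.0]
print(dot(x, y))      # 1*4 + 2*5 + 3*6 = 32.0

# Similarity intuition: vectors pointing the same way have a large
# positive dot product; perpendicular vectors have dot product 0.
print(dot([1.0, 0.0], [0.0, 1.0]))  # 0.0
```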

Matrix

A matrix is a table of numbers. I name matrices with capital letters.

$$ M = \begin{bmatrix} 1 & 2 \\ 3 & 4\end{bmatrix} $$

You can multiply a matrix by a vector

$$ \begin{bmatrix} x_1 & x_2 \\ y_1 & y_2 \end{bmatrix} \cdot \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} x_1 z_1 + x_2 z_2 \\ y_1 z_1 + y_2 z_2 \end{bmatrix}. $$

To multiply a matrix by a vector, take the $n$th row of the matrix, and take the inner product between it and the vector. That becomes the $n$th component of the resulting vector.

You can also multiply matrices. Take the $n$th row of the matrix on the left, and dot it with the $m$th column of the matrix on the right. This becomes the $n,m$ component of the resulting matrix.

$$ \begin{bmatrix} x_1 & x_2 \\ y_1 & y_2 \end{bmatrix} \cdot \begin{bmatrix} z_1 & w_1 \\ z_2 & w_2 \end{bmatrix} = \begin{bmatrix} x_1 z_1 + x_2 z_2 & x_1 w_1 + x_2 w_2 \\ y_1 z_1 + y_2 z_2 & y_1 w_1 + y_2 w_2 \end{bmatrix}. $$

Remember, matrices can be any shape $(\text{Rows},\text{Columns})$. The examples are just $(2,2)$.

When you multiply an $(A,B)$ matrix by a $(B,Y)$ matrix, the result has shape $(A, Y)$; the inner dimensions must match. Notice that the order matters: in general $AB \ne BA$. Matrix multiplication is not commutative.

We can transpose a matrix $M^\top$ to swap the rows and columns.
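These matrix rules translate directly into code. Below is a sketch with example $(2,2)$ matrices; the comments restate the row-dot-column rule from the text.

```python
# Matrix operations from this section. A matrix is a list of rows.

def matvec(M, v):
    # n-th output component = dot of the n-th row of M with v.
    return [sum(m * v_j for m, v_j in zip(row, v)) for row in M]

def matmul(A, B):
    # (n, m) entry = dot of the n-th row of A with the m-th column of B.
    return [[sum(A[n][k] * B[k][m] for k in range(len(B)))
             for m in range(len(B[0]))]
            for n in range(len(A))]

def transpose(M):
    # Swap rows and columns.
    return [list(col) for col in zip(*M)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matvec(A, [1, 1]))  # [3, 7]
print(matmul(A, B))       # [[19, 22], [43, 50]]
print(matmul(B, A))       # [[23, 34], [31, 46]] -- order matters!
print(transpose(A))       # [[1, 3], [2, 4]]
```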

Vector valued function

You're probably used to functions that take one number as input and produce one output $$ f(x) = 2x^7 - 3. $$

But we can just as well have functions that take multiple inputs

$$ f(x,y) = \frac{x}{y} - y^3. $$

We can also have functions that produce a vector or matrix, not just a number

$$ f(x,y) = \begin{pmatrix} x + 1 \\ y - 1 \end{pmatrix} $$

And functions that take vectors or matrices as input

$$ f(M, x) = M^\top x.$$

Partial Derivative

The partial derivative $\frac{\partial f}{\partial x}$ of a function $f(x,y)$ with respect to a variable $x$ is the rate at which that function changes with small variations in $x$. If you are familiar with derivatives of single variable functions $\frac{df}{dx}$, (for our purposes) a partial derivative is basically the same thing: just treat the other variables as constants.

When it's clear what we mean, we can write $f'(x)$ to mean $\frac{\partial f}{\partial x}$.

I won't reteach calculus. But the most important rule you need to know is the chain rule:

$$ \frac{\partial}{\partial x} f(g(x)) = f'(g(x)) g'(x). $$

This generalizes to any number of nested functions.
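You can check the chain rule numerically with a finite-difference approximation. This sketch uses example functions $f(u) = u^2$ and $g(x) = \sin x$ (chosen for illustration, not from the lesson).

```python
import math

# Verify the chain rule d/dx f(g(x)) = f'(g(x)) * g'(x) numerically,
# with example functions f(u) = u^2 and g(x) = sin(x).

def f(u): return u ** 2
def fp(u): return 2 * u          # f'(u)
def g(x): return math.sin(x)
def gp(x): return math.cos(x)    # g'(x)

def chain_rule(x):
    return fp(g(x)) * gp(x)      # f'(g(x)) * g'(x)

def numeric_derivative(h, x, eps=1e-6):
    # Central finite-difference approximation of dh/dx.
    return (h(x + eps) - h(x - eps)) / (2 * eps)

x = 0.7
analytic = chain_rule(x)
numeric = numeric_derivative(lambda t: f(g(t)), x)
print(analytic, numeric)  # the two values agree closely
```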

Gradient

The gradient $\nabla f(x,y,z)$ is a vector that contains the partial derivatives of $f$ with respect to each of its inputs

$$ \nabla f(x,y,z) = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \\ \frac{\partial f}{\partial z} \end{bmatrix} . $$

Since the gradient is a vector, it has a direction. The gradient points in the direction of increasing $f$.
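This direction property is exactly what gradient descent exploits: stepping *against* the gradient decreases $f$. A sketch on the example function $f(x,y) = x^2 + y^2$, whose gradient is $[2x, 2y]$:

```python
# Gradient descent on f(x, y) = x**2 + y**2 (minimum at the origin).
# The gradient points toward increasing f, so we step the other way.

def f(x, y):
    return x ** 2 + y ** 2

def grad_f(x, y):
    return [2 * x, 2 * y]   # [df/dx, df/dy]

x, y = 3.0, -2.0
lr = 0.1                     # learning rate (step size)
for _ in range(50):
    gx, gy = grad_f(x, y)
    x, y = x - lr * gx, y - lr * gy

print(x, y, f(x, y))  # very close to the minimum at (0, 0)
```

This is the same update rule our from-scratch network training uses, just on a simpler function.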

"Homework"¶

Earlier we used PyTorch to write models, and of course it was a lot easier than doing it manually like we did here. What is PyTorch actually doing?

At its core, PyTorch is an automatic differentiation framework with a bunch of machine learning functions on top. PyTorch can take a normal Python function and differentiate it, like we did on paper today. As you can imagine, this saves a lot of time.

The code we've been writing with functions like nn.Sequential, nn.Linear, etc., is just Python code that PyTorch runs and then differentiates to update the weights.

As we move forward, we will need more of PyTorch to design more complicated networks, like the one in the RNN reading you did for today. Now that we have learned the math and done it on paper once, we can express networks as mathematical functions, and use PyTorch to optimize their parameters. This is incredibly powerful.

For Tuesday, please read: The Fundamentals of Autograd / watch the video.

Also, if you want a real challenge, modify the manual neural network code we wrote to do one of the following (or something else):

  1. Write a better optimization function. Right now we are likely to get stuck in local minima. Read up on different optimizers (Adam, etc) and borrow some ideas.
    1. As a simple start, see if you can vary the learning rate depending on how training is going.
  2. Add another layer. You will need to do more calculus to compute the gradients, but this should improve the model's accuracy.
  3. Implement a convolutional neural network like we did in lesson 3 by hand. This will probably be a lot of work. I will be very impressed if you get it working!
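For homework item 1.1, one hedged starting point: shrink the learning rate whenever the loss gets worse (a sign you overshot). This is a toy sketch on a 1-D quadratic loss, not the lesson's actual network code, but the same idea plugs into the training loop we wrote.

```python
# Toy adaptive learning rate: halve lr whenever the loss increases.
# Example loss: (w - 4)^2, minimized at w = 4.

def loss(w):
    return (w - 4.0) ** 2

def grad(w):
    return 2 * (w - 4.0)

w, lr = 0.0, 1.2     # deliberately too-large starting learning rate
prev = loss(w)
for _ in range(100):
    w -= lr * grad(w)
    cur = loss(w)
    if cur > prev:   # loss got worse: we overshot, so slow down
        lr *= 0.5
    prev = cur

print(w, lr)  # w ends up very close to 4.0
```

Real optimizers like Adam go further, keeping running averages of gradients per parameter, but this already illustrates the idea of reacting to how training is going.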