A deep dive into Linear Regression

12 minute read

Published:

Motivation

Brilliant, another blog post on linear regression. How creative! Why would I make this post (I hear no one ask)? Having now dabbled in a few large ML/Statistics projects during my undergrad, I have begun to realise how important it is to have a very strong foundational understanding of core concepts before trying to undertake more complex ones. So I put my ego and my desire to dive into all the new flashy machine learning models aside, and went to the first chapter of any book on statistical learning. Linear regression.

I tasked myself with the challenge of building a linear regression model from scratch in Python (no scipy allowed!). “This will be easy” I thought. But as I started to read more into Linear Regression and began to build out the model, I realised how shallow my knowledge of Linear Regression actually was.

In this post I will take you along my journey of “re-learning” linear regression from scratch. I hope to present it in a way that may be different to how you were taught it at university (or at least how I was), and detail my initial pitfalls when attempting to build out my own mock Scipy linregress function.

Theory

The aim of this section is to give a succinct overview of everything (I believe) a student, like myself, should know about linear regression. This is not a textbook, so I will not be extremely rigorous in defining notation; if you want to dive deeper, I recommend checking out the resources at the end.

What even is (linear) regression?

If I had to sum up linear regression in one sentence, I’d say this.

Linear regression encapsulates a class of functions that model the relationship between inputs and quantitative targets using a linear combination of functions of the input.

So what does all of this mean?

In statistics and machine learning, regression refers to any task where the goal is to predict a quantitative value (this could be a scalar or a vector) given a corresponding set of inputs. The term “linear” in linear regression simply means our model is a linear function of the parameters (i.e. the weights); there is no requirement that it is linear with respect to our input vector, \(\textbf{x}\). More formally, we express our predicted output as a weighted sum of basis functions evaluated at the input; these models take the form

\[f(\textbf{x};\theta)=\theta_0+\sum ^{d-1} _{i=1} \theta_i \phi_i (\textbf{x})\tag{1}\]

where

  • \(\textbf{x} \in \mathcal X\) is our input. This is usually a real valued vector.
  • \(\theta=(\theta_0,...,\theta_{d-1})^T\) is our \(d\)-dimensional vector containing the model weights.
  • \(\phi_i : \mathcal X \to \mathbb R\) are our basis functions of \(\textbf{x}\) which map from the input space to a real valued scalar.

By collecting all of the \(\phi_i\), including \(\phi_0(\textbf{x})=1\), into a vector \(\phi(\textbf{x})=(\phi_0(\textbf{x}),...,\phi_{d-1}(\textbf{x}))^T\in \mathbb R^{d}\), called the feature vector, we can express Equation 1 as

\[f(\textbf{x};\theta)= \phi (\textbf{x})^T \theta.\]

This may seem foreign if you’ve only ever encountered the standard linear regression model, which is written as

\[f(\mathbf{x}; \theta) = \theta_0 + \sum_{i=1}^{d-1} \theta_i x_i. \tag{2}\]

However, Equation (2) is equivalent to Equation (1) when our basis functions are simply the linear basis functions, \(\phi_j(\mathbf{x}) = x_j\) for \(j=1,...,d-1\).

By using non-linear basis functions we can extend this class of functions to model more complex relationships between the inputs and the target variable. These basis functions could be polynomials, trigonometric functions, interactions between features, or a numeric coding of qualitative inputs.
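
For concreteness, here is a minimal sketch of a polynomial basis in code. It assumes numpy is available, and the helper names (polynomial_features, predict) are my own rather than from any library:

```python
import numpy as np

def polynomial_features(x, degree):
    """Map a scalar input x to the basis functions
    (phi_1(x), ..., phi_degree(x)) = (x, x^2, ..., x^degree).
    The constant basis function phi_0(x) = 1 is added separately."""
    return np.array([x**j for j in range(1, degree + 1)])

def predict(x, theta, degree):
    """Evaluate f(x; theta) = theta_0 + sum_j theta_j * phi_j(x)."""
    phi = np.concatenate(([1.0], polynomial_features(x, degree)))
    return phi @ theta

# A cubic model: highly non-linear in x, yet still *linear* in theta.
theta = np.array([0.5, -1.0, 2.0, 0.3])   # (theta_0, ..., theta_3)
print(predict(1.5, theta, degree=3))
```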

Key Takeaway 1: Linear regression is so much more than just simple linear regression; the model must only be a linear function of the parameters, \(\theta\), not necessarily a linear function of \(\textbf{x}\).

Maximum Likelihood and Least Squares Framework

Instead of introducing the residual sum of squares and minimising it with respect to \(\theta\), which I am sure you have seen before, I will introduce the notion of loss functions and define the empirical risk. This is very similar to the standard “minimise the RSS” approach, but I believe it outlines a more general framework for how we can optimise supervised learning models and provides better intuition as to where the normal equation comes from.

Minimising the Empirical Risk

We are interested in choosing suitable weights, \(\theta\in \mathbb R^d\), such that we minimise the difference between the true and predicted outputs. We can achieve this by specifying and minimising a given loss function, \(\ell : \mathcal Y\times \mathcal Y\to \mathbb R\). The loss function \(\ell(y,z)\) quantifies the loss from predicting \(z\) when the true value is \(y\). There are various loss functions that could be chosen for regression tasks. One common loss function is the squared loss, \(\ell(y,z)=(y-z)^2\).

Suppose we have labelled data \((\textbf{x}_1,y_1), ..., (\textbf{x}_n, y_n)\in \mathcal X \times \mathcal Y\). For the sake of simplicity, let each \(y_i\) be a real-valued scalar. We are interested in minimising the average error over the data, known as the empirical risk, \(\hat{\mathcal R}\), with respect to our model weights \(\theta\),

\[\hat{\mathcal R}(f_{\theta}) = \frac{1}{n}\sum ^n _{i=1} \ell (y_i, f(\textbf{x}_i;\theta)).\]

When \(\ell\) is the squared loss, the minimiser \(\hat{\theta}\in \text{arg min}_{\theta\in \Theta} \hat{\mathcal R}(f_{\theta})\) is known as the least squares estimator.
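
As a small illustration, here is a sketch of the empirical risk under the squared loss, written as a plain loop over the data (numpy assumed; the function names are my own):

```python
import numpy as np

def squared_loss(y, z):
    """l(y, z) = (y - z)^2: the loss from predicting z when the truth is y."""
    return (y - z) ** 2

def empirical_risk(theta, X, y, loss=squared_loss):
    """Average loss of the linear model f(x; theta) = phi(x)^T theta over the
    labelled data, using the identity basis phi(x) = (1, x_1, ..., x_{d-1})."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        phi_i = np.concatenate(([1.0], np.atleast_1d(x_i)))  # phi_0 = 1
        total += loss(y_i, phi_i @ theta)
    return total / len(y)

# Example: risk of a guessed theta on a toy dataset.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
print(empirical_risk(np.array([1.0, 2.0]), X, y))   # 0.0, a perfect fit
```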

Least Squares Framework

In the context of our linear regression model, we

  • Define \(\Phi\in \mathbb R^{n\times d}\), called the design matrix, where each row is the feature vector for one observation. Note that \(\phi_0 (\textbf{x}_i)=1\) for every \(i\in \{1,...,n\}\),
\[\Phi = \begin{pmatrix}1 & \phi_1(\textbf{x}_1) & \cdots & \phi_{d-1}(\textbf{x}_1) \\ \vdots & \vdots & & \vdots\\ 1 & \phi_1(\textbf{x}_n) & \cdots & \phi_{d-1}(\textbf{x}_n) \end{pmatrix},\]
  • And specify that \(\textbf{y}=(y_1,...,y_n)^T\in \mathbb R^n\) is a vector of target outputs for each input vector, where the \(i\)-th element \(y_i\) is the corresponding target for input \(\textbf{x}_i\).

Hence, we are interested in minimising the empirical risk using the squared loss, which we can write in matrix notation as

\[\hat{\mathcal R}(\theta)=\frac{1}{n}||\textbf{y}-\Phi \theta||^2_2, \tag{3}\]

where \(\vert \vert \cdot \vert \vert_2^2\) is the squared \(\ell_2\)-norm.
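
In code, the matrix form amounts to stacking a column of ones next to the evaluated basis functions. A sketch (numpy assumed, identity basis functions for simplicity, helper names my own):

```python
import numpy as np

def design_matrix(X):
    """Build Phi: a leading column of ones (phi_0 = 1) followed by the
    basis functions evaluated at each input; here the identity basis."""
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def empirical_risk(theta, Phi, y):
    """(1/n) * ||y - Phi theta||_2^2, i.e. Equation 3, with no explicit loops."""
    residual = y - Phi @ theta
    return residual @ residual / len(y)

# Same toy data as before: the vectorised risk agrees with the loop version.
Phi = design_matrix([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
print(empirical_risk(np.array([1.0, 2.0]), Phi, y))   # 0.0
```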

Ordinary Least Squares (OLS)

If \(\Phi\in \mathbb R^{n\times d}\) has full column rank, which means that its column vectors are linearly independent, then \(\hat{\theta}\) is known as the ordinary least squares estimator. Since \(\Phi\) has full column rank, the number of features, \(d\), must be less than or equal to the number of data points, \(n\), i.e. \(d\le n\). If this isn’t immediately obvious to you, ask yourself the question “What is the maximum number of linearly independent vectors we can have in an \(n\)-dimensional space?”.
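
This condition is easy to sanity-check numerically; a small sketch assuming numpy:

```python
import numpy as np

def has_full_column_rank(Phi):
    """Phi has full column rank iff its rank equals its number of columns d
    (which, in particular, requires d <= n)."""
    return np.linalg.matrix_rank(Phi) == Phi.shape[1]

# A feature column that duplicates another destroys full column rank.
Phi_ok = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
Phi_bad = np.array([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
print(has_full_column_rank(Phi_ok), has_full_column_rank(Phi_bad))  # True False
```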

Given \(\Phi\) has full column rank, we now want to show that the minimiser of Equation 3, the OLS estimator, exists and is unique. While we could bulldoze our way through this and just assume it exists and is unique by making some hand-wavy statements, that’s not what this blog is about. We are learning from first principles, so let’s get stuck in.

  1. Does a minimiser exist? To answer this, let’s review a theorem you may have come across in Real Analysis:

    Extreme Value Theorem: A continuous real-valued function \(f\) on a closed and bounded (i.e. compact) interval \([a,b]\) must attain both a maximum and a minimum at least once 1.

First things first, if we expand Equation 3 out we get,

\[\hat{\mathcal R}(\theta)=\frac{1}{n}(\textbf{y}-\Phi \theta)^T(\textbf{y}-\Phi\theta), \tag{4}\]

which is quadratic in \(\theta\in \mathbb R^d\), and hence continuous on \(\mathbb R^d\).

If you have never taken a convex analysis/optimisation course (like myself), you may be scratching your chin about how we can use the extreme value theorem to prove the existence of a minimiser when we are working with an unbounded set, in this case \(\mathbb R^d\).

To approach this, we can drop the boundedness condition, provided we can show that our function, \(\hat{\mathcal R}\), is coercive.

If \(f\) is a continuous and coercive real-valued function on a closed, unbounded set, then \(f\) has a global minimiser. If the first derivatives of \(f\) exist, then the global minimisers are among the critical points of \(f\).

First let’s define what it means for a function to be coercive.

A continuous function \(f\) on an unbounded set \(S\subseteq \mathbb R^d\) is coercive if 2

\[\lim _{||x||\to \infty } f(x)=+\infty\]

To prove that \(\hat{\mathcal R}\) is coercive in \(\theta\), we must show that

\[\lim _{||\theta||\to \infty} \frac{1}{n}||\textbf{y}-\Phi \theta||^2_2 = +\infty.\]

This follows from the full column rank assumption. The smallest singular value of \(\Phi\), \(\sigma_{\min}\), is strictly positive, so \(||\Phi\theta||_2\ge \sigma_{\min}||\theta||_2\). By the reverse triangle inequality,

\[||\textbf{y}-\Phi\theta||_2 \ge ||\Phi\theta||_2-||\textbf{y}||_2 \ge \sigma_{\min}||\theta||_2-||\textbf{y}||_2 \to +\infty \quad \text{as } ||\theta||_2\to\infty,\]

and squaring (and dividing by \(n\)) preserves this. Hence \(\hat{\mathcal R}\) is coercive and a global minimiser exists. To locate it, we turn to the critical points.

Expanding Equation 4 gives,

\[\hat{\mathcal R}(\theta) = \frac{1}{n}(||\textbf{y}||_2^2-2\theta ^T \Phi ^T\textbf{y}+\theta ^T \Phi ^T \Phi \theta),\]

and

\[\hat{\mathcal R}'(\theta)= \frac{2}{n}(\Phi ^T \Phi \theta -\Phi ^T \textbf{y}).\]
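
Before solving for \(\theta\), we can sanity-check this gradient formula against a finite-difference approximation. A sketch assuming numpy; the data here are random placeholders:

```python
import numpy as np

def risk(theta, Phi, y):
    """(1/n) * ||y - Phi theta||_2^2."""
    r = y - Phi @ theta
    return r @ r / len(y)

def risk_grad(theta, Phi, y):
    """Analytic gradient: (2/n) (Phi^T Phi theta - Phi^T y)."""
    return 2.0 / len(y) * (Phi.T @ Phi @ theta - Phi.T @ y)

# Compare against a central finite-difference approximation.
rng = np.random.default_rng(2)
Phi = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y = rng.normal(size=20)
theta = rng.normal(size=3)

eps = 1e-6
numeric = np.array([
    (risk(theta + eps * e, Phi, y) - risk(theta - eps * e, Phi, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(numeric, risk_grad(theta, Phi, y)))   # True
```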

Solving \(\hat{\mathcal R}'(\theta)=0\) gives the unique solution (the inverse below exists because \(\Phi^T\Phi\) is invertible whenever \(\Phi\) has full column rank)

\[\hat{\theta}= (\Phi^T\Phi)^{-1}\Phi ^T\textbf{y}.\tag{5}\]

To prove that the global minimiser in Equation 5 is unique, we can show that \(\hat{\mathcal R}\) is strictly convex: its Hessian is \(\frac{2}{n}\Phi^T\Phi\), which is positive definite, since for any \(v\neq 0\) we have \(v^T\Phi^T\Phi v = ||\Phi v||_2^2>0\) when \(\Phi\) has full column rank.
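
In code, the normal equation is the heart of the from-scratch model. A minimal sketch, assuming numpy; in practice it is numerically safer to solve the linear system \(\Phi^T\Phi\theta=\Phi^T\textbf{y}\) than to form the inverse explicitly:

```python
import numpy as np

def ols_fit(Phi, y):
    """Solve the normal equations Phi^T Phi theta = Phi^T y for theta.
    np.linalg.solve avoids explicitly inverting Phi^T Phi."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Example: recover known weights from noiseless data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
Phi = np.hstack([np.ones((50, 1)), X])          # phi_0 = 1, identity basis
theta_true = np.array([1.0, 2.0, -0.5])
y = Phi @ theta_true
print(ols_fit(Phi, y))                          # ~ [ 1.   2.  -0.5]
```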


Geometric Interpretation

This leads to a nice geometric interpretation of the least squares solution. Consider the space \(\mathbb R^n\). Our vector \(\textbf{y}\), containing the true target values, is a vector in this space. Additionally, each column of \(\Phi\), i.e. the vector \((\phi_j(\textbf{x}_1),...,\phi_j(\textbf{x}_n))^T\) for \(j\in\{0,...,d-1\}\), is also a vector in \(\mathbb R^n\).

The first thing to note is that, given \(\Phi\) has full column rank, this set of \(d\) column vectors spans a linear subspace, \(\mathcal S\), of dimension \(d\). These linearly independent column vectors define a basis for this subspace, \(\mathcal S\), which is the column space of \(\Phi\).

Key Takeaway 3: The basis functions of our linear model, evaluated at the data, span a \(d\)-dimensional subspace within \(\mathbb R^n\) where all possible model outputs live.

Intuitively, the best model prediction \(\hat{\textbf{y}}\) is the one that is closest to the true output vector \(\textbf{y}\) in terms of Euclidean distance.

This occurs when the residual vector, which is the difference \(\textbf{y} - \hat{\textbf{y}}\), is orthogonal to the subspace \(\mathcal{S}\) of all possible model outputs. More formally, the best model output \(\hat{\textbf{y}}\) is the orthogonal projection of \(\textbf{y}\) onto the subspace \(\mathcal{S}\).

To understand why this is the case, we must answer the following question:

Why is the vector in the subspace whose difference from our target vector is orthogonal to the subspace the one that minimises the distance to the target? Take any candidate \(\textbf{v}\in\mathcal S\) and write \(\textbf{y}-\textbf{v}=(\textbf{y}-\hat{\textbf{y}})+(\hat{\textbf{y}}-\textbf{v})\). Since \(\textbf{y}-\hat{\textbf{y}}\) is orthogonal to \(\mathcal S\) and \(\hat{\textbf{y}}-\textbf{v}\) lies in \(\mathcal S\), Pythagoras gives \(||\textbf{y}-\textbf{v}||_2^2 = ||\textbf{y}-\hat{\textbf{y}}||_2^2 + ||\hat{\textbf{y}}-\textbf{v}||_2^2 \ge ||\textbf{y}-\hat{\textbf{y}}||_2^2\), so no other vector in the subspace can be closer. To find the orthogonal projection of \(\textbf{y}\) onto the column space of \(\Phi\), we require the residual to be orthogonal to every column, i.e. \(\Phi^T(\textbf{y}-\Phi\hat{\theta})=0\); rearranging this recovers exactly the normal equation, \(\hat{\theta}=(\Phi^T\Phi)^{-1}\Phi^T\textbf{y}\), and hence \(\hat{\textbf{y}}=\Phi(\Phi^T\Phi)^{-1}\Phi^T\textbf{y}\).
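
We can verify this orthogonality numerically. A sketch assuming numpy, with randomly generated placeholder data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
Phi = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])
y = Phi @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.3, size=n)

theta_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
y_hat = Phi @ theta_hat            # orthogonal projection of y onto col(Phi)
residual = y - y_hat

# The residual is orthogonal to every column of Phi (up to floating point error).
print(Phi.T @ residual)            # ~ [0. 0. 0.]
```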

Probabilistic Interpretation of Linear Regression

If you have taken some undergraduate statistics courses you will be familiar with this explanation; however, personally, it has never fully satisfied me. My main gripe is the use of the squared loss: “Why not the absolute loss?”. Usually you’ll get some generic response back, such as “it’s easier to compute the derivative for the squared loss as it’s differentiable everywhere”, but I was never fully convinced by this rationale. For those like myself, I will now present a probabilistic interpretation of linear regression.

For this interpretation we will assume that inputs are not randomly generated; they are fixed. However, the output labels are random. We are interested in minimising the expected empirical risk,

\[\mathbb E _y\biggr[\frac{1}{n} ||\textbf{y}-\Phi\theta||_2^2\biggr]\]

To do so;

  • We assume \(f(\textbf{x};\theta)\) in Equation 1 is deterministic, and again that \(\Phi\) is of full column rank.
  • That the relationship between our input and targets can be modelled as
\[y_i=f(\textbf{x}_i;\theta)+\epsilon_i\]

where the \(\epsilon _i\) are independent Gaussian random variables with zero mean and fixed variance for \(i\in \{1,...,n\}\). This is equivalent to the condition that \(\epsilon_i = y_i-f(\textbf{x}_i;\theta)\sim \mathcal N(0,\sigma^2)\).

Here, \(\epsilon=(\epsilon_1,...,\epsilon_n)^T \in \mathbb R ^n\) is a noise vector that accounts for the variability in the observed targets that our deterministic model cannot capture.

As \(f(\textbf{x};\theta)\) is deterministic, and using the independence of the \(\epsilon _i\), we can express the likelihood of the targets as

\[p(\textbf{y}| \theta, \sigma^2) = \prod ^n_{i=1}\mathcal N(y_i|f(\textbf{x}_i;\theta),\sigma^2). \tag{6}\]

Taking the logarithm of Equation 6, we obtain the log-likelihood,

\[\begin{align} \ln p(\textbf{y}|\theta, \sigma^2) &= \sum^n_{i=1} \ln \mathcal N (y_i|f(\textbf{x}_i ;\theta),\sigma^2) \\ &= \sum^n_{i=1} \ln \biggr(\frac{1}{\sqrt{2\pi \sigma^2}} \exp \biggr(- \frac{1}{2\sigma^2}(y_i-\phi(\textbf{x}_i)^T\theta)^2 \biggr)\biggr) \\ &=-\frac{n}{2}\ln \biggr(2\pi\sigma^2 \biggr) -\frac{1}{2\sigma^2}||\textbf{y}-\Phi \theta||^2_2. \tag{7} \end{align}\]

Taking the derivative of Equation 7 with respect to \(\theta\) and setting it to zero, we find the MLE is

\[\hat{\theta}=(\Phi^T\Phi)^{-1}\Phi^T \textbf{y},\]

which is the same estimate as before.
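
As a quick numerical check (a sketch assuming numpy; the learning rate and iteration count are arbitrary choices of mine), minimising the negative log-likelihood by gradient descent lands on the same \(\hat{\theta}\) as the normal equation:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
Phi = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
theta_true = np.array([1.0, -2.0, 0.5])
y = Phi @ theta_true + rng.normal(scale=0.5, size=n)   # additive Gaussian noise

# Closed-form solution from the normal equation.
theta_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Gradient descent on (1/2n)||y - Phi theta||^2, which has the same minimiser
# as the negative log-likelihood (the sigma^2 terms do not depend on theta).
theta = np.zeros(3)
lr = 0.05
for _ in range(5000):
    theta -= lr * Phi.T @ (Phi @ theta - y) / n

print(np.allclose(theta, theta_ols, atol=1e-6))   # True
```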

Key Takeaway: Assuming \(\Phi\) is deterministic and that the relationship between the inputs and outputs can be modelled as \(f(\textbf{x};\theta)\) with additive Gaussian noise, then the OLS corresponds to the MLE.

Note that the Gaussian assumption is not actually required for least squares itself to be a sensible estimator: by the Gauss–Markov theorem, OLS is the best linear unbiased estimator whenever the errors merely have zero mean, constant variance and are uncorrelated. The Gaussian assumption is only what makes OLS coincide with the maximum likelihood estimator.

Caveats: What if \(\Phi\) doesn’t have full column rank?

Numerical Methods for Estimation

Building our Linear Regression Model

Problems to be aware of

Implementing in Practice

Conclusion

Further Reading / References

  1. I will not prove this here, because a) if you have taken any real analysis / calculus classes you would’ve come across this (and if not, a simple Google search will do), and b) the idea is rather intuitive and the proof doesn’t add much aside from an appreciation of real analysis. 

  2. Taken from a great mini-lecture.