
Regularized Regression: Controlling Overfitting

Primer: Pikachu's Failed Model

Previously, Pikachu built a model to avoid getting struck by lightning. However, while his model is great at predicting past data, every time Pikachu goes outside, he still gets struck! Despite wonderful training results from his optimized $\theta$ values, Pikachu realizes that his model is not very good at predicting new, unseen data. This is overfitting.

Background

Here, I will throw out some formulas with notes.

Formula	Term	Notes
$y - \mathbb{E}[\hat{y}]$	Bias	How far is actual $y$ from average predicted $y$?
$\frac{1}{N} \sum (\hat{y} - \mathbb{E}[\hat{y}])^2 = \mathbb{E}[(\hat{y} - \mathbb{E}[\hat{y}])^2]$	Variance	How far is our predicted value from the average predicted values? Does not rely on actual y. How flexible is our model?
$\text{Bias}^2 + \text{Variance} + \epsilon$	Error	Same as the expected MSE $\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ across multiple datasets. $\epsilon$ is the irreducible noise in the data, which no model can remove

Source: me

The important thing to note is that error depends on both bias and variance (bias enters squared). High bias means underfitting and high variance means overfitting. Lowering one usually raises the other, so we want to balance the two. This is known as the bias-variance tradeoff.
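To make the tradeoff concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: the true function ($\sin(2\pi x)$), the noise level, the polynomial degrees, and the helper name `bias_variance`. It fits the same kind of model to many noisy datasets and estimates $\text{Bias}^2$ and variance from the spread of the predictions.

```python
import numpy as np

def bias_variance(degree, n_datasets=200, n_points=20, noise_std=0.3, seed=0):
    """Estimate average bias^2 and variance of a polynomial fit by simulation."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_points)
    x_test = np.linspace(0.05, 0.95, 50)
    f_test = np.sin(2 * np.pi * x_test)  # noiseless target (assumed true function)

    preds = np.empty((n_datasets, x_test.size))
    for k in range(n_datasets):
        # Each dataset shares the same inputs but has fresh noise on y.
        y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, noise_std, n_points)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[k] = np.polyval(coefs, x_test)

    mean_pred = preds.mean(axis=0)                # E[y_hat] at each test point
    bias_sq = np.mean((f_test - mean_pred) ** 2)  # squared bias, averaged over x
    variance = np.mean(preds.var(axis=0))         # spread around E[y_hat]
    return bias_sq, variance

simple = bias_variance(degree=1)     # rigid model (a straight line)
flexible = bias_variance(degree=9)   # flexible model
print("degree 1: bias^2=%.4f variance=%.4f" % simple)
print("degree 9: bias^2=%.4f variance=%.4f" % flexible)
```

Under these assumptions the straight line should show the larger bias and the degree-9 polynomial the larger variance, which is the pattern the table describes.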

Source: GeeksforGeeks

Here is a fantastic visual explanation by Amazon's MLU.

WTF is Regularization?

So what is happening to Pikachu's model? His variance is probably too high :(

Note

Usually, the more features we have, the more ways the model has to fit the training data and thus, the higher the variance.

So what is the fix? Regularization! This concept doesn't only apply to regression models. How do LLMs not overfit? Regularization.

Regularization

The idea of regularization is to increase bias by a little to bring variance down by a lot.

Source: Mahdi Roozbahani (Georgia Tech)

So what does this mean? We are restricting our model. By increasing bias (more distance between $y$ and the average $\hat{y}$), we give up a little of the model's ability to match the true relationship. In exchange, the predictions change much less from one training set to the next, so variance drops a lot and the model becomes more stable. What does that mean intuitively?

Remember in a regression model: $$ y = \theta_0 + \theta_1 f_1(x) + \dots + \theta_d f_d(x) $$ Now, we need to subject $\theta$ to constraints, pulling the values of $\theta$ down (high values = high variance). Previously, to get those $\theta$ values, we could have used gradient descent. With the MSE as our objective function, Pikachu trained his model so much that he was getting very low training error. However, very low training error can be a sign of overfitting, in which case we cannot use the model to generalize. Instead of letting the ball roll all the way down to the minimum, let's add a constraint so that we stop at some point before the minimum. This means we knowingly increase our training error so that our model is more generalizable.

Source: Mahdi Roozbahani (Georgia Tech)

How Regularization Works

All we need to do is add a constraint during training, because without one the $\theta$ values can grow very large. We do this by adding a regularization (penalty) term: $$ \begin{aligned} L(\theta) &= E(\theta) + \text{Regularization Term} && \text{where $E(\theta)$ = MSE} \\ L(\theta) &= E(\theta) + \lambda \theta^T \theta && \text{$\lambda$ is a hyperparameter} \end{aligned} $$ Since $\lambda$ is a hyperparameter, we must choose it:

A large $\lambda$ means our optimizer will assign small $\theta$ values in order to keep the penalty down
A small $\lambda$ means our optimizer is free to assign large $\theta$ values, since the penalty is cheap
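The effect of $\lambda$ can be checked with a small gradient descent sketch. The data, learning rate, step count, and the function name `fit_gd` are assumptions made up for this illustration, and the loss is the MSE plus $\lambda \theta^T \theta$ from above.

```python
import numpy as np

def fit_gd(Z, y, lam, lr=0.02, steps=5000):
    """Gradient descent on MSE + lam * theta^T theta (illustrative settings)."""
    N, d = Z.shape
    theta = np.zeros(d)
    for _ in range(steps):
        # Gradient of the MSE term plus gradient of the penalty term.
        grad = -(2.0 / N) * Z.T @ (y - Z @ theta) + 2.0 * lam * theta
        theta -= lr * grad
    return theta

# Synthetic data (assumed for the example).
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 5))
true_theta = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y = Z @ true_theta + rng.normal(0.0, 0.5, size=50)

# Size of the learned theta for increasing lambda.
norms = {lam: np.linalg.norm(fit_gd(Z, y, lam)) for lam in (0.0, 0.1, 1.0, 10.0)}
for lam, norm in norms.items():
    print(f"lambda={lam:<5} ||theta||={norm:.3f}")
```

The norm of $\theta$ should shrink as $\lambda$ grows, which is the behavior described in the two bullets.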

Note

$\theta^T \theta$ is the same as $\theta_1^2 + \theta_2^2 + \dots + \theta_d^2$. We use it to penalize large $\theta$ values and to keep optimization easy: the penalty is convex, and its gradient is simply $2\lambda \theta$, which gets added to the gradient of $E(\theta)$ during gradient descent. A large penalty makes the objective ($E(\theta) + \text{penalty}$) very big, so the optimizer will avoid these $\theta$ values.
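As a quick sanity check of the two claims in this note, the sketch below compares $\theta^T \theta$ with the sum of squares and compares the gradient $2\lambda\theta$ with a finite-difference estimate. The values of `lam` and `theta` are arbitrary, picked only for the example.

```python
import numpy as np

lam = 0.5                               # arbitrary value for the example
theta = np.array([1.5, -2.0, 0.25])     # arbitrary weights for the example

def penalty(t):
    """L2 penalty: lam * theta^T theta."""
    return lam * (t @ t)

# theta^T theta versus the explicit sum of squares.
dot_form = theta @ theta
sum_form = np.sum(theta ** 2)

# Analytic gradient versus a central finite-difference estimate.
analytic_grad = 2 * lam * theta
eps = 1e-6
numeric_grad = np.array([
    (penalty(theta + eps * e) - penalty(theta - eps * e)) / (2 * eps)
    for e in np.eye(theta.size)         # perturb one coordinate at a time
])

print(dot_form, sum_form)
print(analytic_grad, numeric_grad)
```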

So now, we have: $$ L(\theta) = \frac{1}{N}\sum_{i=1}^N (y_i - x_i^T\theta)^2 + \lambda ||\theta||_2^2 $$ Remember that $x_i^T\theta = \theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2} + \dots + \theta_d x_{id}$ for the $i$-th data point. Before we get confused, let's replace $x$ with $z$, to make clear that each feature can be any function of the input, $z = f(x)$, such as a polynomial term.
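Here is a small sketch of what replacing $x$ with $z$ could look like for polynomial features. The helper name `polynomial_features`, the degree, and the example weights are assumptions for illustration.

```python
import numpy as np

def polynomial_features(x, degree):
    """Map each scalar x to z = [1, x, x^2, ..., x^degree]."""
    x = np.asarray(x, dtype=float)
    # increasing=True puts the constant column first, matching theta_0.
    return np.vander(x, N=degree + 1, increasing=True)

x = np.array([0.0, 1.0, 2.0])
Z = polynomial_features(x, degree=3)        # one row of features per data point

theta = np.array([1.0, 0.5, 0.0, -0.25])    # example weights (assumed)
y_hat = Z @ theta                           # model is still linear in theta
print(Z)
print(y_hat)
```

The model stays linear in $\theta$ even though it is a cubic in $x$, which is why the same regression machinery applies.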

This is only one type of regularized regression, known as ridge regression. Here it is again, alongside lasso:

Ridge Regression (L2 norm penalty term):

$$ L(\theta) = \frac{1}{N}\sum_{i=1}^N (y_i - z_i^T\theta)^2 + \lambda ||\theta||_2^2 $$

Source (left): Mahdi Roozbahani (Georgia Tech)
Source (right): towardsdatascience.com

The point where the circle (the L2 penalty region) touches the ellipsoid (contours of the objective function $E(\theta)$) is the chosen set of $\theta$ values. Weights get shrunk, some to very small values, but rarely exactly to 0 (dense weights).
Convex model
Both the MSE and the L2 penalty are differentiable
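Closed form solution exists (set $\nabla L(\theta) = 0$): $$ \theta = (z^T z + \lambda I)^{-1} z^T y $$ where $z$ is the matrix with one row of features per data point (if the MSE keeps its $\frac{1}{N}$ factor, $\lambda$ here becomes $N\lambda$)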
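The closed form can be sketched in a few lines of NumPy. This example assumes the un-averaged objective $||y - z\theta||^2 + \lambda\theta^T\theta$, which is the one whose minimizer is exactly $(z^T z + \lambda I)^{-1} z^T y$. With a $\frac{1}{N}$-scaled MSE you would pass $N\lambda$ instead. The synthetic data and the name `ridge_closed_form` are made up for illustration, and no intercept column is included.

```python
import numpy as np

def ridge_closed_form(Z, y, lam):
    """Solve (Z^T Z + lam * I) theta = Z^T y without forming an explicit inverse."""
    d = Z.shape[1]
    A = Z.T @ Z + lam * np.eye(d)
    return np.linalg.solve(A, Z.T @ y)

# Synthetic data (assumed for the example).
rng = np.random.default_rng(0)
Z = rng.normal(size=(40, 3))
y = Z @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 0.1, size=40)

lam = 5.0
theta_ols = ridge_closed_form(Z, y, lam=0.0)    # no penalty: ordinary least squares
theta_ridge = ridge_closed_form(Z, y, lam=lam)

# Gradient of ||y - Z theta||^2 + lam * theta^T theta at the ridge solution.
grad_at_ridge = -2 * Z.T @ (y - Z @ theta_ridge) + 2 * lam * theta_ridge
print(theta_ols, theta_ridge)
print(grad_at_ridge)
```

Using `np.linalg.solve` rather than inverting the matrix is a design choice for numerical stability. The formula itself is unchanged.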

Lasso Regression (L1 norm penalty term):

$$ L(\theta) = \frac{1}{N}\sum_{i=1}^N (y_i - z_i^T\theta)^2 + \lambda ||\theta||_1 $$

Source (left): Mahdi Roozbahani (Georgia Tech)
Source (right): G. Sanchez & E.Marzban

The point where the diamond (the L1 penalty region) touches the ellipsoid (contours of the objective function $E(\theta)$) is the chosen set of $\theta$ values. The ellipsoid tends to hit a corner of the diamond, where some $\theta$ values are exactly 0 (sparse weights).
Convex model
L1 is NOT differentiable at $\theta = 0$
No closed form solution exists

If lasso is not differentiable, how do we optimize $\theta$? Introducing: subgradient descent. We replace the gradient with a subgradient $g(\theta)$ of the lasso objective: $$ g(\theta) = -\frac{2}{N} z^T (y - z \theta) + \lambda \cdot \text{sign}(\theta) $$ $$ \text{sign}(\theta) = \begin{cases} 1 & \theta > 0 \\ -1 & \theta < 0 \\ [-1,1] & \theta = 0 \end{cases} $$ where sign is applied to each component of $\theta$. We allow any value in $[-1, 1]$ for $\theta = 0$ because there is no single slope at 0, so any slope between the left slope ($-1$) and the right slope ($+1$) works.
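A minimal sketch of subgradient descent for lasso follows. It assumes the $\frac{1}{N}$-scaled MSE plus $\lambda ||\theta||_1$, picks 0 from $[-1, 1]$ at $\theta = 0$ (which is what `np.sign` returns), and uses a shrinking step size of `lr / sqrt(t)`, a common choice that is not taken from the text. The data and the function name are hypothetical.

```python
import numpy as np

def lasso_subgradient_descent(Z, y, lam, lr=0.1, steps=5000):
    """Subgradient descent on (1/N)||y - Z theta||^2 + lam * ||theta||_1."""
    N, d = Z.shape
    theta = np.zeros(d)
    for t in range(1, steps + 1):
        # np.sign(0) == 0, a valid choice from [-1, 1] at theta = 0.
        g = -(2.0 / N) * Z.T @ (y - Z @ theta) + lam * np.sign(theta)
        theta -= (lr / np.sqrt(t)) * g   # diminishing step size (assumed schedule)
    return theta

def lasso_objective(Z, y, theta, lam):
    """Value of the lasso objective at theta."""
    return np.mean((y - Z @ theta) ** 2) + lam * np.sum(np.abs(theta))

# Synthetic data where only features 0 and 3 matter (assumed for the example).
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 5))
y = Z @ np.array([2.0, 0.0, 0.0, -3.0, 0.0]) + rng.normal(0.0, 0.1, size=100)

lam = 0.5
theta_lasso = lasso_subgradient_descent(Z, y, lam)
print(np.round(theta_lasso, 3))
```

Plain subgradient descent only hovers around 0 for the irrelevant weights rather than landing on it exactly, so in practice people often threshold tiny values or use a method such as coordinate descent.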

Conclusion

So how do we prevent overfitting? Use a regularization (penalty) term. Earlier, Pikachu overfit to past data, making future predictions useless. By accepting slightly higher training error (a little more bias), we make our model more generalizable to new data points.

Source: Me :)