Intro to Deep Learning — Bias-Variance tradeoff, Regularizations

Hyejin Kim
6 min read · May 29, 2021

This article introduces the bias-variance tradeoff in deep learning. We’ll first discuss what the bias-variance tradeoff is, then its implications for model complexity, and finally how to avoid overfitting and obtain a well-generalized model.

Bias measures how far off, on average, a model’s predictions are from the target point. High bias means the predictions are systematically far from the target. Variance measures how scattered the predictions are: high variance means the predictions are widely dispersed, with large deviations from their average location. Bias and variance are closely related to model complexity in machine learning. With a simple model (i.e., when the model complexity is low), the model fails to fully capture the relationship between the features in the training data, so bias is high while variance is low. With a complex model (i.e., when the model complexity is high), the model can overfit the training data; in this case bias is low and variance can be high.

This relationship can be mathematically derived by decomposing the mean squared error between the test data and the predicted values. The full proof can be found in the reference at the end of this article.
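For reference, here is the standard decomposition: for a true function f, an estimator g trained on a random dataset, and noise with variance σ², the expected squared error at a point x splits into three terms.

$$
\mathbb{E}\big[(y - g(x))^2\big]
= \underbrace{\big(\mathbb{E}[g(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(g(x) - \mathbb{E}[g(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$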

We aim to achieve both low bias and low variance when training a machine learning model, and we can use the bias-variance tradeoff to choose a strategy for getting there. For example, when the model is too simple, it fails to capture the relationship between the features and shows consistently high test error; in that case we should increase the model complexity to decrease the bias. With high complexity, a model might overfit the training data and the variance gets high; for a complex model, we can reduce the variance and avoid overfitting by training with a larger amount of data. (Increasing the dataset size with a too-simple model won’t help!)
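To make this diagnostic concrete, here is a minimal sketch of the decision logic, with made-up error values and an arbitrary threshold chosen purely for illustration:

# Illustrative only: real thresholds depend on the problem's noise level.
def diagnose(train_error, test_error, tol=0.1):
    if train_error > tol and test_error > tol:
        return "high bias: increase model complexity"
    if train_error <= tol and test_error > tol:
        return "high variance: add training data or regularize"
    return "reasonable fit"

print(diagnose(0.52, 0.55))  # consistently high error -> underfitting
print(diagnose(0.02, 0.48))  # low train, high test error -> overfitting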

Now let’s experimentally check the bias-variance tradeoff as we vary the model complexity.

Consider the case where y(x) = x + sin(1.5x) + N(0, 0.3), where N(0, 0.3) is a normal distribution with mean 0 and variance 0.3. Here f(x) = x + sin(1.5x) and e = N(0, 0.3) is the noise term. Let’s create a dataset of 20 points by randomly generating samples from y.

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)
x_range = np.arange(0, 10, 0.1)
std = 0.3 ** (1 / 2)  # noise std, since the variance is 0.3
x_fit = np.random.uniform(0, 10, 20)  # 20 random sample locations

def func(x):
    return np.sin(1.5 * x) + x

def addNoisetoFunc(f, shape, std):
    return f + np.random.randn(*shape) * std

def plotFunc(f):
    # plot the true function f(x)
    plt.plot(x_range, f(x_range), label='f(x)')
    # scatter the noisy samples y = f(x) + e
    y = addNoisetoFunc(func(x_fit), x_fit.shape, std)
    plt.scatter(x_fit, y, label='y')

plotFunc(func)
plt.legend()
plt.show()

Let’s use a weighted sum of polynomials as an estimator function for f(x): gₙ(x) = β₀ + β₁x + β₂x² + … + βₙxⁿ

Consider three candidate estimators, g1, g3, and g10, and estimate the coefficients of each of the three estimators using the sampled dataset. Which estimator is underfitting and which one is overfitting?

polynomial_degrees = [1, 3, 10]
theta = {}
fit = {}
x_grid = np.linspace(0, 5, 100)
x_fit = np.random.uniform(0, 5, 20)
plotFunc(func)
std = 0.3 ** (1 / 2)
y = addNoisetoFunc(func(x_fit), x_fit.shape, std)
for degree in polynomial_degrees:
    # fit a degree-n polynomial to the noisy samples
    theta[degree] = np.polyfit(x_fit, y, degree)
    fit[degree] = np.polyval(theta[degree], x_grid)
    plt.plot(x_grid, fit[degree], label=f"g_({degree})")
plt.legend()
plt.xlim([0, 5])
plt.ylim([0, 10])

As we can see in the above plot, g1 (low degree) is underfitting and g10 (high degree) is overfitting.

Now let’s display the tradeoff between bias and variance as model complexity grows. Generate 100 datasets, each of size 50, by randomly sampling from y, and fit estimators of varying complexity, i.e., g1, g2, …, g15. Then let’s display the squared bias, the variance, and the testing-set error for each of the estimators.

from collections import defaultdict

# Hyperparameters
np.random.seed(0)
std = 0.3 ** (1 / 2)  # gaussian noise std
dsize = 50  # size of each dataset
n_dataset = 100  # number of datasets
n_trainset = int(np.ceil(dsize * 0.8))
polynomial_degrees = range(1, 16)  # model complexities

# Fixed x values
x = np.linspace(0, 5, dsize)
x = np.random.permutation(x)
x_train = x[:n_trainset]
x_test = x[n_trainset:]

# Variables
theta_hat = defaultdict(list)
pred_train = defaultdict(list)
pred_test = defaultdict(list)
train_errors = defaultdict(list)
test_errors = defaultdict(list)

def error(pred, actual):
    return (pred - actual) ** 2

# Loop over datasets
def train_over_polynomial_degrees():
    for dataset in range(n_dataset):
        # Simulate training/testing targets
        y_train = addNoisetoFunc(func(x_train), x_train.shape, std)
        y_test = addNoisetoFunc(func(x_test), x_test.shape, std)
        # Loop over model complexities
        for degree in polynomial_degrees:
            # Train model
            tmp_theta_hat = np.polyfit(x_train, y_train, degree)
            # Make predictions on train set
            tmp_pred_train = np.polyval(tmp_theta_hat, x_train)
            pred_train[degree].append(tmp_pred_train)
            # Test predictions
            tmp_pred_test = np.polyval(tmp_theta_hat, x_test)
            pred_test[degree].append(tmp_pred_test)
            # Mean squared error for train and test sets
            train_errors[degree].append(np.mean(error(tmp_pred_train, y_train)))
            test_errors[degree].append(np.mean(error(tmp_pred_test, y_test)))

def calculate_estimator_bias_squared(pred_test):
    pred_test = np.array(pred_test)
    average_model_prediction = pred_test.mean(0)  # E[g(x)]
    # squared distance between the average prediction and the true function
    return np.mean((average_model_prediction - func(x_test)) ** 2)

def calculate_estimator_variance(pred_test):
    pred_test = np.array(pred_test)
    average_model_prediction = pred_test.mean(0)  # E[g(x)]
    # average squared deviation of the predictions from the average prediction
    return np.mean((pred_test - average_model_prediction) ** 2)

# Without regularization
train_over_polynomial_degrees()
complexity_train_error = []
complexity_test_error = []
bias_squared = []
variance = []
for degree in polynomial_degrees:
    complexity_train_error.append(np.mean(train_errors[degree]))
    complexity_test_error.append(np.mean(test_errors[degree]))
    bias_squared.append(calculate_estimator_bias_squared(pred_test[degree]))
    variance.append(calculate_estimator_variance(pred_test[degree]))
best_model_degree = polynomial_degrees[np.argmin(complexity_test_error)]

# Plot values
plt.plot(polynomial_degrees, bias_squared, label='$bias^2$')
plt.plot(polynomial_degrees, variance, label='variance')
plt.plot(polynomial_degrees, complexity_test_error, label='Testing Set Error', linewidth=3)
plt.axvline(best_model_degree, linestyle='--', label=f'Best Model (degree={best_model_degree})')
plt.xlabel('Model Complexity (degree)')
plt.ylim([0, 1])
plt.legend()

We can see the bias-variance tradeoff as we increase the model complexity. For low model complexity (under degree 3), bias is high and variance is low. As the model complexity (polynomial degree) increases, the bias decreases and the variance increases. The best model is at degree 3, where the testing-set error, which combines the squared bias, the variance, and the irreducible noise, is at its minimum.

Regularization

Another way to avoid overfitting and decrease model complexity is to use regularization. Two popular regularization methods mitigate large weights during training by adding a penalty term scaled by a value lambda (λ) to the loss.

L2 (Ridge) regularization penalizes the sum of the squared coefficients:

J(β) = Σᵢ (yᵢ − g(xᵢ))² + λ Σⱼ βⱼ²

L1 (Lasso) regularization penalizes the sum of the absolute values of the coefficients, which tends to drive some coefficients exactly to zero:

J(β) = Σᵢ (yᵢ − g(xᵢ))² + λ Σⱼ |βⱼ|

Let's take the degree-10 polynomial and apply L2 regularization. The regularized model has a lower MSE and variance with a slight increase in bias: with a well-selected parameter lambda, L2 regularization significantly reduces the variance of the model without a substantial increase in its bias, and thus can achieve a lower MSE.

from sklearn.linear_model import Ridge

degree = 10
dsize = 50
n_trainset = int(np.ceil(dsize * 0.8))
# Fixed x values
x = np.linspace(1, 5, dsize)
x = np.random.permutation(x)
x_train = x[:n_trainset]
x_test = x[n_trainset:]
y_train = addNoisetoFunc(func(x_train), x_train.shape, std)
y_test = addNoisetoFunc(func(x_test), x_test.shape, std)

# Without regularization: ordinary least-squares degree-10 fit
noreg = np.polyfit(x_train, y_train, degree)
noreg_pred_tests = np.polyval(noreg, x_test)
print("MSE with no regularization and polynomial 10 : " + str(np.mean(error(noreg_pred_tests, y_test))))
print("Bias with no regularization and polynomial 10 : " + str(calculate_estimator_bias_squared(noreg_pred_tests)))
print("Variance with no regularization and polynomial 10 : " + str(calculate_estimator_variance(noreg_pred_tests)))

# Reshape to ndarray[#samples, #features] for scikit-learn
x_train = x_train.reshape(-1, 1)
y_train = y_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

# Expand x into degree-10 polynomial features so Ridge fits the same model class
X_train_poly = np.hstack([x_train ** d for d in range(1, degree + 1)])
X_test_poly = np.hstack([x_test ** d for d in range(1, degree + 1)])

# Polynomial fit with L2 regularization
linridge = Ridge(alpha=2).fit(X_train_poly, y_train)
rg_pred_tests = linridge.predict(X_test_poly)
print("=============================================================================")
print("MSE with regularization and polynomial 10 : " + str(np.mean(error(rg_pred_tests, y_test))))
print("Bias with regularization and polynomial 10 : " + str(calculate_estimator_bias_squared(rg_pred_tests)))
print("Variance with regularization and polynomial 10 : " + str(calculate_estimator_variance(rg_pred_tests)))

MSE with no regularization and polynomial 10 : 1.3450891574737838
Bias with no regularization and polynomial 10 : 1.231012774419095
Variance with no regularization and polynomial 10 : 2.9385471031642894
======================================
MSE with regularization and polynomial 10 : 0.3935519370410087
Bias with regularization and polynomial 10 : 1.2602214598709027
Variance with regularization and polynomial 10 : 1.0206943163933047
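For completeness, L1 (Lasso) regularization can be applied in the same way. Below is a minimal sketch using scikit-learn's Lasso on the same degree-10 polynomial features built above; the alpha value is an arbitrary illustrative choice, not a tuned one:

from sklearn.linear_model import Lasso

# L1-regularized fit on the same degree-10 polynomial features as above
# (alpha is illustrative; max_iter raised to help the solver converge)
linlasso = Lasso(alpha=0.01, max_iter=100000).fit(X_train_poly, y_train.ravel())
ls_pred_tests = linlasso.predict(X_test_poly)
print("MSE with L1 regularization and polynomial 10 : " + str(np.mean(error(ls_pred_tests, y_test.ravel()))))
# L1 pushes some coefficients to exactly zero, yielding a sparse model
print("Number of zero coefficients : " + str(np.sum(linlasso.coef_ == 0)))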

In this article, we studied the bias-variance tradeoff in deep learning and discussed how to achieve a low bias and low variance model. Through the hands-on implementation of the bias-variance tradeoff for different model complexities and the impact of L2 regularization, we could better understand the concepts. These are important concepts in deep learning since we need to be able to select proper model complexity and regularization strategies for any given dataset! In the next article, I’ll introduce various performance metrics in deep learning.

References

https://dustinstansbury.github.io/theclevermachine/bias-variance-tradeof

