Intro to Deep Learning — weight initialization

Sigmoid and tanh are popular activation functions for neural network models. When training a multi-layer neural network, the forward propagation phase computes the weighted sum of the previous layer’s outputs and passes the activation of this value to the next layer. The backpropagation phase calculates the gradients and updates the weights of each layer.

Vanishing gradient problems

However, sigmoid and tanh activation functions can run into the vanishing gradient problem, which is caused by the properties of the derivatives of these functions. The vanishing gradient problem is the phenomenon in which gradients in the hidden layers become so small that the model doesn’t train well.

The gradient of the layer h_t can be computed as follows, where pi is the activation function.

[Figure: gradient calculation]

For sigmoid and tanh functions, the derivatives can be represented as follows.

[Figure: sigmoid and derivative of sigmoid]
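For reference, the standard closed forms of these derivatives are written out here. The symbol $\sigma$ for the sigmoid is an assumption, since the original formulas were images.

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr) \le \tfrac{1}{4}

\tanh'(x) = 1 - \tanh^2(x) \le 1
```

The sigmoid derivative peaks at $x = 0$, where $\sigma(0) = 1/2$ gives $\sigma'(0) = 1/4$.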

Look at the above graph of the sigmoid function and its derivative. The sigmoid takes values in (0, 1), and its derivative is at most 0.25. Now let’s say we have 100 hidden layers and compute the gradient while backpropagating errors. According to the above formula for the gradient, each layer contributes one derivative factor, so after 100 layers the activation derivatives alone scale the gradient by at most 0.25¹⁰⁰ (about 6 × 10⁻⁶¹) of the original value. This causes the update magnitudes of earlier layers to be very small compared to those of later layers.
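The arithmetic can be checked numerically. This is a small sketch, assuming NumPy and considering only the activation derivatives (weights are ignored); the function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.linspace(-10.0, 10.0, 2001)   # grid that passes through x = 0
max_grad = sigmoid_grad(xs).max()     # peak of the derivative, reached at x = 0

n_layers = 100
upper_bound = max_grad ** n_layers    # best case for a product of 100 derivatives

print(max_grad)
print(upper_bound)
```

Even in this best case, where every unit sits exactly at the peak of the derivative, the product is far too small to produce a useful update.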

ReLU activation

One way to avoid the vanishing gradient problem is to use the ReLU activation function. For input values greater than 0, the gradient of ReLU is exactly 1, so it does not shrink toward 0 as the input grows.

[Figure: ReLU activation]

Dead neurons

However, ReLU is not totally free from the vanishing gradient problem. For input values smaller than zero, the output of ReLU is zero, and so is its gradient. If a hidden unit’s input is negative for every example, the unit never fires and its weights stop updating; such units are often referred to as “dead neurons”. So bad weight initialization can lead to bad training when using ReLU, too.
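A dead unit is easy to reproduce in isolation. The sketch below assumes NumPy, random normal inputs, and a large negative bias of -100 chosen purely for illustration so that the pre-activation is negative for every input.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # Derivative of ReLU: 1 where z > 0, 0 elsewhere.
    return (z > 0).astype(float)

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 20))       # 1000 illustrative inputs with 20 features
w = rng.normal(0.0, 0.3, size=20)     # weights of a single hidden unit

z_healthy = x @ w                     # pre-activation with zero bias
z_dead = x @ w - 100.0                # same unit with a large negative bias

frac_active_healthy = relu_grad(z_healthy).mean()
frac_active_dead = relu_grad(z_dead).mean()

print(frac_active_healthy)            # fraction of inputs that pass a gradient
print(frac_active_dead)               # no input passes a gradient
```

Because the gradient is zero for every input, gradient descent has no signal with which to move the dead unit back into the active region.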

Asymmetric weight initialization

This is where proper weight initialization for each layer of a multi-layer network becomes important. Asymmetric weight initialization techniques such as Xavier and He were invented to solve this problem. Xavier is often applied to networks that use sigmoid/tanh activations, and He to networks that use ReLU activations. Let’s check whether these asymmetric weight initializations actually affect the training of multi-layer networks.
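Both schemes come down to choosing the standard deviation of the weights from the layer sizes. The sketch below uses the formulas documented for the Keras `GlorotNormal` and `HeNormal` initializers; the function names are illustrative, and Keras additionally truncates the normal distribution, which is not modeled here.

```python
import math

def glorot_normal_std(fan_in, fan_out):
    # Xavier/Glorot normal: stddev = sqrt(2 / (fan_in + fan_out)).
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_normal_std(fan_in):
    # He normal: stddev = sqrt(2 / fan_in).
    return math.sqrt(2.0 / fan_in)

# Layer shapes matching a 784-input MLP with hidden layers of width 100.
for fan_in, fan_out in [(784, 100), (100, 100)]:
    print(fan_in, fan_out,
          round(glorot_normal_std(fan_in, fan_out), 4),
          round(he_normal_std(fan_in), 4))
```

For a 100-to-100 layer this gives 0.1 for Xavier and about 0.141 for He, so the appropriate scale depends on the layer width rather than being a fixed constant.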

The violin plots below show the distribution of outputs from each layer with sigmoid and tanh activation functions when the weights are drawn from a zero-mean normal distribution with a fixed standard deviation. The distribution of outputs is very unstable across layers, and this gets worse as the standard deviation of the initialization increases.

import keras
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from keras import initializers
from keras.datasets import mnist
from utils import (
    compile_model,
    create_mlp_model,
    get_activations,
    grid_axes_it,
)

seed = 10

# Number of points to plot
n_train = 1000
n_test = 100
n_classes = 10
# Network params
n_hidden_layers = 5
dim_layer = 100
batch_size = n_train
epochs = 1
# Load and prepare MNIST dataset.
n_train = 60000
n_test = 10000
(x_train, y_train), (x_test, y_test) = mnist.load_data()
num_classes = len(np.unique(y_test))
data_dim = 28 * 28
x_train = x_train.reshape(60000, 784).astype('float32')[:n_train]
x_test = x_test.reshape(10000, 784).astype('float32')[:n_test]
x_train /= 255
x_test /= 255
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
# Run the data through a few MLP models and save the activations from
# each layer into a Pandas DataFrame.
rows = []
sigmas = [0.10, 0.14, 0.28]
def test_initializer_normal(func):
    """Collect layer outputs for normal initializers with each stddev in sigmas."""
    for stddev in sigmas:
        init = initializers.RandomNormal(mean=0.0, stddev=stddev, seed=seed)
        activation = func
        model = create_mlp_model(
            n_hidden_layers,
            dim_layer,
            (data_dim,),
            n_classes,
            init,
            'zeros',
            activation
        )
        compile_model(model)
        output_elts = get_activations(model, x_test)
        n_layers = len(model.layers)
        i_output_layer = n_layers - 1
        for i, out in enumerate(output_elts[:-1]):
            if i > 0 and i != i_output_layer:
                # Keep every 20th output to limit the number of plotted points.
                for out_i in out.ravel()[::20]:
                    rows.append([i, stddev, out_i])

def test_initializer(func, initializer):
    """Collect layer outputs for a fixed initializer; stddev only labels the rows."""
    for stddev in sigmas:
        init = initializer
        activation = func
        model = create_mlp_model(
            n_hidden_layers,
            dim_layer,
            (data_dim,),
            n_classes,
            init,
            'zeros',
            activation
        )
        compile_model(model)
        output_elts = get_activations(model, x_test)
        n_layers = len(model.layers)
        i_output_layer = n_layers - 1
        for i, out in enumerate(output_elts[:-1]):
            if i > 0 and i != i_output_layer:
                # Keep every 20th output to limit the number of plotted points.
                for out_i in out.ravel()[::20]:
                    rows.append([i, stddev, out_i])

# Plot previously saved activations from the 5 hidden layers
# using different initialization schemes.
def plot_weight(func):
    """Draw one violin plot per stddev from the global DataFrame df."""
    fig = plt.figure(figsize=(12, 6))
    axes = grid_axes_it(len(sigmas), 1, fig=fig)
    for sig in sigmas:
        ax = next(axes)
        ddf = df[df['Standard Deviation'] == sig]
        sns.violinplot(x='Hidden Layer', y='Output', data=ddf, ax=ax, scale='count', inner=None)
        ax.set_xlabel('')
        ax.set_ylabel('')
        # Raw string so that \mu and \sigma are not treated as escape sequences.
        ax.set_title(r'Weights Drawn from $N(\mu = 0, \sigma = {%.2f})$' % sig, fontsize=13)
        if sig == sigmas[1]:
            ax.set_ylabel(func + " Neuron Outputs")
        if sig != sigmas[-1]:
            ax.set_xticklabels(())
        else:
            ax.set_xlabel("Hidden Layer")
    plt.tight_layout()
    plt.show()

test_initializer_normal('sigmoid')
df = pd.DataFrame(rows, columns=['Hidden Layer', 'Standard Deviation', 'Output'])
plot_weight("sigmoid")
# Drop the sigmoid rows so the next plot only contains tanh outputs.
rows.clear()
test_initializer_normal('tanh')
df = pd.DataFrame(rows, columns=['Hidden Layer', 'Standard Deviation', 'Output'])
plot_weight("tanh")

After applying Xavier initialization to the sigmoid and tanh networks, the distributions of outputs become similar across hidden layers and much more stable. Using randomized asymmetric weight initializations, we can mitigate vanishing/exploding gradient problems.

# Xavier for sigmoid & tanh
initializer = initializers.GlorotNormal(seed=seed)
rows.clear()  # start from an empty list so earlier runs are not plotted again
test_initializer('sigmoid', initializer)
df = pd.DataFrame(rows, columns=['Hidden Layer', 'Standard Deviation', 'Output'])
plot_weight("sigmoid")
rows.clear()
test_initializer('tanh', initializer)
df = pd.DataFrame(rows, columns=['Hidden Layer', 'Standard Deviation', 'Output'])
plot_weight("tanh")
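The same effect can be reproduced without Keras. The following is a minimal NumPy sketch, not the code used for the plots: it assumes unit-normal inputs instead of MNIST, five square tanh layers of width 100, and illustrative fixed standard deviations (0.01 and 1.0) that differ from the ones used in the experiment. The helper name `layer_stds` is hypothetical.

```python
import numpy as np

def layer_stds(weight_std_fn, activation, n_layers=5, dim=100, n_samples=1000, seed=10):
    """Push random inputs through n_layers dense layers; return each layer's output std."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(n_samples, dim))
    stds = []
    for _ in range(n_layers):
        w = rng.normal(0.0, weight_std_fn(dim, dim), size=(dim, dim))
        h = activation(h @ w)
        stds.append(float(h.std()))
    return stds

# Fixed, too-small stddev: outputs shrink layer by layer.
fixed_small = layer_stds(lambda fan_in, fan_out: 0.01, np.tanh)
# Fixed, too-large stddev: tanh saturates near -1 and 1.
fixed_large = layer_stds(lambda fan_in, fan_out: 1.0, np.tanh)
# Xavier/Glorot normal stddev: outputs keep a usable spread.
xavier = layer_stds(lambda fan_in, fan_out: np.sqrt(2.0 / (fan_in + fan_out)), np.tanh)

for name, stds in [("small", fixed_small), ("large", fixed_large), ("xavier", xavier)]:
    print(name, [round(s, 4) for s in stds])
```

The per-layer standard deviation is a one-number summary of each violin: a collapsing value corresponds to a violin squeezed at zero, and a value close to 1 corresponds to outputs piled up at the saturated ends of tanh.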

The layers with ReLU activation functions and plain normal initialization are even worse: the outputs from each layer are almost zero with the highest standard deviation. This shows the dead neuron phenomenon. ReLU prevents the vanishing gradient problems of sigmoid and tanh activations, but bad initialization can cause dead neurons, and the model can’t learn from these neurons during training.

He initialization prevents dead neurons for ReLU activations. After applying He initialization, the distributions of outputs become similar across hidden layers and much more stable.

# Try relu activation
rows.clear()  # start from an empty list so earlier runs are not plotted again
test_initializer_normal('relu')
df = pd.DataFrame(rows, columns=['Hidden Layer', 'Standard Deviation', 'Output'])
plot_weight("relu")
# He for relu
initializer = initializers.HeNormal(seed=seed)
rows.clear()
test_initializer('relu', initializer)
df = pd.DataFrame(rows, columns=['Hidden Layer', 'Standard Deviation', 'Output'])
plot_weight("relu")
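A rough NumPy check of the ReLU case is sketched below. It is not the experiment above: it assumes unit-normal inputs, five square layers of width 100, and an illustrative fixed standard deviation of 0.01 for comparison with the He value. The helper name `relu_layer_rms` is hypothetical.

```python
import numpy as np

def relu_layer_rms(weight_std, n_layers=5, dim=100, n_samples=1000, seed=10):
    """Return the root-mean-square output of each ReLU layer for a given weight std."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(n_samples, dim))
    rms = []
    for _ in range(n_layers):
        w = rng.normal(0.0, weight_std, size=(dim, dim))
        h = np.maximum(h @ w, 0.0)          # dense layer followed by ReLU
        rms.append(float(np.sqrt(np.mean(h ** 2))))
    return rms

dim = 100
small = relu_layer_rms(0.01, dim=dim)              # fixed, too-small stddev
he = relu_layer_rms(np.sqrt(2.0 / dim), dim=dim)   # He normal stddev

print("small:", small)
print("he:   ", he)
```

With the He standard deviation the output magnitude stays roughly constant from layer to layer, while the too-small fixed value drives the outputs toward zero within a few layers.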

In this post, I introduced the vanishing gradient problem in multi-layer neural networks and several asymmetric weight initialization techniques. I also showed empirically how asymmetric weight initialization can mitigate vanishing/exploding gradients.

If your machine learning model runs without errors but is not training well, you can try checking the weight distributions of the hidden layers (though there might be other potential causes)!
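One possible way to run such a check is sketched below. The helper `summarize_weights` and the demo arrays are hypothetical; with a Keras model, the arrays could instead be taken from `layer.get_weights()` for each layer.

```python
import numpy as np

def summarize_weights(named_weights):
    """Return (name, mean, std, fraction of near-zero entries) for each weight array."""
    summary = []
    for name, w in named_weights.items():
        w = np.asarray(w, dtype=float)
        near_zero = float(np.mean(np.abs(w) < 1e-6))
        summary.append((name, float(w.mean()), float(w.std()), near_zero))
    return summary

# Stand-in arrays for illustration only.
rng = np.random.default_rng(10)
demo = {
    "hidden_1": rng.normal(0.0, 0.1, size=(100, 100)),
    "hidden_2": np.zeros((100, 100)),   # a collapsed layer, for contrast
}

for name, mean, std, near_zero in summarize_weights(demo):
    print(f"{name}: mean={mean:+.4f} std={std:.4f} near_zero={near_zero:.2%}")
```

A layer whose standard deviation is far from its neighbors, or whose entries are mostly near zero, is a reasonable place to start looking.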
