How to Initialize weights in a neural net so it performs well? — Super fast explanation for Xavier’s Random Weight Initialization

http://www.mdpi.com/1099-4300/19/3/101

We know that in a neural network, weights are usually initialized randomly, and that kind of initialization can take a significant number of iterations to converge to the lowest loss and reach the ideal weight matrix. The problem is that this kind of initialization is prone to vanishing or exploding gradient problems.

One way to reduce this problem is to choose the random weight initialization carefully. Xavier’s random weight initialization, aka Xavier’s algorithm, factors the size of the network (number of input and output neurons) into the equation and addresses these problems.

Xavier Glorot and Yoshua Bengio introduced this way of choosing better random initial weights. It not only reduces the chance of running into gradient problems but also helps the network converge to the lowest error faster.

General ways to initialize better weights:

a) If you’re using the ReLU activation function in deep nets (I’m talking about the hidden layers’ output activation function), then:

Generate a random sample of weights from a Gaussian distribution with mean 0 and standard deviation 1.
Multiply that sample by the square root of (2/ni), where ni is the number of input units for that layer.
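
The two steps above can be sketched in NumPy as follows. This is a minimal sketch rather than a reference implementation: the function name relu_init, the layer sizes, and the (ni, no) layout of the weight matrix are assumptions made for illustration.

```python
import numpy as np

def relu_init(ni, no, rng):
    """Sketch of the ReLU rule: N(0, 1) samples scaled by sqrt(2 / ni).

    The (ni, no) matrix layout is an assumption made for this example.
    """
    w = rng.standard_normal((ni, no))  # step 1: mean 0, standard deviation 1
    return w * np.sqrt(2.0 / ni)       # step 2: scale by sqrt(2 / ni)

rng = np.random.default_rng(0)
W = relu_init(ni=512, no=256, rng=rng)  # hypothetical layer sizes

print(W.shape)                      # shape of the weight matrix
print(W.std(), np.sqrt(2.0 / 512))  # empirical vs. target standard deviation
```

The empirical standard deviation of W should land very close to the target, since the matrix holds many samples.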

b) Likewise, if you’re using the tanh activation function:

Generate a random sample of weights from a Gaussian distribution with mean 0 and standard deviation 1.
Multiply that sample by the square root of (1/ni), where ni is the number of input units for that layer.
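
To see what the square root of (1/ni) buys, here is a small hedged check. Assuming the inputs to the layer have unit variance, the weighted sum z that goes into tanh should also come out with a variance close to 1, whatever ni is. The variable names and sizes below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
ni, no, batch = 1000, 200, 500  # hypothetical sizes

# Tanh rule: N(0, 1) samples scaled by sqrt(1 / ni).
W = rng.standard_normal((ni, no)) * np.sqrt(1.0 / ni)

x = rng.standard_normal((batch, ni))  # assumed unit-variance inputs
z = x @ W                             # pre-activation fed to tanh

print(z.var())  # close to 1: a sum of ni terms, each with variance 1/ni
```

Without the scaling, the variance of z would grow in proportion to ni and push tanh into its flat, saturated regions.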

So what is this Xavier’s initialization?

The only major difference in Xavier’s initialization is the output term, no: we also take into account the number of output units for that layer.

For Tanh:

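Generate a random sample of weights from a Gaussian distribution with mean 0 and standard deviation 1.
Multiply that sample by the square root of (2/(ni+no)), where ni is the number of input units and no is the number of output units for that layer. When ni equals no, this reduces to the square root of (1/ni), the same factor as in the tanh rule above.

# python code is here
import numpy as np
# x_dim, y_dim: shape of the weight matrix; ni, no: input and output units of the layer
# randn draws from a Gaussian with mean 0 and standard deviation 1 (rand would be uniform on [0, 1))
W = np.random.randn(x_dim, y_dim) * np.sqrt(2 / (ni + no))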

Why does this initialization help prevent gradient problems?

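This sort of initialization keeps the weights from being much larger or much smaller than the layer size calls for, so the weighted sum at each layer stays at roughly the same scale as its input (neither much bigger than 1 nor much smaller than 1, for inputs of scale 1). As a result, the gradients are far less likely to explode or vanish as they pass through many layers.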
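
A rough way to see this effect is to push random inputs through a stack of tanh layers and look at the spread of the final activations. The sketch below is an illustration under stated assumptions, not a benchmark: the helper final_activation_std, the depth, and the width are hypothetical, it only tracks forward activations as a proxy for gradient behaviour, and the Xavier case uses the variance 2/(ni+no) from the Glorot and Bengio paper.

```python
import numpy as np

def final_activation_std(scale_fn, depth=20, width=256, batch=512, seed=0):
    """Push random inputs through `depth` tanh layers; return the last layer's std."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((batch, width))
    for _ in range(depth):
        # N(0, 1) samples multiplied by the scale factor under test
        W = rng.standard_normal((width, width)) * scale_fn(width, width)
        a = np.tanh(a @ W)
    return a.std()

unscaled = final_activation_std(lambda ni, no: 1.0)
tiny = final_activation_std(lambda ni, no: 0.01)
xavier = final_activation_std(lambda ni, no: np.sqrt(2.0 / (ni + no)))

print(unscaled)  # near 1: tanh saturates at +/-1, where its gradient is almost 0
print(tiny)      # near 0: the signal shrinks layer after layer
print(xavier)    # in between: activations stay in tanh's responsive range
```

With unscaled weights the units saturate, with very small weights the signal dies out, and with the Xavier scale the activations stay in a usable range even after 20 layers.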

I learnt this from Coursera’s Awesome Deep Learning Specialization: deeplearning.ai

Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization: https://www.coursera.org/learn/deep-neural-network/

Here is the original Paper: