Weight initialization

In machine learning and deep learning, weight initialization describes the initial step in creating a neural network. A neural network contains trainable parameters that are modified during training. Before training, these need to be assigned initial values. This assignment step is weight initialization.

The choice of weight initialization method affects the speed of convergence, the scale of neural activations within the network, the scale of gradient signals during backpropagation, and the quality of the final model. Proper initialization is necessary for avoiding issues such as vanishing and exploding gradients and saturation of activation functions.

Constant initialization

We discuss the main methods of initialization in the context of a multilayer perceptron (MLP). Specific strategies for initializing other network architectures are discussed in later sections.

For an MLP, there are only two kinds of trainable parameters, called weights and biases. Each layer $l$ contains a weight matrix $W^{(l)}\in \mathbb {R} ^{n_{l-1}\times n_{l}}$ and a bias vector $b^{(l)}\in \mathbb {R} ^{n_{l}}$ , where $n_{l}$ is the number of neurons in that layer. A weight initialization method is an algorithm for setting the initial values for $W^{(l)},b^{(l)}$ for each layer $l$ .

The simplest form is zero initialization: $W^{(l)}=0,b^{(l)}=0$ . Zero initialization is commonly used for biases, but not for weights. If all weights in a layer start with the same value, every neuron in that layer computes the same output and receives the same gradient, so the neurons remain identical throughout training and learn the same features.
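The symmetry problem can be illustrated numerically. The following is a minimal sketch, assuming a two-layer network with tanh hidden units, a squared-error loss, and no biases; the function name `hidden_weight_gradient` and the layer sizes are illustrative choices, not part of any library.

```python
import numpy as np

def hidden_weight_gradient(W1, W2, x, y):
    """Gradient of 0.5 * ||tanh(x W1) W2 - y||^2 with respect to W1."""
    h = np.tanh(x @ W1)                 # hidden activations, (batch, n_hidden)
    err = h @ W2 - y                    # output error, (batch, n_out)
    dh = (err @ W2.T) * (1.0 - h ** 2)  # backpropagate through tanh
    return x.T @ dh                     # (n_in, n_hidden): one column per hidden neuron

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

# Zero initialization: no gradient reaches the hidden weights at all.
g_zero = hidden_weight_gradient(np.zeros((3, 4)), np.zeros((4, 1)), x, y)

# Any other constant: every hidden neuron receives the same gradient.
g_const = hidden_weight_gradient(np.full((3, 4), 0.5), np.full((4, 1), 0.5), x, y)

# Random initialization breaks the symmetry.
g_rand = hidden_weight_gradient(rng.normal(size=(3, 4)), rng.normal(size=(4, 1)), x, y)

zero_has_no_gradient = bool(np.all(g_zero == 0.0))
const_is_symmetric = bool(np.allclose(g_const, g_const[:, :1]))
random_is_symmetric = bool(np.allclose(g_rand, g_rand[:, :1]))
```

In this toy setting the columns of the gradient, one per hidden neuron, coincide under constant initialization, so a gradient step cannot make the neurons differ from one another.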

Recurrent neural networks typically use activation functions with bounded range, such as sigmoid and tanh, since unbounded activations may cause exploding values. For recurrent networks with ReLU units, (Le, Jaitly, Hinton, 2015)^[1] suggested initializing the recurrent weight matrix to the identity and the biases to zero.
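As a rough illustration of why an identity recurrent matrix is attractive, the sketch below assumes a single ReLU recurrent layer and, purely to isolate the recurrence, sets the input weights to zero (in practice they would be initialized randomly). The names `W_hh`, `W_xh`, `b_h`, and `step` are hypothetical.

```python
import numpy as np

n_hidden, n_input = 4, 3
W_hh = np.eye(n_hidden)                 # recurrent weights: identity
W_xh = np.zeros((n_input, n_hidden))    # zeroed here only to isolate the recurrence
b_h = np.zeros(n_hidden)                # zero bias

def step(h, x):
    """One step of a ReLU recurrent layer: h' = relu(h W_hh + x W_xh + b_h)."""
    return np.maximum(0.0, h @ W_hh + x @ W_xh + b_h)

h0 = np.array([0.5, 1.0, 0.0, 2.0])     # non-negative initial state
h = h0.copy()
for _ in range(100):
    h = step(h, np.zeros(n_input))
```

Under these assumptions the hidden state after 100 steps equals the initial state, so at initialization the layer neither shrinks nor amplifies what it has stored.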

In most cases, the biases are initialized to zero, though some situations can use a nonzero initialization. For example, in multiplicative units, such as the forget gate of LSTM, the bias can be initialized to 1 to allow good gradient signal through the gate.^[2] For neurons with ReLU activation, one can initialize the bias to a small positive value like 0.1, so that the gradient is likely nonzero at initialization, avoiding the dying ReLU problem.^[3]^: 305^[4]
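The nonzero bias choices above might be written as follows. This is only a sketch: the packing of the four LSTM gate biases into one vector, and the gate order (input, forget, cell, output), are assumptions made for illustration and differ between libraries.

```python
import numpy as np

n_hidden = 8

# Assumed (hypothetical) layout: one bias vector holding four gate blocks
# in the order input, forget, cell, output.
lstm_bias = np.zeros(4 * n_hidden)
lstm_bias[n_hidden:2 * n_hidden] = 1.0   # forget-gate bias set to 1

# If the weighted inputs are near zero at initialization, the forget gate
# starts at sigmoid(1), i.e. mostly open, instead of sigmoid(0) = 0.5.
forget_gate_at_init = 1.0 / (1.0 + np.exp(-lstm_bias[n_hidden:2 * n_hidden]))

# ReLU layer: a small positive bias so that units tend to start in the active region.
relu_bias = np.full(n_hidden, 0.1)
```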

Random initialization

Random initialization means sampling the weights from a normal distribution or a uniform distribution, usually independently.

LeCun initialization

Uniform random initialization typically samples each entry of $W^{(l)}$ from the uniform distribution ${\mathcal {U}}(\pm 1/{\sqrt {n_{l-1}}})$ , that is, uniformly from the interval $[-1/{\sqrt {n_{l-1}}},+1/{\sqrt {n_{l-1}}}]$ . It is designed to preserve the variance of neural activations during the forward pass.

This is sometimes called LeCun initialization, as it was popularized in (LeCun et al, 1998).^[5]
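A minimal sketch of this sampling rule, assuming NumPy and the weight shape $(n_{l-1}, n_{l})$ used above; the function name `lecun_uniform` is illustrative. The check relies on the standard fact that a uniform distribution on $(-a, a)$ has variance $a^{2}/3$.

```python
import numpy as np

def lecun_uniform(n_in, n_out, rng):
    """Sample an (n_in, n_out) weight matrix from U(-1/sqrt(n_in), +1/sqrt(n_in))."""
    bound = 1.0 / np.sqrt(n_in)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

rng = np.random.default_rng(0)
n_in, n_out = 400, 300
W = lecun_uniform(n_in, n_out, rng)

# Each entry has variance bound^2 / 3 = 1 / (3 * n_in).
entry_var = W.var()

# For unit-variance inputs, the pre-activation variance is n_in * entry_var.
x = rng.normal(size=(1000, n_in))
preact_var = (x @ W).var()
```

With this bound the pre-activation variance does not depend on the layer width: for unit-variance inputs it is about 1/3 whatever the value of `n_in`. A bound of $\sqrt{3/n_{l-1}}$ would instead give each entry a variance of exactly $1/n_{l-1}$.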

Glorot initialization

Glorot initialization (or Xavier initialization) was proposed by Xavier Glorot and Yoshua Bengio.^[6] It was designed as a compromise between two goals: to preserve activation variance during the forward pass and to preserve gradient variance during the backward pass, so that the scale of gradients stays roughly the same in all layers.

For uniform initialization, it samples each entry in $W^{(l)}$ independently and identically from ${\mathcal {U}}(\pm {\sqrt {6/(n_{l-1}+n_{l})}})$ . In this context, $n_{l-1}$ is also called the "fan-in", and $n_{l}$ the "fan-out", matching the shape $n_{l-1}\times n_{l}$ of $W^{(l)}$ .
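The uniform variant can be sketched as below, assuming NumPy and a weight matrix of shape (fan-in, fan-out); the function name `glorot_uniform` is illustrative. Since a uniform distribution on $(-a, a)$ has variance $a^{2}/3$, the entries have variance 2 / (fan-in + fan-out).

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) matrix from U(-b, +b), b = sqrt(6 / (fan_in + fan_out))."""
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
fan_in, fan_out = 300, 500
W = glorot_uniform(fan_in, fan_out, rng)

# Target entry variance implied by the bound: bound^2 / 3.
target_var = 2.0 / (fan_in + fan_out)
empirical_var = W.var()
```

The variance $2/(\text{fan-in}+\text{fan-out})$ lies between $1/\text{fan-in}$, which would preserve forward variance, and $1/\text{fan-out}$, which would preserve backward variance, and equals both when the two are the same.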

He initialization

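As Glorot initialization performs poorly for ReLU activation,^[7] He initialization (or Kaiming initialization) was proposed by Kaiming He et al.^[8] for networks with ReLU activation. It samples each entry in $W^{(l)}$ from a normal distribution with mean 0 and standard deviation ${\sqrt {2/n_{l-1}}}$ , that is, variance $2/n_{l-1}$ . The factor of 2 compensates for ReLU setting roughly half of the pre-activations to zero.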
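The effect can be checked on a toy network. The sketch below assumes a plain stack of equally wide ReLU layers with zero biases and Gaussian inputs; the width, depth, and the name `he_normal` are illustrative choices.

```python
import numpy as np

def he_normal(n_in, n_out, rng):
    """Zero-mean Gaussian weights with standard deviation sqrt(2 / n_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
width, depth = 512, 20
h = rng.normal(size=(256, width))          # unit-variance inputs

mean_squares = []
for _ in range(depth):
    h = np.maximum(0.0, h @ he_normal(width, width, rng))   # ReLU layer, zero bias
    mean_squares.append(float(np.mean(h ** 2)))             # scale of the activations

W_check = he_normal(1000, 1000, rng)
```

In expectation, the mean squared activation is preserved from layer to layer, so the values in `mean_squares` stay on the order of 1 rather than decaying or growing geometrically with depth.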

Orthogonal initialization

(Saxe et al 2013)^[9] proposed orthogonal initialization: initializing weight matrices as random semi-orthogonal matrices, multiplied by a factor that depends on the activation function of the layer. It was designed so that if one initializes a deep linear network this way, then its training time until convergence is independent of depth.^[10]
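One common way to draw a random semi-orthogonal matrix is to take the QR decomposition of a Gaussian matrix. The sketch below assumes NumPy; the function name `orthogonal` and the `gain` parameter (standing in for the activation-dependent factor) are illustrative.

```python
import numpy as np

def orthogonal(n_in, n_out, gain=1.0, rng=None):
    """Random (n_in, n_out) semi-orthogonal matrix, scaled by `gain`."""
    rng = np.random.default_rng() if rng is None else rng
    rows, cols = max(n_in, n_out), min(n_in, n_out)
    a = rng.normal(size=(rows, cols))
    q, r = np.linalg.qr(a)               # q: (rows, cols), orthonormal columns
    q = q * np.sign(np.diag(r))          # fix column signs so the draw is uniform
    if n_in < n_out:
        q = q.T                          # wide case: orthonormal rows instead
    return gain * q

rng = np.random.default_rng(0)
W_tall = orthogonal(6, 4, rng=rng)       # W_tall.T @ W_tall is the 4x4 identity
W_wide = orthogonal(4, 6, rng=rng)       # W_wide @ W_wide.T is the 4x4 identity
```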

Related to this approach, unitary initialization parameterizes the weight matrices as unitary matrices, so that they are random unitary matrices at initialization and remain unitary throughout training. This has been found to improve long-sequence modelling in LSTM.^[11]^[12]

Orthogonal initialization has been generalized to layer-sequential unit-variance (LSUV) initialization. It is a data-dependent initialization method, and can be used in convolutional neural networks. It first initializes weights of each convolution or fully connected layer with orthonormal matrices. Then, proceeding from the first to the final layer, it runs a forward pass on a random minibatch, and divides the layer's weights by the standard deviation of its output, so that its output has variance approximately 1.^[13]^[14]
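The procedure described above can be sketched for a small MLP. This is a single-pass simplification under stated assumptions: fully connected layers without biases, tanh activations, and a synthetic minibatch standing in for real data. The names `orthonormal` and `lsuv_init` are illustrative.

```python
import numpy as np

def orthonormal(n_in, n_out, rng):
    """Random (n_in, n_out) matrix with orthonormal rows or columns, via QR."""
    a = rng.normal(size=(max(n_in, n_out), min(n_in, n_out)))
    q, _ = np.linalg.qr(a)
    return q if n_in >= n_out else q.T

def lsuv_init(layer_sizes, batch, rng):
    """Orthonormal init, then rescale each layer in order using a minibatch."""
    weights = [orthonormal(n_in, n_out, rng)
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    h = batch
    for i, W in enumerate(weights):
        z = h @ W                        # layer output on the minibatch
        std = z.std()
        weights[i] = W / std             # rescale so the output has unit variance
        h = np.tanh(z / std)             # input to the next layer (tanh is an assumption)
    return weights

rng = np.random.default_rng(0)
batch = 5.0 * rng.normal(size=(128, 64))     # stand-in for a data minibatch
weights = lsuv_init([64, 128, 128, 10], batch, rng)
```

Because each layer is rescaled only after the layers before it, the statistics it sees already reflect the final weights of the earlier layers.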

Other random initializations

Fixup initialization is designed specifically for networks with residual connections and without batch normalization, as follows:^[15]

1. Initialize the classification layer and the last layer of each residual branch to 0.
2. Initialize every other layer using a standard method (e.g., He et al. (2015)), and scale only the weight layers inside residual branches by $L^{-{\frac {1}{2m-2}}}$ , where $L$ is the number of residual branches and $m$ is the number of layers inside each branch.
3. Add a scalar multiplier (initialized at 1) in every branch and a scalar bias (initialized at 0) before each convolution, linear, and element-wise activation layer.
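The weight-related part of the recipe (the zeroed last layer and the scaled remaining layers) might look as follows for one residual branch made of fully connected layers. This is a sketch under assumptions: `num_branches` plays the role of $L$ (the number of residual branches), the number of layers in the branch plays the role of $m$ (which must be at least 2), He initialization is used as the standard method, and the scalar multipliers and biases are omitted. The function name is hypothetical.

```python
import numpy as np

def fixup_branch_weights(widths, num_branches, rng):
    """Weights for one residual branch of m = len(widths) - 1 linear layers (m >= 2)."""
    m = len(widths) - 1
    scale = num_branches ** (-1.0 / (2 * m - 2))
    weights = []
    for i, (n_in, n_out) in enumerate(zip(widths[:-1], widths[1:])):
        if i == m - 1:
            W = np.zeros((n_in, n_out))        # last layer of the branch starts at zero
        else:
            # He initialization, rescaled by the depth-dependent factor
            W = scale * rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
        weights.append(W)
    return weights, scale

rng = np.random.default_rng(0)
weights, scale = fixup_branch_weights([64, 64, 64], num_branches=16, rng=rng)
```

With two layers per branch and 16 branches, the factor is $16^{-1/2}=0.25$. Because the last layer of each branch is zero, every residual block initially computes the identity.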

Instead of initializing all weights with random values on the order of $O(1/{\sqrt {n}})$ , sparse initialization initializes only a small subset of the weights with larger random values and sets the other weights to zero, so that the total variance is still on the order of $O(1)$ .^[16]
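A minimal sketch of sparse initialization, assuming that each output unit receives a fixed number `k` of nonzero incoming weights drawn from a Gaussian; the value `k=15`, the `scale` parameter, and the function name are illustrative assumptions.

```python
import numpy as np

def sparse_init(n_in, n_out, k, rng, scale=1.0):
    """Each output unit gets k nonzero incoming weights drawn from N(0, scale^2)."""
    W = np.zeros((n_in, n_out))
    for j in range(n_out):
        idx = rng.choice(n_in, size=k, replace=False)   # which inputs feed unit j
        W[idx, j] = rng.normal(0.0, scale, size=k)
    return W

rng = np.random.default_rng(0)
W = sparse_init(n_in=1000, n_out=200, k=15, rng=rng)
nonzeros_per_unit = np.count_nonzero(W, axis=0)
```

Here the variance of each unit's weighted input depends on `k` rather than on the layer width `n_in`, which is why the individual weights can be much larger than $1/{\sqrt {n}}$ .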

Random walk initialization was designed for MLPs so that, during backpropagation, the L2 norm of the gradient at each layer performs an unbiased random walk as one moves from the last layer to the first.^[17]

Miscellaneous

For the hyperbolic tangent activation function, a particular scaling is sometimes used: $1.7159\tanh(2x/3)$ . This is sometimes called "LeCun's tanh". It was designed so that it maps the interval $[-1,+1]$ to itself, thus ensuring that the overall gain is around 1 in "normal operating conditions", and so that $|f''(x)|$ is at its maximum when $x=-1,+1$ , which improves convergence at the end of training.^[18]^[5]
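Both properties can be checked numerically. The sketch below uses only the standard library; the second derivative is the analytic one for $f(x)=a\tanh(bx)$ with $a=1.7159$ and $b=2/3$, and the grid search is an illustrative way of locating its maximum.

```python
import math

A, B = 1.7159, 2.0 / 3.0

def lecun_tanh(x):
    """Scaled hyperbolic tangent f(x) = 1.7159 * tanh(2x / 3)."""
    return A * math.tanh(B * x)

def lecun_tanh_second_derivative(x):
    """f''(x) = -2 a b^2 tanh(bx) (1 - tanh(bx)^2)."""
    t = math.tanh(B * x)
    return -2.0 * A * B * B * t * (1.0 - t * t)

f_at_plus_one = lecun_tanh(1.0)
f_at_minus_one = lecun_tanh(-1.0)

# Locate the maximum of |f''| on a grid over [0, 3]; by symmetry the same
# holds at the mirrored negative point.
grid = [i / 1000.0 for i in range(3001)]
x_max = max(grid, key=lambda x: abs(lecun_tanh_second_derivative(x)))
```

The function values at $\pm 1$ come out very close to $\pm 1$, and the grid maximum of $|f''|$ falls close to $x=1$.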

In self-normalizing neural networks, the SELU activation function $\mathrm {SELU} (x)=\lambda {\begin{cases}x&{\text{if }}x>0\\\alpha e^{x}-\alpha &{\text{if }}x\leq 0\end{cases}}$ with parameters $\lambda \approx 1.0507,\alpha \approx 1.6733$ ensures that the mean and variance of the output of each layer have $(0,1)$ as an attracting fixed point. This makes initialization less important, though the authors recommend initializing weights randomly with variance $1/n_{l-1}$ .^[19]
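The attracting fixed point can be observed in a small experiment. The sketch below assumes a stack of equally wide fully connected SELU layers without biases, Gaussian weights with variance $1/n_{l-1}$, and deliberately unnormalized inputs; the width, depth, and input statistics are illustrative.

```python
import numpy as np

LAMBDA, ALPHA = 1.0507, 1.6733

def selu(x):
    """SELU with the (rounded) constants quoted above."""
    negative = ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0)
    return LAMBDA * np.where(x > 0, x, negative)

rng = np.random.default_rng(0)
width, depth = 1024, 30
h = 0.5 + 2.0 * rng.normal(size=(64, width))     # mean 0.5, variance 4

for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(1.0 / width), size=(width, width))  # variance 1/n
    h = selu(h @ W)

final_mean, final_var = float(h.mean()), float(h.var())
```

Although the inputs start far from zero mean and unit variance, the activation statistics after many layers end up close to $(0,1)$.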

History

Random weight initialization has been used since Frank Rosenblatt's perceptrons. An early work that described weight initialization specifically was (LeCun et al, 1998).^[5]

Before the 2010s era of deep learning, it was common to initialize models by "pre-training" using an unsupervised learning algorithm that is not backpropagation, as it was difficult to directly train deep neural networks by backpropagation.^[20] For example, a deep belief network was trained by applying contrastive divergence to each pair of adjacent layers in turn, starting from the "lowest" pair of layers.^[21]

(Martens, 2010)^[16] proposed Hessian-free optimization, a quasi-Newton method, to directly train deep networks. The work generated considerable excitement that training networks without a pre-training phase was possible.^[22] However, a 2013 paper demonstrated that, with well-chosen hyperparameters, momentum gradient descent combined with a suitable weight initialization was sufficient for training neural networks, a combination that is still in use as of 2024.^[23]

Since then, tuning the variance at initialization has become less important, as methods have been developed that adjust the variance automatically: batch normalization for the forward pass, and momentum-based optimizers for the backward pass.

References

^ Le, Quoc V.; Jaitly, Navdeep; Hinton, Geoffrey E. (2015). "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units". arXiv:1504.00941 [cs.NE].
^ Jozefowicz, Rafal; Zaremba, Wojciech; Sutskever, Ilya (2015-06-01). "An Empirical Exploration of Recurrent Network Architectures". Proceedings of the 32nd International Conference on Machine Learning. PMLR: 2342–2350.
^ Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). Deep learning. Adaptive computation and machine learning. Cambridge, Massachusetts: The MIT Press. ISBN 978-0-262-03561-3.
^ Lu, Lu; Shin, Yeonjong; Su, Yanhui; Karniadakis, George Em (2019). "Dying ReLU and Initialization: Theory and Numerical Examples". Communications in Computational Physics. 28 (5): 1671–1706. arXiv:1903.06733. doi:10.4208/cicp.OA-2020-0165.
^ LeCun, Yann; Bottou, Leon; Orr, Genevieve B.; Müller, Klaus-Robert (1998), Orr, Genevieve B.; Müller, Klaus-Robert (eds.), "Efficient BackProp", Neural Networks: Tricks of the Trade, Berlin, Heidelberg: Springer, pp. 9–50, doi:10.1007/3-540-49430-8_2, ISBN 978-3-540-49430-0, retrieved 2024-10-05
^ Glorot, Xavier; Bengio, Yoshua (2010-03-31). "Understanding the difficulty of training deep feedforward neural networks". Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings: 249–256.
^ Kumar, Siddharth Krishna (2017). "On weight initialization in deep neural networks". arXiv:1704.08863 [cs.LG].
^ He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". arXiv:1502.01852 [cs.CV].
^ Saxe, Andrew M.; McClelland, James L.; Ganguli, Surya (2013). "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks". arXiv:1312.6120 [cs.NE].
^ Hu, Wei; Xiao, Lechao; Pennington, Jeffrey (2020). "Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks". arXiv:2001.05992 [cs.LG].
^ Arjovsky, Martin; Shah, Amar; Bengio, Yoshua (2016-06-11). "Unitary Evolution Recurrent Neural Networks". Proceedings of the 33rd International Conference on Machine Learning. PMLR: 1120–1128.
^ Henaff, Mikael; Szlam, Arthur; LeCun, Yann (2017-03-15). "Recurrent Orthogonal Networks and Long-Memory Tasks". arXiv:1602.06662 [cs.NE].
^ Mishkin, Dmytro; Matas, Jiri (2016-02-19), All you need is a good init, arXiv:1511.06422, retrieved 2024-10-05
^ Xie, Di; Xiong, Jiang; Pu, Shiliang (2017). All You Need Is Beyond a Good Init: Exploring Better Solution for Training Extremely Deep Convolutional Neural Networks With Orthonormality and Modulation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6176–6185.
^ Zhang, Hongyi; Dauphin, Yann N.; Ma, Tengyu (2019). "Fixup Initialization: Residual Learning Without Normalization". arXiv:1901.09321 [cs.LG].
^ Martens, James (2010-06-21). "Deep learning via Hessian-free optimization". Proceedings of the 27th International Conference on International Conference on Machine Learning. ICML'10. Madison, WI, USA: Omnipress: 735–742. ISBN 978-1-60558-907-7.
^ Sussillo, David; Abbott, L. F. (2014). "Random Walk Initialization for Training Very Deep Feedforward Networks". arXiv:1412.6558 [cs.NE].
^ Y. LeCun. Generalization and network design strategies. In R. Pfeifer, Z. Schreter, F. Fogelman, and L. Steels, editors, Connectionism in Perspective, Amsterdam, 1989. Elsevier. Proceedings of the International Conference Connectionism in Perspective, University of Zurich, 10. -- 13. October 1988.
^ Klambauer, Günter; Unterthiner, Thomas; Mayr, Andreas; Hochreiter, Sepp (2017). "Self-Normalizing Neural Networks". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
^ Erhan, Dumitru; Courville, Aaron; Bengio, Yoshua; Vincent, Pascal (2010-03-31). "Why Does Unsupervised Pre-training Help Deep Learning?". Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings: 201–208.
^ Bengio, Yoshua; Lamblin, Pascal; Popovici, Dan; Larochelle, Hugo (2006). "Greedy Layer-Wise Training of Deep Networks". Advances in Neural Information Processing Systems. 19. MIT Press.
^ Glorot, Xavier; Bordes, Antoine; Bengio, Yoshua (2011-06-14). "Deep Sparse Rectifier Neural Networks". Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings: 315–323.
^ Sutskever, Ilya; Martens, James; Dahl, George; Hinton, Geoffrey (2013-05-26). "On the importance of initialization and momentum in deep learning" (PDF). Proceedings of the 30th International Conference on Machine Learning. PMLR: 1139–1147.