2 | FILTER INITIALIZATION
Filter initialization is a crucial factor that, in some cases, affects the final accuracy of feed-forward networks (FNNs) more than the learning algorithm itself. Smaller initial weights produce smaller gradients, which slows down training, while larger initial weights can saturate activations or destabilize training. Proper weight initialization therefore prevents the output of any layer from exploding or vanishing through the activations and is considered a critical factor for the speed and ability of a network to converge. When CNNs were first proposed, it was common to initialize the weights with Gaussian noise with zero mean and a standard deviation of 0.01. Over the years, other initialization techniques have been proposed to prevent exploding gradients, vanishing gradients, and dead neurons. The optimal filter initialization is still seen as an open question, given its relation to the training sample labels, the architecture, the objective function, and the type of outcome expected from the algorithm. Three scenarios are commonly used to initialize the weights.
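For reference, the early convention mentioned above amounts to a single draw from a zero-mean Gaussian. The sketch below uses NumPy; the filter shape (64 output channels, 3 input channels, 5x5 kernels) is purely illustrative.

```python
import numpy as np

# Classic early CNN initialization: zero-mean Gaussian noise with a fixed
# standard deviation of 0.01, independent of the layer's fan-in or fan-out.
rng = np.random.default_rng(seed=0)
weights = rng.normal(loc=0.0, scale=0.01, size=(64, 3, 5, 5))  # illustrative shape
```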
2.1 | Random initialization
In random initialization, the weights are initialized with random values near zero, usually drawn from a normal or uniform distribution. The issue with random initialization is inconsistency: small differences in the initial values propagate through the convolutions and, over the epochs, can lead to different outcomes. Values that are too small imply slow learning, proneness to local minima, and a possible vanishing gradient problem. On the other hand, values that are too large can saturate the neurons' outputs and create exploding gradients, which result in oscillation around the optimum or outright instability.
For deep networks, several refined versions have been proposed. LeCun proposed a uniform distribution between -2.4/Fi and 2.4/Fi, where Fi is the number of input nodes. The rationale is to keep the initial standard deviation of the weighted sum in the same range for all nodes and ensure that they fall within a specific operating region of the sigmoidal function. On the other hand, this ties the scheme to a particular activation function, and it is only applicable when all connections sharing similar weights belong to nodes with identical Fi. The general form of the variance can be written as k/n, where k is a constant that depends on the activation function and n is either the number of input nodes of the weight tensor or the combined number of input and output nodes of the weight tensor. Other widely used random initializations are Xavier/Glorot and He initialization, which use a normal distribution with mean zero and variance 1/n and 2/n, respectively; in some cases, uniform distributions are used instead. Xavier initialization is simple and keeps the variance of the activations the same across every layer, but it is not applicable when the activation function is non-differentiable. He initialization overcomes this limitation and is widely used with the ReLU activation function, which is non-differentiable at zero. LeCun, Xavier, and He initialization do not eliminate vanishing or exploding gradients, but they mitigate the problem to a large extent.
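As a concrete illustration of the k/n rule, the following sketch draws a convolutional filter from a zero-mean Gaussian whose variance follows the LeCun, Xavier/Glorot, or He scheme; the function name and the fan-in/fan-out bookkeeping are illustrative rather than taken from any particular library.

```python
import numpy as np

def init_conv_filter(shape, fan_in, fan_out, scheme="he"):
    """Zero-mean Gaussian conv filter whose variance follows the k/n rule.

    shape   : filter tensor shape, e.g. (out_channels, in_channels, kh, kw)
    fan_in  : input connections per unit  (in_channels * kh * kw)
    fan_out : output connections per unit (out_channels * kh * kw)
    """
    if scheme == "lecun":      # variance 1/n, n = fan_in
        var = 1.0 / fan_in
    elif scheme == "xavier":   # variance 1/n, n = mean of fan_in and fan_out (Glorot)
        var = 2.0 / (fan_in + fan_out)
    elif scheme == "he":       # variance 2/n, n = fan_in (suited to ReLU)
        var = 2.0 / fan_in
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return np.random.default_rng(0).normal(0.0, np.sqrt(var), size=shape)

# Example: 64 filters of size 3x3 over 32 input channels.
w = init_conv_filter((64, 32, 3, 3), fan_in=32 * 3 * 3, fan_out=64 * 3 * 3,
                     scheme="xavier")
```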
2.2 | Zero or constant initialization
In this type of initialization, all the weights are set either to zero or to a constant value (usually 1). Because all the weights are identical, every activation produces the same value, which makes the hidden layers symmetric. In the supervised setting, the derivative of the loss function is then the same for all the nodes in a filter, so the nodes learn identical features. Distance-based clustering methods would also not benefit from this initialization, since the constant weights would merely mimic the input values.
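The symmetry problem is easy to see in a toy forward/backward pass. The minimal sketch below (the input values, layer size, and ReLU choice are arbitrary, chosen only for illustration) shows that with constant weights every hidden unit computes the same output and receives the same gradient, so the units remain interchangeable after every update.

```python
import numpy as np

x = np.array([[0.5, 1.2, 0.3]])        # one sample with 3 features
W = np.full((3, 4), 1.0)               # constant initialization: all weights = 1

pre = x @ W                            # every hidden unit gets the same pre-activation
h = np.maximum(0.0, pre)               # ReLU: identical activations across units
grad_h = np.ones_like(h)               # pretend upstream gradient from the loss
grad_W = x.T @ (grad_h * (pre > 0))    # backprop through the ReLU

print(h)       # [[2. 2. 2. 2.]] -> identical outputs
print(grad_W)  # identical columns -> identical updates, symmetry is never broken
```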
2.3 | Pre-trained initialization
Compared with the above initializations, pre-trained initialization is a relatively new approach, and pre-trained weights are used in two main ways. The first is transfer learning, in which the trained weights of a pre-trained model are borrowed and used as the initial state before training begins for the current method; transferring knowledge in this way accelerates learning and generalization. Erhan et al. supported this claim with comprehensive experiments comparing existing algorithms initialized with pre-trained weights against the traditional approach. The second is the student-teacher method, where a large network is trained extensively and its optimized weights are transferred to a comparable lightweight architecture for lighter applications.
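A common way to realize transfer-learning initialization in practice is sketched below with PyTorch/torchvision (assuming torchvision >= 0.13 for the weights enum; the backbone choice, the 10-class head, and the freezing policy are illustrative assumptions, not prescriptions from the text).

```python
import torch.nn as nn
from torchvision import models

# Start from weights pre-trained on ImageNet instead of a random initialization.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace the classification head for the new task (here: 10 classes).
# Only this new layer starts from a fresh random initialization.
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

# Optionally freeze the transferred layers and fine-tune only the new head.
for name, param in backbone.named_parameters():
    param.requires_grad = name.startswith("fc.")
```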