Tuesday, February 11, 2020

Why does internal covariate shift slow down training?

Let’s say you have a goal to reach. Which is easier: a fixed goal, or one that keeps moving about? Clearly, a static goal is much easier to reach than a moving one.
Each layer in a neural net has a simple goal: to model the input it receives from the layer below. Each layer therefore tries to adapt to its input, but for hidden layers things get complicated. The statistical distribution of a hidden layer’s input changes every few iterations as the layers below it update their weights. This constantly shifting input distribution is called internal covariate shift, and it forces the hidden layers to keep re-adapting to a new distribution, which slows down convergence. For hidden layers, it is like chasing a goal that keeps moving.
So the batch normalization (BN) algorithm normalizes the inputs to each hidden layer so that their distribution stays fairly constant as training proceeds. This improves the convergence of the neural net.

If a neuron could see in advance the entire range of values it will receive across the whole training set, training could be made faster by normalizing that distribution (i.e., making its mean zero and its variance one) and decorrelating its dimensions. The authors call this whitening the inputs:
It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated. As each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer. By whitening the inputs to each layer, we would take a step towards achieving the fixed distributions of inputs that would remove the ill effects of the internal covariate shift.
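To make the idea concrete, here is a minimal NumPy sketch of whitening. The toy data matrix, sample count, and the PCA-based method are my own illustrative choices, not something prescribed by the paper:

```python
import numpy as np

# A sketch of input whitening: zero mean, unit variance, decorrelated.
# X is a toy data matrix of shape (n_samples, n_features).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))  # correlated toy data

X_centered = X - X.mean(axis=0)                  # zero mean
cov = np.cov(X_centered, rowvar=False)           # feature covariance
eigvals, eigvecs = np.linalg.eigh(cov)           # eigendecomposition of cov
X_white = X_centered @ eigvecs / np.sqrt(eigvals + 1e-5)  # decorrelate, unit variance

print(np.cov(X_white, rowvar=False).round(2))    # approximately the identity matrix
```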

However, for a large training set, computing these statistics over all examples after every parameter update is not computationally feasible. Hence, each activation is normalized independently, using the mean and variance of the current mini-batch.
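As a rough illustration, this is what per-mini-batch normalization looks like in NumPy (the function name and the shapes here are just for illustration):

```python
import numpy as np

def batch_normalize(x, eps=1e-5):
    """Normalize each feature to zero mean and unit variance using
    statistics of the current mini-batch only.
    x: mini-batch of activations, shape (batch_size, num_features)."""
    mu = x.mean(axis=0)                 # per-feature batch mean
    var = x.var(axis=0)                 # per-feature batch variance
    return (x - mu) / np.sqrt(var + eps)

batch = np.random.randn(32, 4) * 10 + 3   # mini-batch with shifted, scaled features
x_hat = batch_normalize(batch)
print(x_hat.mean(axis=0).round(3), x_hat.std(axis=0).round(3))  # ~0 and ~1
```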

Batch Normalization may not always be optimal for learning

Plain normalization may, for example, force the inputs of a sigmoid to always lie in its linear range. Hence the normalized inputs are scaled and shifted by parameters γ and β, whose optimal values are learned by backpropagation along with the other network weights.
Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation x^(k), a pair of parameters γ^(k), β^(k), which scale and shift the normalized value:
y^(k) = γ^(k) x̂^(k) + β^(k).
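Putting the two pieces together, here is a small NumPy sketch of the full transform (names and shapes are again illustrative). Note that with γ = √Var[x] and β = E[x], the transform recovers the original activations, so BN can represent the identity when that is what the network needs:

```python
import numpy as np

def batch_norm_transform(x, gamma, beta, eps=1e-5):
    """Full BN transform: normalize per mini-batch, then scale by gamma
    and shift by beta (both learned by backpropagation in practice)."""
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    return gamma * x_hat + beta        # y^(k) = gamma^(k) * x_hat^(k) + beta^(k)

x = np.random.randn(32, 4)
# With gamma = sqrt(var) and beta = mean, the transform undoes the
# normalization, i.e. BN can represent the identity transform.
gamma, beta = np.sqrt(x.var(axis=0) + 1e-5), x.mean(axis=0)
print(np.allclose(batch_norm_transform(x, gamma, beta), x))  # True
```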

So my question is: is BN only used to speed up training, or does it also improve accuracy? When I tested it on a CNN, the final performance actually improved as well!
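For reference, here is a minimal sketch (assuming PyTorch; the layer sizes and class count are arbitrary choices of mine) of where BN layers are typically inserted in a small CNN: after each convolution, before the nonlinearity:

```python
import torch.nn as nn

# A small CNN with BatchNorm2d after each convolution.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),    # normalizes over the batch for each of 16 channels
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 10),     # e.g. 10 output classes
)
```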
