Batch Normalization (BatchNorm) and Layer Normalization (LayerNorm) are techniques used to normalize the activations of neural networks during training. Both methods aim to improve training stability, accelerate convergence, and enable the training of deeper networks. However, they differ in how they normalize the activations and where normalization is applied within the network. Here's a comparison of BatchNorm and LayerNorm:
Batch Normalization (BatchNorm):
Normalization Scope:
- BatchNorm normalizes activations across the entire mini-batch for a given layer.
Normalization Axes:
- It normalizes along the batch axis (axis 0), computing statistics (mean and variance) across the examples in the mini-batch but independently for each feature dimension (a minimal code sketch follows this section).
Effect on Activation Statistics:
- BatchNorm shifts and rescales each feature dimension so that, across the mini-batch, its activations have approximately zero mean and unit variance before the learnable scale and shift are applied.
Training vs. Inference:
- BatchNorm behaves differently at training and inference time. During training it normalizes with the statistics of the current mini-batch while updating running (moving-average) estimates; at inference it uses those running estimates instead of batch statistics.
Regularization Effect:
- BatchNorm has a slight regularization effect due to the noise introduced by batch statistics during training.
Applications:
- Commonly used in convolutional neural networks (CNNs) and fully connected layers in various deep learning architectures.
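
As a concrete illustration of the batch-axis statistics described above, here is a minimal NumPy sketch of BatchNorm for a fully connected layer. The function names, the momentum value, and eps are illustrative assumptions, not any particular framework's API:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, running_mean, running_var,
                     momentum=0.1, eps=1e-5):
    """x: (batch, features). Statistics are computed over the batch axis
    (axis 0), independently for each feature dimension."""
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # ~zero mean, unit variance per feature
    out = gamma * x_hat + beta               # learnable scale and shift

    # Update the moving averages used at inference time
    running_mean = (1 - momentum) * running_mean + momentum * mean
    running_var = (1 - momentum) * running_var + momentum * var
    return out, running_mean, running_var

def batch_norm_infer(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Inference uses the moving averages, not the current batch's statistics."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```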
Layer Normalization (LayerNorm):
Normalization Scope:
- LayerNorm normalizes activations within a single example, computing statistics over all of that layer's features (hidden units) rather than across the batch.
Normalization Axes:
- It normalizes along the feature dimension (axis 1 for a (batch, features) tensor); in CNNs this typically means normalizing over the channel (and spatial) dimensions of each example. A code sketch follows this section.
Effect on Activation Statistics:
- LayerNorm shifts and rescales the activations of each example so that, across that example's features, they have approximately zero mean and unit variance before the learnable gain and bias are applied.
Training vs. Inference:
- LayerNorm computes its statistics (mean and variance) per example on the fly, so it behaves identically during training and inference and needs no running averages.
Regularization Effect:
- LayerNorm provides little of the noise-driven regularization that BatchNorm does, because each example is normalized deterministically from its own statistics; its main advantages are insensitivity to batch size and consistent behavior between training and inference.
Applications:
- LayerNorm is often used in recurrent neural networks (RNNs), transformer-based models (e.g., BERT), and other architectures where normalization across the batch dimension is less suitable.
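
For contrast, here is a minimal NumPy sketch of LayerNorm under the same assumptions (illustrative names, assumed eps). The statistics are taken over each example's features, so nothing batch-dependent needs to be tracked and the same code serves both training and inference:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Statistics are computed over the feature axis
    (axis 1), independently for each example, so no mini-batch is required
    and training/inference behave identically."""
    mean = x.mean(axis=1, keepdims=True)     # per-example mean over features
    var = x.var(axis=1, keepdims=True)       # per-example variance over features
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta              # learnable per-feature gain and bias
```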
Comparison Summary:
BatchNorm normalizes activations across the entire mini-batch, while LayerNorm normalizes within each example along the feature dimension.
BatchNorm relies on batch statistics, whose noise acts as a mild regularizer during training; LayerNorm relies on per-example statistics, so it is deterministic and independent of batch size.
BatchNorm is commonly applied in CNNs and fully connected layers, while LayerNorm is often used in RNNs and transformer architectures.
Both techniques help mitigate the vanishing/exploding gradient problem and improve training stability, but the choice between them depends on the specific network architecture and problem at hand.
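
In PyTorch, the contrast shows up directly in how the built-in modules are used; the tensor shapes below are illustrative:

```python
import torch
import torch.nn as nn

x = torch.randn(32, 64)    # (batch=32, features=64)

bn = nn.BatchNorm1d(64)    # normalizes over the batch axis, per feature
ln = nn.LayerNorm(64)      # normalizes over the feature axis, per example

y_bn_train = bn(x)         # training mode: uses the current batch's statistics
bn.eval()
y_bn_eval = bn(x)          # eval mode: uses the stored running mean/variance

y_ln = ln(x)               # identical behavior in train and eval mode
```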