Gradient Descent (GD), Stochastic Gradient Descent (SGD), and Mini-Batch Stochastic Gradient Descent (Mini-Batch SGD) are optimization algorithms used to train machine learning models, particularly neural networks. They differ in how they update model parameters during training, and each has its advantages and disadvantages. Here's an overview of each:
1. Gradient Descent (GD): Computes the gradient of the loss over the entire training set and makes one parameter update per full pass (epoch). Updates are stable and deterministic, but each one is expensive and requires processing the whole dataset.
2. Stochastic Gradient Descent (SGD): Updates the parameters after computing the gradient on a single randomly selected training example. Updates are cheap and frequent, but noisy, so the loss fluctuates as it decreases.
3. Mini-Batch Stochastic Gradient Descent (Mini-Batch SGD): Updates the parameters using the gradient averaged over a small random batch of examples (commonly 32 to 256). This balances the stability of GD with the update frequency of SGD and maps well onto vectorized hardware (the three update rules are sketched in the code after this list).
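To make the difference concrete, here is a minimal NumPy sketch of one update step under each scheme, assuming a linear-regression model with squared-error loss; the names `lr`, `w`, `X`, `y`, and `grad` are illustrative choices for this example, not a fixed API.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # 1000 examples, 5 features
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)                           # model parameters
lr = 0.1                                  # illustrative learning rate

def grad(Xb, yb, w):
    """Gradient of mean squared error over the batch (Xb, yb)."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

# 1. Gradient Descent: one update per pass over the FULL dataset.
w_gd = w - lr * grad(X, y, w)

# 2. Stochastic Gradient Descent: one update per SINGLE random example.
i = rng.integers(len(y))
w_sgd = w - lr * grad(X[i:i+1], y[i:i+1], w)

# 3. Mini-Batch SGD: one update per small random batch (here 32 examples).
idx = rng.choice(len(y), size=32, replace=False)
w_mb = w - lr * grad(X[idx], y[idx], w)
```

The only thing that changes between the three variants is how much data feeds each gradient estimate, which is exactly what drives the trade-offs below.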
Comparison:
Convergence Speed: GD makes only one parameter update per full pass over the data, so on large datasets it is the slowest to make progress for a given amount of computation. Pure SGD updates most frequently, but each step is noisy, which can slow final convergence. Mini-Batch SGD sits between the two and, because batches can be processed with vectorized operations, often converges faster in wall-clock time than either extreme.
Generalization: Mini-Batch SGD and SGD typically achieve better generalization than GD, thanks to the noise introduced by the random sampling of examples. This noise can help the optimization process escape local minima.
Memory Usage: GD requires the most memory because it processes the entire dataset at once. Mini-Batch SGD is less memory-intensive, and SGD is the least memory-intensive as it operates on one example at a time.
Practical Use: Mini-Batch SGD is the most commonly used optimization scheme in deep learning because it balances convergence speed, generalization, and efficient use of vectorized hardware (a minimal training loop is sketched below).
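As a rough illustration of that practical recipe, here is a minimal sketch of a full Mini-Batch SGD training loop, again for linear regression with squared-error loss; `lr`, `batch_size`, and `epochs` are illustrative hyperparameters, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr, batch_size, epochs = 0.05, 32, 20

for epoch in range(epochs):
    perm = rng.permutation(len(y))                 # reshuffle examples each epoch
    for start in range(0, len(y), batch_size):
        batch = perm[start:start + batch_size]     # indices of the current mini-batch
        Xb, yb = X[batch], y[batch]
        g = 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)   # MSE gradient on the batch
        w -= lr * g                                # one parameter update per batch
    loss = np.mean((X @ w - y) ** 2)
    print(f"epoch {epoch}: loss {loss:.4f}")       # full-dataset loss as a progress check
```

Reshuffling at the start of every epoch keeps the batches decorrelated across passes, which preserves the beneficial gradient noise mentioned under Generalization above.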
The choice between these optimization algorithms depends on factors like the dataset size, available memory, and desired training speed. Mini-Batch SGD is a popular choice for training deep neural networks because it combines the advantages of both GD and pure SGD while mitigating their drawbacks.