Gradient Descent (GD), Stochastic Gradient Descent (SGD), and Mini-Batch Stochastic Gradient Descent (Mini-Batch SGD) are optimization algorithms used to train machine learning models, particularly neural networks. They differ in how they update model parameters during training, and each has its advantages and disadvantages. Here's an overview of each:
1. Gradient Descent (GD): Computes the gradient of the loss over the entire training set and makes one parameter update per full pass (epoch). Updates are stable and deterministic, but each one is expensive and requires processing the whole dataset.
2. Stochastic Gradient Descent (SGD): Updates the parameters after computing the gradient on a single randomly selected training example. Updates are cheap and frequent, but noisy, so the loss fluctuates as it decreases.
3. Mini-Batch Stochastic Gradient Descent (Mini-Batch SGD): Updates the parameters using the gradient averaged over a small random batch of examples (commonly 32 to 256). This balances the stability of GD with the update frequency of SGD and maps well onto vectorized hardware (the three update rules are sketched in the code after this list).
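To make the difference concrete, here is a minimal NumPy sketch of one update step under each scheme, assuming a linear-regression model with squared-error loss; the names `lr`, `w`, `X`, `y`, and `grad` are illustrative choices for this example, not a fixed API.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # 1000 examples, 5 features
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)                           # model parameters
lr = 0.1                                  # illustrative learning rate

def grad(Xb, yb, w):
    """Gradient of mean squared error over the batch (Xb, yb)."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

# 1. Gradient Descent: one update per pass over the FULL dataset.
w_gd = w - lr * grad(X, y, w)

# 2. Stochastic Gradient Descent: one update per SINGLE random example.
i = rng.integers(len(y))
w_sgd = w - lr * grad(X[i:i+1], y[i:i+1], w)

# 3. Mini-Batch SGD: one update per small random batch (here 32 examples).
idx = rng.choice(len(y), size=32, replace=False)
w_mb = w - lr * grad(X[idx], y[idx], w)
```

The only thing that changes between the three variants is how much data feeds each gradient estimate, which is exactly what drives the trade-offs below.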
Comparison:
Convergence Speed: GD makes only one parameter update per full pass over the data, so on large datasets it is the slowest to make progress for a given amount of computation. Pure SGD updates most frequently, but each step is noisy, which can slow final convergence. Mini-Batch SGD sits between the two and, because batches can be processed with vectorized operations, often converges faster in wall-clock time than either extreme.
Generalization: Mini-Batch SGD and SGD typically achieve better generalization than GD, thanks to the noise introduced by the random sampling of examples. This noise can help the optimization process escape local minima.
Memory Usage: GD requires the most memory because it processes the entire dataset at once. Mini-Batch SGD is less memory-intensive, and SGD is the least memory-intensive as it operates on one example at a time.
Practical Use: Mini-Batch SGD is the most commonly used optimization scheme in deep learning because it balances convergence speed, generalization, and efficient use of vectorized hardware (a minimal training loop is sketched below).
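As a rough illustration of that practical recipe, here is a minimal sketch of a full Mini-Batch SGD training loop, again for linear regression with squared-error loss; `lr`, `batch_size`, and `epochs` are illustrative hyperparameters, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr, batch_size, epochs = 0.05, 32, 20

for epoch in range(epochs):
    perm = rng.permutation(len(y))                 # reshuffle examples each epoch
    for start in range(0, len(y), batch_size):
        batch = perm[start:start + batch_size]     # indices of the current mini-batch
        Xb, yb = X[batch], y[batch]
        g = 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)   # MSE gradient on the batch
        w -= lr * g                                # one parameter update per batch
    loss = np.mean((X @ w - y) ** 2)
    print(f"epoch {epoch}: loss {loss:.4f}")       # full-dataset loss as a progress check
```

Reshuffling at the start of every epoch keeps the batches decorrelated across passes, which preserves the beneficial gradient noise mentioned under Generalization above.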
The choice between these optimization algorithms depends on factors like the dataset size, available memory, and desired training speed. Mini-Batch SGD is a popular choice for training deep neural networks because it combines the advantages of both GD and pure SGD while mitigating their drawbacks.