How is gradient descent different from stochastic gradient descent?

Standard (batch) gradient descent computes the gradient over the entire dataset for each update; stochastic gradient descent estimates it from one example at a time, making each step faster but noisier.

Can gradient descent get stuck in a local minimum?

Yes, especially with non-convex loss surfaces, though modern variants and random initialization often help the optimizer reach good solutions anyway.
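The effect of the starting point can be illustrated with a small sketch. The function below is a hypothetical non-convex example chosen for illustration, not one taken from any particular model; it has a shallow local minimum and a deeper global one, and the step size and step count are likewise assumptions.

```python
def f(x):
    # Hypothetical non-convex function with two minima.
    return x ** 4 - 3.0 * x ** 2 + x

def df(x):
    # Derivative of f with respect to x.
    return 4.0 * x ** 3 - 6.0 * x + 1.0

def descend(x_start, learning_rate=0.01, num_steps=500):
    # Plain gradient descent from a given starting point.
    x = x_start
    for _ in range(num_steps):
        x -= learning_rate * df(x)
    return x

x_from_right = descend(2.0)   # settles in the shallower local minimum
x_from_left = descend(-2.0)   # settles in the deeper global minimum

print(x_from_right, f(x_from_right))
print(x_from_left, f(x_from_left))
```

Starting from 2.0 the optimizer settles near x = 1.13, while starting from -2.0 it reaches the lower minimum near x = -1.30. Trying several random starting points and keeping the best result is one simple way to reduce the risk of ending in the poorer minimum.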

What is Gradient Descent?

Gradient descent is an optimization algorithm that finds the minimum of a function by repeatedly moving in the direction of the steepest downward slope. In machine learning it is used to minimize a model's error by adjusting parameters step by step.

The algorithm calculates the gradient (slope) of the loss function with respect to the model parameters. It then updates each parameter by subtracting a fraction of this gradient, moving the model closer to lower error.
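The update step described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical one-parameter loss (w - 3)^2 whose minimum is at w = 3; the function names, starting value, and step count are illustrative choices.

```python
def loss(w):
    # Hypothetical one-parameter loss with its minimum at w = 3.
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of the loss with respect to w.
    return 2.0 * (w - 3.0)

def gradient_descent(w_start, learning_rate, num_steps):
    w = w_start
    for _ in range(num_steps):
        # Move against the gradient, scaled by the learning rate.
        w = w - learning_rate * gradient(w)
    return w

w_final = gradient_descent(w_start=0.0, learning_rate=0.1, num_steps=100)
print(w_final, loss(w_final))
```

In a real model the parameter is a vector or a set of weight matrices and the gradient is computed automatically, but the update has the same form.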

A key hyperparameter is the learning rate, which controls the size of each step. Too large a rate can overshoot the minimum; too small a rate makes training slow.
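To make the trade-off concrete, the sketch below runs the same number of steps with three learning rates on the hypothetical loss w^2, whose gradient is 2w. The specific rates are assumptions picked to show each regime, not recommended values.

```python
def run(learning_rate, num_steps=20, w_start=1.0):
    # Returns the loss w^2 before training and after every update.
    w = w_start
    history = [w * w]
    for _ in range(num_steps):
        w -= learning_rate * 2.0 * w  # gradient of w^2 is 2w
        history.append(w * w)
    return history

too_large = run(1.1)    # each step overshoots and the loss grows
too_small = run(0.001)  # the loss barely moves in 20 steps
moderate = run(0.3)     # the loss shrinks quickly toward zero

for name, hist in [("too large", too_large),
                   ("too small", too_small),
                   ("moderate", moderate)]:
    print(name, hist[0], hist[-1])
```

With the largest rate the parameter jumps past the minimum by more than it started with, so the loss climbs at every step; with the smallest rate the loss is still above 0.9 after 20 steps.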

Variants such as stochastic gradient descent and mini-batch gradient descent compute each update from a subset of the data, which makes updates faster. The noise this introduces often helps the model escape poor local minima.
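A mini-batch variant might look like the following sketch, which fits a line y = w*x + b with a mean squared error loss. The synthetic data, batch size, learning rate, and function name are all assumptions made for illustration.

```python
import random

def minibatch_sgd(xs, ys, batch_size, learning_rate, num_epochs, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(num_epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = 0.0
            grad_b = 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * error * xs[i]
                grad_b += 2.0 * error
            # Average the gradient over the batch, then take one step.
            w -= learning_rate * grad_w / len(batch)
            b -= learning_rate * grad_b / len(batch)
    return w, b

# Synthetic, noise-free data generated from y = 2x + 1.
xs = [i / 10.0 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = minibatch_sgd(xs, ys, batch_size=4, learning_rate=0.1, num_epochs=200)
print(w, b)
```

Setting batch_size to 1 gives stochastic gradient descent, and setting it to len(xs) gives standard full-batch gradient descent.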

Example

Imagine walking down a foggy hill to reach the lowest point: at each step you feel the slope beneath your feet and take a small step downhill. After many such steps you arrive near the bottom, just as gradient descent iteratively reduces a model's loss.

Why it matters

Gradient descent (and its variants) is the core engine behind training virtually all modern neural networks and many other machine-learning models, enabling them to learn from data at scale.

Frequently asked questions

What happens if the learning rate is too high?

The updates may overshoot the minimum, causing the loss to increase or oscillate instead of converging.

Related terms

Loss Function

A loss function quantifies how far a model's predictions are from the true values, serving as the objective that training tries to minimize.

Learning Rate

The learning rate is a hyperparameter that controls the size of the steps an optimization algorithm takes when updating a model's parameters during training.

Backpropagation

Backpropagation is an algorithm for training neural networks by calculating how much each weight contributed to the prediction error and adjusting those weights accordingly. It uses the chain rule to efficiently compute gradients of the loss function.
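The chain-rule computation can be shown on a deliberately tiny network. The sketch below assumes a hypothetical model with one hidden tanh unit and two weights, w1 and w2, and checks the hand-derived gradients against finite differences; it is an illustration, not how a framework implements backpropagation.

```python
import math

def forward(w1, w2, x, y):
    h = math.tanh(w1 * x)      # hidden activation
    y_hat = w2 * h             # prediction
    loss = (y_hat - y) ** 2    # squared error
    return h, y_hat, loss

def backward(w1, w2, x, y):
    h, y_hat, _ = forward(w1, w2, x, y)
    dloss_dyhat = 2.0 * (y_hat - y)
    dloss_dw2 = dloss_dyhat * h          # y_hat = w2 * h
    dloss_dh = dloss_dyhat * w2
    dh_dz = 1.0 - h * h                  # derivative of tanh
    dloss_dw1 = dloss_dh * dh_dz * x     # z = w1 * x
    return dloss_dw1, dloss_dw2

def numerical_gradients(w1, w2, x, y, eps=1e-6):
    # Central finite differences, used only as a check.
    g1 = (forward(w1 + eps, w2, x, y)[2] - forward(w1 - eps, w2, x, y)[2]) / (2 * eps)
    g2 = (forward(w1, w2 + eps, x, y)[2] - forward(w1, w2 - eps, x, y)[2]) / (2 * eps)
    return g1, g2

analytic = backward(0.5, -1.5, 0.8, 1.0)
numeric = numerical_gradients(0.5, -1.5, 0.8, 1.0)
print(analytic, numeric)
```

Each gradient is a product of local derivatives along the path from the loss back to the weight, which is why the intermediate values from the forward pass are reused in the backward pass.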

Adam Optimizer

Adam (Adaptive Moment Estimation) is a popular optimization algorithm used to train machine learning models. It adapts the step size for each parameter using running estimates of the first and second moments of the gradients.
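For a single parameter, the Adam update can be sketched as below. The loss (w - 3)^2, the learning rate, and the step count are assumptions for illustration; beta1 = 0.9, beta2 = 0.999, and eps = 1e-8 are the commonly used default values.

```python
import math

def adam_minimize(gradient, w_start, learning_rate=0.1, num_steps=500,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    w = w_start
    m = 0.0  # running average of gradients (first moment)
    v = 0.0  # running average of squared gradients (second moment)
    for t in range(1, num_steps + 1):
        g = gradient(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Hypothetical loss (w - 3)^2, whose gradient is 2 * (w - 3).
w_adam = adam_minimize(lambda w: 2.0 * (w - 3.0), w_start=0.0)
print(w_adam)
```

Dividing by the square root of the second-moment estimate keeps the step size roughly on the scale of the learning rate regardless of how large the raw gradient is.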

Classification

Classification is a supervised machine learning task that assigns input data to one of several predefined categories or classes based on patterns learned from labeled training examples.

Clustering

Clustering is an unsupervised machine learning technique that automatically groups similar data points together into clusters based on their features, without using any labeled examples.