Why do we often subtract the gradient during training?

Subtracting moves parameters opposite the direction of steepest error increase, which reduces the loss.

What does a zero gradient mean?

It usually signals a local minimum, maximum, or saddle point where the loss cannot be improved by small changes in any direction.

What is Gradient?

In machine learning, a gradient is a vector of partial derivatives that shows how a model's error (loss) changes with respect to its parameters. It points toward the steepest increase in error, so optimization algorithms move in the opposite direction to improve the model.

Mathematically, the gradient of a multi-variable function is a vector containing its partial derivatives with respect to each input variable, representing the slope or rate of change in every direction.

In ML, gradients are calculated for the loss function during training; they indicate how small changes to weights and biases affect overall prediction error.

This information drives iterative updates in algorithms like gradient descent, allowing models to learn by gradually reducing errors.

Example

When training a simple linear regression model to predict house prices, the gradient for the weight on 'square footage' might show that increasing the weight reduces error, so the algorithm adjusts that weight accordingly in the next step.

Why it matters

Gradients power the training of nearly all modern neural networks and other ML models via gradient-based optimization, enabling scalable learning from data.

Frequently asked questions

A derivative measures change for a single variable, while a gradient is the vector of derivatives for multiple variables.

Related terms

Gradient Descent

Gradient descent is an optimization algorithm that finds the minimum of a function by repeatedly moving in the direction of the steepest downward slope. In machine learning it is used to minimize a model's error by adjusting parameters step by step.

Backpropagation

Backpropagation is an algorithm for training neural networks by calculating how much each weight contributed to the prediction error and adjusting those weights accordingly. It uses the chain rule to efficiently compute gradients of the loss function.

Loss Function

A loss function quantifies how far a model's predictions are from the true values, serving as the objective that training tries to minimize.

Learning Rate

The learning rate is a hyperparameter that controls the size of the steps an optimization algorithm takes when updating a model's parameters during training.

Active Learning

Active learning is a machine learning technique where the model itself selects the most informative unlabeled data points to be labeled by a human, rather than labeling data randomly or all at once.

Adam Optimizer

Adam (Adaptive Moment Estimation) is a popular optimization algorithm used to train machine learning models by iteratively updating parameters based on gradients.