How does batch norm behave at inference time?

It uses fixed running averages of mean and variance computed during training rather than statistics from a single test batch.

What is Batch Normalization?

Batch Normalization is a technique in neural networks that normalizes the inputs to each layer by subtracting the batch mean and dividing by the batch standard deviation, then scaling and shifting with learnable parameters.

It works by computing the mean and variance of each feature across a mini-batch during training. The activations are then normalized to have zero mean and unit variance before being scaled by gamma and shifted by beta parameters that the network learns.
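The training-time computation described above can be sketched in plain Python. This is a minimal illustration rather than any library's implementation: the function name `batch_norm_train`, the list-of-lists layout, and the `eps` default of 1e-5 are assumptions made for the example.

```python
import math

def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Normalize each feature across the mini-batch, then scale and shift.

    batch: list of samples, each a list of feature values.
    gamma, beta: per-feature scale and shift (learned in a real network).
    eps: small constant added to the variance for numerical stability
         (assumed default, not taken from the text).
    """
    n = len(batch)
    num_features = len(batch[0])
    # Per-feature mean over the batch dimension.
    mean = [sum(row[j] for row in batch) / n for j in range(num_features)]
    # Per-feature (biased) variance over the batch dimension.
    var = [
        sum((row[j] - mean[j]) ** 2 for row in batch) / n
        for j in range(num_features)
    ]
    # Normalize to zero mean and unit variance, then apply gamma and beta.
    out = [
        [
            gamma[j] * (row[j] - mean[j]) / math.sqrt(var[j] + eps) + beta[j]
            for j in range(num_features)
        ]
        for row in batch
    ]
    return out, mean, var

# Three samples with two features on very different scales.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
out, mean, var = batch_norm_train(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With gamma set to 1 and beta to 0, both output columns end up on the same scale even though the inputs differ by a factor of ten.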

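Batch normalization was introduced to reduce internal covariate shift, the change in the distribution of a layer's inputs as earlier layers update. Whether that is the actual mechanism is debated, and later work attributes much of the benefit to a smoother optimization landscape. In practice it allows higher learning rates and faster convergence, and the noise in the batch statistics provides a mild regularization effect that can reduce the need for dropout.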

During inference, running averages of mean and variance collected during training are used instead of batch statistics to ensure consistent predictions.
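One common way to collect those running averages is an exponential moving average updated on every training step. The sketch below assumes that scheme; the class name `RunningBatchNorm`, the `momentum` value of 0.1, and the use of the biased batch variance in the update are illustrative choices, and real libraries differ in such details.

```python
import math

class RunningBatchNorm:
    """Hypothetical 1-D batch norm layer that tracks running statistics."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = [1.0] * num_features
        self.beta = [0.0] * num_features
        self.running_mean = [0.0] * num_features
        self.running_var = [1.0] * num_features
        self.momentum = momentum  # weight given to the newest batch (assumed)
        self.eps = eps

    def forward(self, batch, training):
        d = len(self.gamma)
        if training:
            # Training: use statistics of the current mini-batch.
            n = len(batch)
            mean = [sum(row[j] for row in batch) / n for j in range(d)]
            var = [
                sum((row[j] - mean[j]) ** 2 for row in batch) / n
                for j in range(d)
            ]
            # Fold the batch statistics into the running averages.
            m = self.momentum
            for j in range(d):
                self.running_mean[j] = (1 - m) * self.running_mean[j] + m * mean[j]
                self.running_var[j] = (1 - m) * self.running_var[j] + m * var[j]
        else:
            # Inference: use the fixed running averages instead.
            mean, var = self.running_mean, self.running_var
        return [
            [
                self.gamma[j] * (row[j] - mean[j]) / math.sqrt(var[j] + self.eps)
                + self.beta[j]
                for j in range(d)
            ]
            for row in batch
        ]

layer = RunningBatchNorm(num_features=2)
layer.forward([[1.0, 10.0], [3.0, 30.0]], training=True)

# At inference the same input gives the same output, whatever else is in the batch.
alone = layer.forward([[2.0, 20.0]], training=False)
with_others = layer.forward([[2.0, 20.0], [100.0, -50.0]], training=False)
```

Because inference mode never reads the current batch's statistics, `alone[0]` and `with_others[0]` are identical, which is the consistency property described above.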

Example

In a CNN for classifying images, batch normalization layers are often inserted after convolutional layers, where they normalize each feature map (channel) of the layer's output. This lets the network train effectively even when the distribution of those activations varies across batches of photos.
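For convolutional feature maps, the statistics are usually computed per channel, pooling over the batch and both spatial dimensions. The following is a rough sketch of that idea on nested lists shaped `[N][C][H][W]`; the function name `batch_norm_2d` and the data layout are assumptions for illustration, and a real network would use a tensor library instead.

```python
import math

def batch_norm_2d(images, gamma, beta, eps=1e-5):
    """Per-channel batch norm for nested lists shaped [N][C][H][W]."""
    num_channels = len(images[0])
    # Copy the input structure so the original data is left untouched.
    out = [[[list(row) for row in channel] for channel in img] for img in images]
    for c in range(num_channels):
        # Gather every value of channel c across all images and positions.
        values = [v for img in images for row in img[c] for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = gamma[c] / math.sqrt(var + eps)
        for n, img in enumerate(images):
            for i, row in enumerate(img[c]):
                for j, v in enumerate(row):
                    out[n][c][i][j] = scale * (v - mean) + beta[c]
    return out

# Two tiny 2x2 "images" with two channels on different scales.
images = [
    [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]],
    [[[5.0, 6.0], [7.0, 8.0]], [[50.0, 60.0], [70.0, 80.0]]],
]
normalized = batch_norm_2d(images, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

Each channel gets a single mean and variance, so all spatial positions of a feature map are shifted and scaled the same way.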

Why it matters

Batch Normalization made training much deeper networks practical and reliable, and it became a standard component of convolutional architectures such as ResNet. Transformer models typically use the closely related layer normalization instead.

Frequently asked questions

Batch normalization stabilizes training by keeping activations in a well-scaled range, which allows faster convergence and deeper models while reducing the risk of vanishing or exploding gradients.

Related terms

Neural Network

A neural network, or artificial neural network (ANN), is a computational model inspired by the human brain that learns to recognize patterns in data by passing information through layers of interconnected artificial neurons.

Deep Learning

Deep Learning is a subset of machine learning that uses multi-layered artificial neural networks to automatically learn complex patterns from large datasets.

Dropout

Dropout is a regularization technique used during neural network training that randomly sets a fraction of neurons to zero on each forward pass to reduce overfitting.
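A small sketch of the idea, assuming the common "inverted" variant in which surviving activations are rescaled by 1 / (1 - p) so that their expected value is unchanged. The rescaling, the function name `dropout`, and its parameters are assumptions of this example rather than part of the definition above.

```python
import random

def dropout(activations, p, training, rng=random):
    """Zero each activation with probability p during training.

    Survivors are divided by (1 - p) (inverted dropout, an assumed variant).
    At inference the activations pass through unchanged.
    """
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded generator so the example is repeatable
activations = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False)
```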

Gradient Descent

Gradient descent is an optimization algorithm that finds the minimum of a function by repeatedly moving in the direction of the steepest downward slope. In machine learning it is used to minimize a model's error by adjusting parameters step by step.
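As a toy illustration, the loop below minimizes the one-dimensional function f(x) = (x - 3)^2. The function name `gradient_descent`, the learning rate, and the step count are arbitrary choices for the sketch.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        # Move a small distance in the direction of steepest descent.
        x = x - learning_rate * grad(x)
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

Training a neural network follows the same pattern, with the model's parameters in place of `x` and the gradient of the loss in place of `grad`.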

Activation Function

An activation function is a mathematical function applied to the output of a neuron in a neural network. It introduces non-linearity and determines how strongly the neuron 'fires' and passes on its signal.

Autoencoder

An autoencoder is a neural network that learns to compress input data into a smaller representation and then reconstruct the original data from that compressed form.