What is Data Augmentation?

Data augmentation is a technique that artificially increases the size and diversity of a training dataset by creating modified versions of existing data samples.

It works by applying random or systematic transformations to original data points, such as rotating, flipping, or cropping images, or replacing words with synonyms in text. These changes create new training examples that are similar but not identical to the originals.
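As a rough illustration of the transformations mentioned above, here is a minimal sketch using only the Python standard library. Images are represented as nested lists of pixel values, and the `SYNONYMS` table, the function names, and the `prob` parameter are hypothetical choices made for this example; a real pipeline would more likely rely on an image or NLP library.

```python
import random

def flip_horizontal(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def random_crop(image, height, width, rng):
    """Cut a height x width window at a random position inside the image."""
    top = rng.randint(0, len(image) - height)
    left = rng.randint(0, len(image[0]) - width)
    return [row[left:left + width] for row in image[top:top + height]]

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"small": ["tiny", "little"], "photo": ["picture", "image"]}

def replace_synonyms(sentence, rng, prob=0.5):
    """Swap each word that has a known synonym with probability prob."""
    words = []
    for word in sentence.split():
        options = SYNONYMS.get(word)
        if options and rng.random() < prob:
            words.append(rng.choice(options))
        else:
            words.append(word)
    return " ".join(words)

rng = random.Random(0)  # seeded so that runs are repeatable
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flipped = flip_horizontal(image)         # [[3, 2, 1], [6, 5, 4], [9, 8, 7]]
cropped = random_crop(image, 2, 2, rng)  # a random 2 x 2 window
sentence = replace_synonyms("a small photo of a cat", rng)
```

Each function returns a new sample and leaves the original untouched, so the original and its variants can all be used for training.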

The core idea is to help machine learning models learn more robust patterns and reduce overfitting, especially when the original dataset is small or lacks variety. Augmentation is typically applied only to the training data, and each transformed sample keeps the label of its original.

Augmentation strategies can be simple rule-based transforms or can be learned by a model. They can be applied on the fly during training or generated ahead of time, depending on the task and the available compute resources.
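The difference between pre-generating augmented data and applying it on the fly can be sketched as follows. This is a simplified illustration, not a reference implementation: `random_flip`, `pregenerate`, and `on_the_fly` are hypothetical names, and samples are plain lists standing in for real data.

```python
import random

def random_flip(sample, rng):
    """Rule-based transform: reverse the sample about half of the time."""
    return sample[::-1] if rng.random() < 0.5 else sample

def pregenerate(dataset, copies, rng):
    """Offline: build the augmented dataset once, before training starts."""
    augmented = list(dataset)
    for sample, label in dataset:
        for _ in range(copies):
            augmented.append((random_flip(sample, rng), label))
    return augmented

def on_the_fly(dataset, epochs, rng):
    """Online: draw a fresh random variant every time a sample is read."""
    for _ in range(epochs):
        for sample, label in dataset:
            yield random_flip(sample, rng), label

rng = random.Random(0)
dataset = [([1, 2, 3], "a"), ([4, 5, 6], "b")]
stored = pregenerate(dataset, copies=2, rng=rng)          # 2 originals + 4 variants
streamed = list(on_the_fly(dataset, epochs=3, rng=rng))   # 2 samples x 3 epochs
```

Pre-generation pays the transformation cost once but needs storage for every variant. The on-the-fly version stores nothing extra and can show the model a different variant in each epoch, at the price of running the transform at every step.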

Example

In an image classification task with only 100 photos of cats, data augmentation might create two additional versions of each photo, one flipped horizontally and one with adjusted brightness, tripling the training set to 300 images.
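The arithmetic of this example can be checked with a short sketch. The 100 photos are replaced here by tiny 2 x 2 grayscale stand-ins, and the brightness shift of 40 is an arbitrary value chosen for illustration.

```python
def flip_horizontal(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def adjust_brightness(image, delta):
    """Shift every pixel by delta, clamped to the 0-255 range."""
    return [[max(0, min(255, p + delta)) for p in row] for row in image]

# Stand-ins for the 100 cat photos: tiny 2 x 2 grayscale images.
photos = [[[i, i + 1], [i + 2, i + 3]] for i in range(100)]
labels = ["cat"] * 100

train_images, train_labels = [], []
for image, label in zip(photos, labels):
    variants = (image, flip_horizontal(image), adjust_brightness(image, 40))
    for variant in variants:
        train_images.append(variant)
        train_labels.append(label)  # the label is carried over unchanged

print(len(train_images))  # 300
```

Every original contributes itself plus two variants, and all three share the same "cat" label.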

Why it matters

Modern deep learning models require large amounts of varied data to generalize well. Augmentation can improve generalization when real-world data is limited, and it is a standard step in many computer vision and NLP pipelines.

Frequently asked questions

Does data augmentation change the labels?

No, the transformations are designed to preserve the original label so the new samples remain valid training examples.