What is Data Augmentation?
Data augmentation is a technique that artificially increases the size and diversity of a training dataset by creating modified versions of existing data samples.
It works by applying random or systematic transformations to original data points, such as rotating, flipping, or cropping images, or replacing words with synonyms in text. These changes create new training examples that are similar but not identical to the originals.
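The image transformations named above can be sketched in a few lines. This is a minimal illustration, assuming an image is just a nested list of pixel values; the function names are hypothetical, and real pipelines would normally rely on an image library instead.

```python
import random

def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def rotate_90(image):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def random_crop(image, height, width, rng):
    """Cut a height x width window at a random position."""
    top = rng.randint(0, len(image) - height)
    left = rng.randint(0, len(image[0]) - width)
    return [row[left:left + width] for row in image[top:top + height]]

# A toy 3x3 "image" standing in for real pixel data.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

flipped = flip_horizontal(image)   # [[3, 2, 1], [6, 5, 4], [9, 8, 7]]
rotated = rotate_90(image)         # [[7, 4, 1], [8, 5, 2], [9, 6, 3]]
cropped = random_crop(image, 2, 2, random.Random(0))  # a random 2x2 window
```

Each result is similar to the original but not identical to it, which is what makes it usable as an extra training example.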
The core idea is to help machine learning models learn more robust patterns and reduce overfitting, especially when the original dataset is small or lacks variety. It is commonly used during the training phase without changing the underlying labels.
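The label-preserving property can be illustrated with the synonym replacement mentioned earlier. The sketch below assumes a tiny hand-written synonym table (SYNONYMS) and hypothetical helper names; it is not taken from any particular library.

```python
import random

# Hypothetical synonym table, written by hand for illustration only.
SYNONYMS = {
    "great": ["excellent", "wonderful"],
    "movie": ["film"],
}

def replace_synonyms(text, rng, prob=0.5):
    """Swap each known word for a random synonym with probability prob."""
    words = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < prob:
            words.append(rng.choice(options))
        else:
            words.append(word)
    return " ".join(words)

def augment_example(example, rng):
    """Transform the text but carry the label over unchanged."""
    text, label = example
    return replace_synonyms(text, rng), label

rng = random.Random(42)
original = ("a great movie", "positive")
augmented = augment_example(original, rng)
# augmented[1] is still "positive": only the text may have changed.
```

Only the input is transformed; the label is copied across, so the new pair remains a valid training example as long as the substitution does not alter the meaning.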
Augmentation strategies range from simple rule-based transforms to transformations learned by a model, and they can be applied on the fly during training or pre-generated ahead of time, depending on the task and the available compute resources.
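The difference between pre-generated and on-the-fly augmentation can be sketched as follows. This is a simplified illustration with hypothetical function names and a two-sample toy dataset; it assumes a horizontal flip as the only transform.

```python
import random

def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def pregenerate(dataset):
    """Offline: build the augmented copies once and store them."""
    augmented = list(dataset)
    for image, label in dataset:
        augmented.append((flip_horizontal(image), label))
    return augmented

def on_the_fly(dataset, rng, flip_prob=0.5):
    """Online: transform each sample at read time, nothing is stored."""
    for image, label in dataset:
        if rng.random() < flip_prob:
            image = flip_horizontal(image)
        yield image, label

dataset = [([[0, 1], [2, 3]], "cat"), ([[4, 5], [6, 7]], "dog")]

stored = pregenerate(dataset)  # 4 samples, fixed for the whole training run

rng = random.Random(0)
for epoch in range(2):
    # Still 2 samples per epoch, but which ones are flipped can vary.
    batch = list(on_the_fly(dataset, rng))
```

Pre-generating trades storage for speed at training time, while on-the-fly augmentation keeps the stored dataset small and can show the model a different variant of each sample in every epoch.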
Example
In an image classification task with only 100 photos of cats, data augmentation might add a horizontally flipped copy and a brightness-adjusted copy of each photo, tripling the training set to 300 images.
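The arithmetic of this example can be checked with a short sketch. It assumes 100 tiny stand-in "photos" built from nested lists and a hypothetical brightness factor of 1.2; keeping each original and adding one flipped and one brightness-adjusted copy gives three samples per photo.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def adjust_brightness(image, factor):
    """Scale pixel values and clamp them to the 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]

# Stand-in data: 100 tiny 2x2 grayscale "photos", all labeled "cat".
photos = [[[i, i + 10], [i + 20, i + 30]] for i in range(100)]
dataset = [(photo, "cat") for photo in photos]

augmented = []
for image, label in dataset:
    augmented.append((image, label))                          # original
    augmented.append((flip_horizontal(image), label))         # flipped copy
    augmented.append((adjust_brightness(image, 1.2), label))  # brighter copy

print(len(dataset), len(augmented))  # prints: 100 300
```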
Why it matters
Modern deep learning models typically need large amounts of varied data to generalize well. Augmentation can improve performance when real-world data is limited, and it is now a standard step in many computer vision and NLP pipelines.
Frequently asked questions
Does data augmentation change the labels?
No, the transformations are designed to preserve the original label so the new samples remain valid training examples.
Related terms
Overfitting happens when a machine learning model learns the training data too closely, including its noise and quirks, so it fails to perform well on new, unseen data.
Regularization is a set of techniques in machine learning that reduce overfitting by constraining the model, most commonly by adding a penalty term to the loss function that discourages overly complex models or large parameter values.
Synthetic data is artificially generated information designed to mimic the statistical properties of real-world data, created by algorithms rather than collected from actual events or observations.
Transfer learning is a machine learning method that reuses a model trained on one task as the starting point for a different but related task.
Batch size is the number of training examples processed together in a single forward and backward pass during model training.
A dataset is a structured collection of data points used to train, validate, or test machine learning models.