What is a Validation Set?
A validation set is a separate portion of a dataset used during model training to evaluate performance and tune hyperparameters.
It is held out from the training data so the model never fits its parameters to it, which makes its score a much less biased check on generalization than training accuracy at each stage of training.
Practitioners use it to select hyperparameters such as learning rate or model depth and to decide when to stop training (early stopping) before the model starts overfitting.
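The early-stopping idea can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `early_stopping_epoch`, the `patience` parameter, and the loss values are all assumptions made up for the example. It assumes you already have one validation loss per epoch and stops once the loss has failed to improve for `patience` consecutive epochs.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses.

    Hypothetical helper: training is considered stopped once the validation
    loss has not improved for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch

    # Never triggered: training ran through every recorded epoch.
    return best_epoch, len(val_losses) - 1


# Made-up validation losses: they improve, then start rising (overfitting).
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best_epoch, stop_epoch = early_stopping_epoch(losses, patience=2)
print(best_epoch, stop_epoch)  # 3 5
```

In practice you would keep the model weights saved at `best_epoch` rather than the weights from the epoch where training stopped.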
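Once tuning is complete, the final model is evaluated on a completely untouched test set. Because the validation score was itself used to make tuning decisions, it tends to be optimistic, so the test set is what provides the reported estimate of generalization performance.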
Example
When building a spam filter, you might split 100,000 emails into 70k for training, 15k for validation, and 15k for testing; the validation emails are used to try different thresholds and feature counts until the model performs well on them.
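The 70/15/15 split above could be produced roughly as follows. This is a sketch using only the Python standard library; the function name `split_indices`, its parameters, and the fixed seed are assumptions for illustration. It shuffles example indices rather than the emails themselves.

```python
import random


def split_indices(n_examples, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle indices 0..n_examples-1 and cut them into train/val/test lists.

    Hypothetical helper: whatever is left after the training and validation
    portions becomes the test portion.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible

    n_train = round(n_examples * train_frac)
    n_val = round(n_examples * val_frac)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 70000 15000 15000
```

Shuffling before cutting matters when the emails are stored in some order (for example by date or by label); otherwise the three portions could have different distributions.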
Why it matters
It enables reliable hyperparameter selection and guards against overfitting to the test data, which is essential for building trustworthy AI systems that perform well on new, unseen data.
Frequently asked questions
What is the difference between a validation set and a test set?
The validation set is used repeatedly during development to tune the model, while the test set is used only once at the end for final evaluation.
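The difference in how the two sets are used can be illustrated with a small sketch. All scores, labels, and candidate thresholds below are made up, and the helper names `accuracy` and `pick_threshold` are hypothetical. The validation data is consulted once per candidate threshold, while the test data is touched a single time after the choice is final.

```python
def accuracy(scores, labels, threshold):
    """Fraction of examples classified correctly when score >= threshold means spam (1)."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def pick_threshold(val_scores, val_labels, candidates):
    """Return the candidate threshold with the highest validation accuracy."""
    return max(candidates, key=lambda t: accuracy(val_scores, val_labels, t))


# Made-up model scores and true labels (1 = spam, 0 = not spam).
val_scores = [0.10, 0.35, 0.40, 0.65, 0.80, 0.95]
val_labels = [0, 0, 1, 1, 1, 1]
test_scores = [0.20, 0.45, 0.55, 0.90]
test_labels = [0, 1, 0, 1]

# Validation set: used repeatedly, once for each candidate threshold.
best_threshold = pick_threshold(val_scores, val_labels, [0.3, 0.4, 0.5, 0.7])

# Test set: used exactly once, after the threshold is fixed.
test_accuracy = accuracy(test_scores, test_labels, best_threshold)
print(best_threshold, test_accuracy)  # 0.4 0.75
```

Here the chosen threshold scores perfectly on the validation data but lower on the test data, which is the kind of gap the final test evaluation is meant to reveal.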
Related terms
A test set is a portion of data held out entirely from model training and tuning, used only at the end to measure how well the final model generalizes to new examples.
Overfitting happens when a machine learning model learns the training data too closely, including its noise and quirks, so it fails to perform well on new, unseen data.
Batch size is the number of training examples processed together in a single forward and backward pass during model training.
Chunking is the process of breaking large datasets, documents, or files into smaller, fixed-size or semantically meaningful segments. It is a common data preprocessing step in AI/ML pipelines to manage memory and enable efficient processing.
Cosine similarity measures how similar two vectors are by computing the cosine of the angle between them, ignoring their magnitudes.
Data augmentation is a technique that artificially increases the size and diversity of a training dataset by creating modified versions of existing data samples.