How large should a test set be?

Common splits reserve 10-20% of the data for testing, though the exact size depends on the total amount of data available.
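
One way to reason about the size is to ask how noisy the reported metric would be. The sketch below is a rough illustration, not a rule: it assumes the test examples are independent, uses the usual binomial approximation for the standard error of an accuracy estimate, and takes a hypothetical true accuracy of 0.90. The function name `accuracy_standard_error` is invented for this example.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test independent examples."""
    if n_test <= 0:
        raise ValueError("n_test must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

# Hypothetical true accuracy of 0.90, evaluated on test sets of different sizes.
for n_test in (100, 1_000, 10_000):
    se = accuracy_standard_error(0.90, n_test)
    # Roughly 95% of estimates fall within about two standard errors.
    print(f"n_test={n_test:>6}: 0.90 +/- {2 * se:.3f}")
```

Under these assumptions the margin shrinks from about 0.06 with 100 test examples to about 0.006 with 10,000, which is why a small dataset may need a larger test fraction than a very large one.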

Can I look at the test set during model development?

No. Looking at or tuning on the test set invalidates its purpose as an unbiased final evaluation.

What is a Test Set?

A test set is a portion of data held out entirely from model training and tuning, used only at the end to measure how well the final model generalizes to new examples.

In machine learning, available data is typically split into three parts: a training set for learning patterns, a validation set for tuning hyperparameters, and a test set that remains untouched until the very end.

Because the model never sees these examples during development, the test set gives an unbiased estimate of performance on new data drawn from the same distribution. A large gap between validation and test scores is a sign of overfitting to the training or validation data.

Best practice keeps the test set fixed and uses it only once for final reporting, ensuring the performance numbers reflect how the model will behave on future unseen data.

Example

A researcher splits 10,000 labeled photos into 7,000 for training, 2,000 for validation, and 1,000 for testing. After the model is fully trained and tuned, accuracy is measured only on the 1,000 test photos to report final results.
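
The split described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `split_indices` and the fixed seed are assumptions made for the example, and real projects often use a library helper or a stratified split instead.

```python
import random

def split_indices(n_items, n_train, n_val, seed=0):
    """Shuffle indices 0..n_items-1 and cut them into train/validation/test."""
    if n_train + n_val > n_items:
        raise ValueError("train and validation sizes exceed the dataset size")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # everything left over is the test set
    return train, val, test

# 10,000 photos: 7,000 for training, 2,000 for validation, 1,000 for testing.
train_idx, val_idx, test_idx = split_indices(10_000, 7_000, 2_000)
print(len(train_idx), len(val_idx), len(test_idx))  # prints: 7000 2000 1000
```

Fixing the seed means the same 1,000 photos end up in the test set on every run, so the test set stays fixed across experiments.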

Why it matters

Without a separate test set, reported performance can be overly optimistic, and a model that looks strong during development can fail in production. Holding out a test set remains the standard way to obtain trustworthy generalization metrics.

Frequently asked questions

Why can't I just use the training data to test my model?

Training data has already been seen by the model, so testing on it gives an overly optimistic score that does not reflect performance on new data.
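
An extreme case makes the gap visible. The sketch below is a toy, not a real training setup: the data is hypothetical (integer ids with coin-flip labels, so there is nothing to learn), and the "model" is just a lookup table that memorizes every training label.

```python
import random

rng = random.Random(0)

# Hypothetical data: integer ids with coin-flip labels, so no real pattern exists.
train = [(i, rng.randint(0, 1)) for i in range(1_000)]
test = [(i, rng.randint(0, 1)) for i in range(1_000, 2_000)]

# A "model" that simply memorizes the label of every training example.
memory = dict(train)

def predict(x):
    return memory.get(x, 0)  # unseen inputs fall back to a fixed guess

def accuracy(examples):
    return sum(predict(x) == y for x, y in examples) / len(examples)

train_acc = accuracy(train)
test_acc = accuracy(test)
print(f"training accuracy: {train_acc:.2f}")
print(f"held-out accuracy: {test_acc:.2f}")
```

The memorizer scores perfectly on the data it has seen and close to chance on the held-out examples, which is the kind of gap a test set is meant to expose.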

Related terms

Validation Set

A validation set is a separate portion of a dataset used during model training to evaluate performance and tune hyperparameters.

Overfitting

Overfitting happens when a machine learning model learns the training data too closely, including its noise and quirks, so it fails to perform well on new, unseen data.

Batch Size

Batch size is the number of training examples processed together in a single forward and backward pass during model training.

Chunking

Chunking is the process of breaking large datasets, documents, or files into smaller, fixed-size or semantically meaningful segments. It is a common data preprocessing step in AI/ML pipelines to manage memory and enable efficient processing.

Cosine Similarity

Cosine similarity measures how similar two vectors are by computing the cosine of the angle between them, ignoring their magnitudes.

Data Augmentation

Data augmentation is a technique that artificially increases the size and diversity of a training dataset by creating modified versions of existing data samples.
