What is Pretraining?

Pretraining is the first stage of training an AI model on a very large, general dataset so it learns broad patterns and representations before being adapted to specific tasks.

In pretraining, models are typically trained using self-supervised objectives on massive unlabeled datasets such as internet text or image collections. This allows the model to learn useful features like language structure or visual patterns without human-labeled examples.

Common objectives include next-token prediction (predicting each token from the tokens before it) and masked language modeling (predicting tokens that have been hidden within a sequence). Both require the model to use context and relationships in the data. After pretraining, the model weights capture general knowledge that can be reused for downstream tasks.
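To make the idea of self-supervision concrete, here is a minimal sketch of how training examples for both objectives can be derived from raw text with no human labels. The function names, the whitespace tokenizer, the `[MASK]` string, and the masking probability are assumptions chosen for illustration. Production masked-language-model pipelines are more elaborate (for instance, BERT sometimes substitutes a random token or leaves the chosen token unchanged), so treat this as a simplified picture rather than any particular library's behavior.

```python
import random

def next_token_examples(tokens):
    """Each prefix is an input; the token that follows it is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens; the hidden originals become the targets."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # position -> original token to be predicted
        else:
            masked.append(tok)
    return masked, targets

# Hypothetical one-sentence "corpus", split on whitespace for simplicity.
tokens = "the cat sat on the mat".split()

pairs = next_token_examples(tokens)
masked, targets = masked_lm_example(tokens, mask_prob=0.5)

print(pairs[0])   # first (context, target) pair
print(masked, targets)
```

In both cases the targets come from the text itself, which is why no labeled data is needed at this stage.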

Pretraining is followed by fine-tuning or adaptation on smaller, task-specific datasets, making the overall training more data-efficient than training from scratch for every new application.
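The two-stage flow can be sketched with a deliberately tiny stand-in for a neural network: a bigram model whose "weights" are word-pair counts. Everything here is hypothetical and chosen for illustration, including the `BigramModel` class and both corpora. Real fine-tuning updates neural network weights with gradient descent, often on a different task such as classification. The point shown is only that the second stage starts from what the first stage learned instead of from scratch.

```python
from collections import Counter, defaultdict

class BigramModel:
    """Toy next-word model; its 'weights' are bigram counts (an assumption
    made to keep the sketch small, not how foundation models work)."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, sentences):
        # The same update rule serves both stages; only the data differs.
        for sentence in sentences:
            words = sentence.lower().split()
            for prev, nxt in zip(words, words[1:]):
                self.counts[prev][nxt] += 1

    def predict(self, word):
        # Most frequent follower seen so far, or None for an unseen word.
        followers = self.counts.get(word)
        return followers.most_common(1)[0][0] if followers else None

# Hypothetical data: a broad "general" corpus and a small domain corpus.
general_corpus = [
    "the weather is nice today",
    "the weather is cold today",
    "the food is nice",
]
domain_corpus = [
    "the battery is terrible",
    "the battery is terrible",
    "the battery is terrible",
]

model = BigramModel()
model.train(general_corpus)      # stage 1: pretraining on general text
before = model.predict("is")

model.train(domain_corpus)       # stage 2: fine-tuning, starting from stage 1
after = model.predict("is")

print(before, after)
print(model.predict("weather"))  # knowledge from stage 1 is still available
```

After the second stage the model has adapted to the domain data while retaining what it learned from the general corpus, which is the reuse that makes the approach data-efficient.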

Example

A language model is pretrained on billions of web pages to learn grammar and facts about the world, then later fine-tuned on customer reviews to perform sentiment analysis.

Why it matters

Pretraining enables modern foundation models to achieve strong performance on downstream tasks with far less labeled data and task-specific compute than training from scratch would require. It forms the basis of systems like GPT and BERT.

Frequently asked questions

How is pretraining different from regular training?

Pretraining uses very large unlabeled datasets to learn general knowledge first, while regular training often starts from scratch on a specific labeled dataset.