Why not just split text into words?

Word-level splitting needs a very large vocabulary and still cannot represent words it has never seen. Subword methods keep the vocabulary manageable and build new or rare words from smaller known pieces.

Does tokenization happen before or after training?

Before. The tokenizer is usually trained first on a large corpus to build its vocabulary, and that same tokenizer is then used to prepare all data for model training.
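As a rough illustration of what "training a tokenizer" can mean, here is a minimal sketch of learning Byte Pair Encoding merge rules from a tiny word list. The function name `learn_bpe_merges` and the toy corpus are assumptions made for this example. Production tokenizers add byte-level handling, pre-tokenization, and far larger corpora.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn up to num_merges BPE merge rules from a list of words (toy sketch)."""
    # Represent each distinct word as a tuple of single characters with its count.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

# Hypothetical toy corpus; a real tokenizer is trained on far more text.
toy_corpus = ["low", "low", "lower", "lowest", "new", "newer"]
merges = learn_bpe_merges(toy_corpus, num_merges=2)
print(merges)
```

Each learned merge becomes a new vocabulary entry, so the vocabulary is fixed before any model training starts.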

What is Tokenization?

Tokenization is the process of breaking text into smaller units called tokens that language models can process numerically.

It converts raw strings into sequences of tokens using algorithms such as word splitting, subword methods like Byte Pair Encoding (BPE), or character-level segmentation. Each token is then mapped to an ID from a fixed vocabulary.
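The mapping from tokens to IDs can be sketched in a few lines. This is a simplified illustration, not how any particular tokenizer is implemented: the `<unk>` entry for unknown tokens and the `encode` helper are assumptions made for the example.

```python
text = "tokens help models"

# Two granularities of the same string.
word_tokens = text.split()   # word-level: split on whitespace
char_tokens = list(text)     # character-level: one token per character

# Build a fixed vocabulary from the word tokens.
# ID 0 is reserved for anything not in the vocabulary (an assumed convention).
vocab = {"<unk>": 0}
for token in word_tokens:
    vocab.setdefault(token, len(vocab))

def encode(tokens, vocab):
    """Map each token to its ID, falling back to the <unk> ID."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

ids = encode(word_tokens, vocab)
unseen = encode(["tokens", "rock"], vocab)  # "rock" is not in the vocabulary
print(ids, unseen)
```

The unknown-token fallback shows the weakness of a word-level vocabulary: any word outside it collapses to the same ID.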

Subword tokenization balances vocabulary size and coverage, allowing models to handle rare words by composing them from common pieces while keeping common words as single tokens.
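To show how a rare word can be composed from common pieces, here is a minimal sketch using greedy longest-match segmentation. This is a simpler stand-in for illustration, not BPE itself, and the vocabulary and the `segment` function are hypothetical.

```python
def segment(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary piece matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

# Hypothetical toy vocabulary of common pieces and one common whole word.
toy_vocab = {"token", "ization", "un", "help", "ful", "the"}

print(segment("the", toy_vocab))           # common word stays whole
print(segment("tokenization", toy_vocab))  # rarer word is built from pieces
```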

The resulting token sequence is what gets turned into embeddings and fed into transformer layers during both training and inference.
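The step from token IDs to embeddings is essentially a table lookup. The sketch below assumes a randomly initialized table with made-up sizes. In a real model the table is a learned parameter with thousands of rows and hundreds or thousands of columns.

```python
import random

random.seed(0)  # fixed seed so the toy table is reproducible

vocab_size, embed_dim = 6, 4  # assumed toy sizes
# One row of embed_dim numbers per vocabulary entry.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embed_dim)]
    for _ in range(vocab_size)
]

token_ids = [2, 5, 2]  # hypothetical output of a tokenizer
# Look up the row for each ID; the same ID always yields the same vector.
embedded = [embedding_table[i] for i in token_ids]

print(len(embedded), len(embedded[0]))
```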

Example

The sentence 'ChatGPT is helpful!' might become the tokens ['Chat', 'G', 'PT', ' is', ' helpful', '!'] depending on the tokenizer, each mapped to a numeric ID.

Why it matters

Tokenization is the essential first step that lets LLMs turn human language into numbers they can learn from, directly affecting model efficiency, vocabulary size, and handling of different languages.

Frequently asked questions

What is a token?

A token is a chunk of text, often a word or part of a word, that the model treats as a single unit.

Related terms

Embedding

An embedding (or vector embedding) is a way to represent words, sentences, or other data as dense numerical vectors in a high-dimensional space so that similar items end up close together.
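"Close together" is often measured with cosine similarity. The sketch below uses hand-picked three-dimensional vectors purely for illustration. They are not taken from any real model, and real embeddings have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, chosen by hand for this example.
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(cat, kitten))  # close to 1: similar direction
print(cosine_similarity(cat, car))     # noticeably lower
```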

Transformer

A Transformer is a neural network architecture that processes sequential data like text using self-attention to weigh relationships between all parts of the input at once.

Natural Language Processing

Natural Language Processing (NLP) is a branch of artificial intelligence that enables computers to understand, interpret, and generate human language in useful ways.

Attention Mechanism

The attention mechanism is a technique in neural networks that lets the model dynamically focus on the most relevant parts of the input when processing each element, rather than treating all inputs equally.
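One common form of this idea is scaled dot-product attention, sketched below in plain Python. This is a simplified illustration: the input vectors are made up, and the learned query, key, and value projections that real Transformers apply are left out, so the same vectors play all three roles.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical inputs: three positions, two dimensions each.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
print(weights[0])
```

Each row of `weights` shows how strongly one position attends to every other position, which is the "dynamic focus" described above.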

Context Length

Context length is the maximum number of tokens an LLM can process at once, which acts as its effective memory window.

Context Window

A context window is the maximum number of tokens an LLM can process together in one pass, including the user's input, any conversation history, and the text the model generates in response.
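When a token sequence is longer than the limit, something has to be dropped. The sketch below shows one simple strategy, keeping only the most recent tokens. The function name `fit_to_context` and the choice to discard the oldest tokens are assumptions for illustration. Real applications may instead summarize history or preserve a system prompt.

```python
def fit_to_context(token_ids, max_tokens):
    """Keep only the most recent max_tokens token IDs."""
    if max_tokens <= 0:
        return []
    # A negative slice keeps the tail; shorter inputs are returned unchanged.
    return token_ids[-max_tokens:]

history = list(range(10))  # stand-in for ten token IDs
print(fit_to_context(history, 4))
```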