What is Tokenization?
Tokenization is the process of breaking text into smaller units called tokens that language models can process numerically.
It converts raw strings into sequences of tokens using algorithms such as word splitting, subword methods like Byte Pair Encoding (BPE), or character-level segmentation. Each token is then mapped to an ID from a fixed vocabulary.
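As a minimal sketch of the split-then-map pipeline described above, the following uses word-level and character-level splitting and assigns IDs in order of first appearance. The function names, the `<unk>` entry, and the sample sentence are illustrative assumptions, not part of any particular tokenizer.

```python
import re

def word_tokenize(text):
    # Word-level splitting: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Character-level segmentation: every character is its own token.
    return list(text)

def build_vocab(tokens):
    # Assign IDs in order of first appearance; ID 0 is reserved for unknown tokens.
    vocab = {"<unk>": 0}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    # Map each token to its ID, falling back to the unknown-token ID.
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

text = "the cat sat on the mat."
tokens = word_tokenize(text)
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
print(tokens)
print(ids)
```

Because the vocabulary is fixed once built, any token it has never seen (for example "dog" here) falls back to the unknown ID, which is the coverage problem that subword methods are meant to reduce.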
Subword tokenization balances vocabulary size and coverage, allowing models to handle rare words by composing them from common pieces while keeping common words as single tokens.
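To illustrate how subword pieces emerge, here is a toy sketch of BPE-style merge learning: start from characters and repeatedly merge the most frequent adjacent pair. The tiny corpus, its counts, and the helper names are assumptions chosen for illustration; production tokenizers add byte-level handling, pre-tokenization, and much larger corpora.

```python
from collections import Counter

def merge_pair(symbols, pair):
    # Replace every adjacent occurrence of `pair` with the concatenated symbol.
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_counts, num_merges):
    # Each word starts as a tuple of single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = {merge_pair(s, best): c for s, c in corpus.items()}
    return merges

def apply_bpe(word, merges):
    # Tokenize a word by replaying the learned merges in order.
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical word frequencies, for illustration only.
word_counts = {"low": 5, "lower": 2, "lowest": 3}
merges = learn_bpe(word_counts, num_merges=2)
print(apply_bpe("low", merges))
print(apply_bpe("lowly", merges))
```

With two merges, the frequent word "low" ends up as a single token, while the unseen word "lowly" is composed from "low" plus single characters, which is the trade-off the paragraph above describes.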
The resulting token sequence is what gets turned into embeddings and fed into transformer layers during both training and inference.
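The step from token IDs to embeddings is a table lookup: each ID selects one row of an embedding table. The sketch below uses plain Python lists and random values as stand-ins; the table size, dimension, and function names are assumptions, and in a real model the table values are learned during training.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    # One row of `dim` floats per vocabulary entry (random placeholders here).
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids, table):
    # Look up the row for each token ID, preserving sequence order.
    return [table[i] for i in token_ids]

table = make_embedding_table(vocab_size=8, dim=4)
vectors = embed([1, 2, 3, 1], table)
print(len(vectors), len(vectors[0]))
```

Note that the same token ID always yields the same vector, so the first and last positions in this example receive identical embeddings before any later layer distinguishes them by position or context.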
Example
The sentence 'ChatGPT is helpful!' might become the tokens ['Chat', 'G', 'PT', ' is', ' helpful', '!'] depending on the tokenizer, each mapped to a numeric ID.
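The split above can be reproduced with a small sketch that uses greedy longest-match lookup against a vocabulary. Both the vocabulary and its IDs below are invented for illustration, and longest-match is a simple stand-in here rather than the learned-merge procedure that BPE tokenizers use.

```python
def greedy_tokenize(text, vocab):
    # At each position, take the longest vocabulary entry that matches.
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    pos = 0
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in vocab:
                tokens.append(piece)
                pos += length
                break
        else:
            raise ValueError(f"no vocabulary entry matches text at position {pos}")
    return tokens

# Hypothetical vocabulary; real tokenizers have tens of thousands of entries.
vocab = {"Chat": 0, "G": 1, "PT": 2, " is": 3, " helpful": 4, "!": 5}

tokens = greedy_tokenize("ChatGPT is helpful!", vocab)
ids = [vocab[t] for t in tokens]
print(tokens)
print(ids)
```

Leading spaces are kept as part of the tokens ' is' and ' helpful', so joining the tokens back together recovers the original string exactly.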
Why it matters
Tokenization is the first step that lets LLMs turn human language into numbers they can learn from. The choice of tokenizer determines the vocabulary size, how many tokens a given text consumes (and therefore how much compute and context it uses), and how well the model handles different languages.
Frequently asked questions
What is a token?
A token is a chunk of text, often a word or part of a word, that the model treats as a single unit.
Related terms
An embedding (or vector embedding) is a way to represent words, sentences, or other data as dense numerical vectors in a high-dimensional space so that similar items end up close together.
A Transformer is a neural network architecture that processes sequential data like text using self-attention to weigh relationships between all parts of the input at once.
Natural Language Processing (NLP) is a branch of artificial intelligence that enables computers to understand, interpret, and generate human language in useful ways.
The attention mechanism is a technique in neural networks that lets the model dynamically focus on the most relevant parts of the input when processing each element, rather than treating all inputs equally.
Context length is the maximum number of tokens an LLM can process at once, acting as its effective memory window.
A context window is the maximum number of tokens an LLM can process together in one pass, including the user's input and any conversation history.
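Because the context window is counted in tokens, an application has to trim what it sends when the input plus conversation history exceeds the limit. The sketch below drops the oldest history tokens first; that policy, the function name, and the parameter names are assumptions for illustration, and other strategies (such as summarizing older turns) are also possible.

```python
def fit_to_context(history_ids, input_ids, max_tokens):
    # The new input must fit on its own; history fills whatever room remains.
    if len(input_ids) > max_tokens:
        raise ValueError("input alone exceeds the context window")
    budget = max_tokens - len(input_ids)
    # Keep only the most recent history tokens that fit in the remaining budget.
    kept_history = history_ids[-budget:] if budget > 0 else []
    return kept_history + input_ids

history = [1, 2, 3, 4, 5]
new_input = [6, 7]
print(fit_to_context(history, new_input, max_tokens=4))
```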