What is Tokenization?

Tokenization is the process of breaking text into smaller units called tokens that language models can process numerically.

It converts raw strings into sequences of tokens using algorithms such as word splitting, subword methods like Byte Pair Encoding (BPE), or character-level segmentation. Each token is then mapped to an ID from a fixed vocabulary.
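As a rough illustration of the two simplest schemes and the ID lookup, here is a minimal sketch. The function names, the `<unk>` placeholder for out-of-vocabulary tokens, and the sample sentences are all assumptions made for this example, not part of any particular library.

```python
import re

def word_tokenize(text):
    # Word-level: runs of word characters, plus each punctuation mark on its own.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Character-level: every character (including spaces) is a token.
    return list(text)

def build_vocab(tokens, unk="<unk>"):
    # Assign IDs in order of first appearance; ID 0 is reserved for unknowns.
    vocab = {unk: 0}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab, unk="<unk>"):
    # Map each token to its ID, falling back to the unknown ID.
    return [vocab.get(tok, vocab[unk]) for tok in tokens]

words = word_tokenize("the cat sat on the mat.")
vocab = build_vocab(words)
ids = encode(word_tokenize("the dog sat."), vocab)
print(words)  # ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
print(ids)    # [1, 0, 3, 6]  ('dog' is not in the vocabulary)
```

The unknown ID for 'dog' shows the main weakness of a pure word-level vocabulary: any word not seen when the vocabulary was built loses its identity.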

Subword tokenization balances vocabulary size and coverage, allowing models to handle rare words by composing them from common pieces while keeping common words as single tokens.
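The composition of rare words from common pieces can be sketched with a toy BPE encoder. This version only applies an already-learned list of merge rules to a single word; the merge list `TOY_MERGES` is invented for illustration, and real tokenizers learn thousands of merges from a corpus and usually operate on bytes with extra pre-tokenization steps.

```python
def bpe_encode(word, merges):
    """Apply merge rules to one word, lowest-ranked (earliest learned) pair first."""
    symbols = list(word)  # start from individual characters
    ranks = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no remaining adjacent pair has a merge rule
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge rules, in the order they would have been learned.
TOY_MERGES = [("t", "h"), ("th", "e"), ("l", "o"), ("lo", "w"), ("e", "r")]

print(bpe_encode("the", TOY_MERGES))     # ['the']
print(bpe_encode("lower", TOY_MERGES))   # ['low', 'er']
print(bpe_encode("lowest", TOY_MERGES))  # ['low', 'e', 's', 't']
```

The common word 'the' ends up as a single token, while the rarer 'lowest' is still representable as a known piece plus individual characters.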

The resulting token sequence is what gets turned into embeddings and fed into transformer layers during both training and inference.
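The step from token IDs to embeddings is, at its core, a table lookup. The sketch below assumes a tiny vocabulary of 6 tokens and 4-dimensional vectors filled with random values; in a real model the table is a learned parameter with far larger dimensions.

```python
import random

random.seed(0)   # fixed seed so the toy table is reproducible
VOCAB_SIZE = 6   # assumed toy vocabulary size
EMBED_DIM = 4    # assumed toy embedding width

# One row per token ID; random numbers stand in for learned weights.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    # Each ID selects its row; the result has one vector per token.
    return [embedding_table[i] for i in token_ids]

vectors = embed([0, 3, 5])
print(len(vectors), len(vectors[0]))  # 3 4
```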

Example

The sentence 'ChatGPT is helpful!' might become the tokens ['Chat', 'G', 'PT', ' is', ' helpful', '!'] depending on the tokenizer, each mapped to a numeric ID.
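The example split can be checked with a short round trip. The numeric IDs below are made up for illustration and do not come from any real tokenizer's vocabulary.

```python
tokens = ['Chat', 'G', 'PT', ' is', ' helpful', '!']

# Hypothetical vocabulary: the IDs are arbitrary placeholders.
token_to_id = {'Chat': 101, 'G': 102, 'PT': 103, ' is': 104, ' helpful': 105, '!': 106}
id_to_token = {i: t for t, i in token_to_id.items()}

ids = [token_to_id[t] for t in tokens]
decoded = ''.join(id_to_token[i] for i in ids)

print(ids)      # [101, 102, 103, 104, 105, 106]
print(decoded)  # ChatGPT is helpful!
```

Because the leading space is stored inside tokens such as ' is' and ' helpful', plain concatenation is enough to recover the original string.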

Why it matters

Tokenization is the first step that lets LLMs turn human language into numbers they can learn from. The choice of tokenizer directly affects sequence length (and with it compute cost), vocabulary size, and how well the model handles different languages.

Frequently asked questions

What is a token?

A token is a chunk of text, often a word or part of a word, that the model treats as a single unit.