What is Speech-to-Text?
Also known as: STT
Speech-to-Text (STT) is an AI technology that converts spoken audio into written text. It is a core task in natural language processing (NLP).
STT systems first capture audio input and break it into small sound units. Acoustic models then map these sounds to possible phonemes or letters while filtering out noise and handling variations like accents.
Next, language models use context and grammar rules to assemble the sounds into coherent words and sentences. Modern systems rely on deep neural networks, especially transformers, trained on massive speech datasets for higher accuracy.
Key challenges include real-time processing, punctuation insertion, and robustness to different speakers, languages, and environments.
Example
When you speak a text message into your phone's keyboard, STT instantly turns your words into typed text that can be sent or edited.
Why it matters
STT powers voice assistants, live captions, and accessibility tools, making technology more natural and inclusive for millions of users worldwide.
Frequently asked questions
Voice recognition identifies who is speaking, while STT focuses on what is being said by converting speech to text.
Related terms
Natural Language Processing (NLP) is a branch of artificial intelligence that enables computers to understand, interpret, and generate human language in useful ways.
Text-to-Speech (TTS) is a generative AI system that converts written text into natural-sounding spoken audio.
Deep Learning is a subset of machine learning that uses multi-layered artificial neural networks to automatically learn complex patterns from large datasets.
A neural network, or artificial neural network (ANN), is a computational model inspired by the human brain that learns to recognize patterns in data by passing information through layers of interconnected artificial neurons.
Beam search is a decoding algorithm used in NLP to generate sequences like sentences by exploring multiple high-probability paths instead of just one.
An embedding (or vector embedding) is a way to represent words, sentences, or other data as dense numerical vectors in a high-dimensional space so that similar items end up close together.