What is Text-to-Speech?
Also known as: TTS
Text-to-Speech (TTS) is a generative AI system that converts written text into natural-sounding spoken audio.
Modern TTS models are trained on large datasets of human speech paired with text. They learn to predict acoustic features such as pitch, rhythm, and timbre directly from the input text.
The process typically involves two stages: a text-to-spectrogram network that generates a mel-spectrogram representation of the audio, followed by a vocoder that converts the spectrogram into waveform samples.
Recent advances use end-to-end neural architectures (e.g., Tacotron, FastSpeech, VITS) that produce more expressive and human-like voices with less manual feature engineering.
Example
When you ask a smart speaker 'What's the weather today?', the device uses TTS to turn the weather report text into a spoken reply that sounds like a natural human voice.
Why it matters
TTS powers accessible interfaces for the visually impaired, enables hands-free interaction in voice assistants, and supports scalable creation of audiobooks, videos, and multilingual content.
Frequently asked questions
No. TTS generates new audio from text using AI models rather than playing back pre-recorded human speech.
Related terms
Diffusion is a generative modeling approach that creates new data samples by learning to reverse a gradual noising process. It starts from pure random noise and iteratively removes noise to produce realistic outputs like images or audio.
A diffusion model is a generative AI technique that creates new data like images by learning to reverse a gradual noising process applied to training examples.
A Generative Adversarial Network (GAN) is a machine learning model made of two neural networks that compete against each other to generate realistic new data, such as images or text.
Generative AI (GenAI) is artificial intelligence that learns patterns from data to create new, original content such as text, images, audio, or code.
A multimodal model is a generative AI system that can process and create content across multiple data types, such as text, images, audio, or video, within a single model.
Stable Diffusion is a generative AI model that creates images from text prompts by reversing a gradual noising process.