TTS Models: A Map Of The Modern Speech Stack
Text-to-speech used to be easy to sketch.
text -> acoustic model -> vocoder -> waveform
That sketch still describes many good systems. It also hides most of what has changed.
The newer TTS papers are not just swapping one Transformer for another. They are choosing different objects for the model to predict: mel frames, STFT frames, waveform latents, codec tokens, semantic tokens, residual codebook stacks, continuous VAE latents, text-aligned acoustic vectors, or some cascade of those pieces.
Once that object is chosen, the rest of the system inherits its contract. A model that predicts mels needs a clock and a vocoder. A model that predicts EnCodec-style tokens inherits the codec frame rate, codebook depth, and decoder. A model that splits semantic and acoustic streams has to keep the streams in agreement. A model that predicts continuous latents needs a sampler or distribution head instead of a softmax.
Sorting systems by the object they predict is the cleanest way to organize the current TTS stack.

Original generated overview figure. The figure compresses the survey into five architecture families across 2022-2026: acoustic/vocoder systems, early speech-token hybrids, discrete codec LMs, semantic/acoustic cascades, and continuous-latent generators.
The model names are noisy, but the interfaces repeat.
The Five Families
The survey behind this post covers 77 TTS and speech-generation systems from 2022 through 2026. The list is not a leaderboard. It is a map of recurring design contracts.
| Family | Count | What the model usually predicts | Representative systems |
|---|---|---|---|
| Acoustic/vocoder and VITS-style generators | 22 | Mel frames, STFTs, waveform, or integrated VITS-style acoustic latents | Nix-TTS, NaturalSpeech, StyleTTS, ProDiff, MB-iSTFT-VITS, OverFlow, MMS-TTS, Mega-TTS, Voicebox, StyleTTS 2, Mega-TTS 2, VITS2, Matcha-TTS, OpenVoice, E2 TTS, M2SE-VTTS, F5-TTS, ZipVoice, UniVoice, ParaStyleTTS, LEMAS-TTS, Raon-OpenTTS |
| Early speech-token hybrids | 2 | Internal VQ-VAE or mel-speech tokens before modern neural-codec TTS stabilizes | TorToise, XTTS |
| Discrete codec LMs and masked codec models | 19 | One or more discrete acoustic-code streams decoded by a neural codec | VALL-E, VALL-E X, Parler-TTS, BASE TTS, VoiceCraft, Mini-Omni, SSR-Speech, LLaSA, Koel-TTS / MagpieTTS, IndexTTS, LLMVoX, Voila, VoiceStar, Inworld TTS-1, Kyutai DSM TTS, Qwen3-TTS, MOSS-TTS, OmniVoice, T5Gemma-TTS |
| Semantic/acoustic and factorized-codec cascades | 23 | Content, speaker/style, duration, global, or acoustic-detail streams before waveform reconstruction | SPEAR-TTS, Make-A-Voice, NaturalSpeech 3, Seed-TTS, CosyVoice, MaskGCT, Fish-Speech, CosyVoice 2, Metis, Spark-TTS, Kimi-Audio, CosyVoice 3, OpenAudio S1 / FishAudio S1, IndexTTS2, Marco-Voice, BatonVoice / BatonTTS, SoulX-Podcast, GLM-TTS, GPA, Fish Audio S2, Voxtral TTS, VoXtream, VoXtream2 |
| Continuous-latent audio-autoencoder generators | 11 | Continuous codec or VAE latents, sampled by diffusion, flow, next-token diffusion, or a continuous AR head | NaturalSpeech 2, MegaTTS 3, SupertonicTTS, SLED-TTS, VIBEVOICE, Pocket TTS / CALM, LAM-TTS / LARoPE, VoxCPM, TADA, Ming-Omni-TTS, Audio-Omni |
Those buckets are intentionally about interfaces, not about brand names. A "flow" model can predict mels, codec latents, or acoustic-detail streams. An "autoregressive" model can emit one token per frame, a stack of residual codebook decisions, or a continuous latent sampled by a local head. The family name only becomes useful when it says what object crosses the boundary.
The Object Being Predicted
Speech generation has many possible intermediate objects. Each one keeps some information cheap and makes other information expensive.

Original generated overview figure. The ladder shows common prediction targets: text, a duration/alignment clock, mel or STFT frames, codec-token matrices, semantic/acoustic split streams, continuous latents, and waveform output.
A mel model predicts a dense acoustic surface. It usually needs a duration model or alignment mechanism to decide how many frames each text segment gets. It then hands those frames to a vocoder.
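As a rough illustration of that clock, here is a minimal length-regulator sketch. It assumes durations are already given in frames; the helper name `length_regulate` and the phoneme labels are hypothetical and not taken from any specific system.

```python
from typing import List, Sequence


def length_regulate(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme for its predicted number of acoustic frames."""
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[str] = []
    for phoneme, n_frames in zip(phonemes, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([phoneme] * n_frames)
    return frames


# Hypothetical durations: the acoustic model would then fill one mel frame per slot.
frame_plan = length_regulate(["HH", "AH", "L", "OW"], [3, 5, 2, 6])
print(len(frame_plan))  # 16
```

If the durations are wrong, the frame plan is wrong before the acoustic model or vocoder ever runs.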
A codec-LM model predicts symbols from a neural audio codec. The waveform path is mostly delegated to the codec decoder. The generator can look like a normal language model, but the vocabulary is not text. It is the codec's opinion about speech.
A semantic/acoustic cascade splits the job. One stream carries words or phonetic content. Another stream carries speaker, prosody, and acoustic detail. This makes the model more controllable when it works, but the cascade can also create hidden disagreement between stages.
A continuous-latent model avoids discrete code IDs for the main acoustic target. That can be attractive when codebook depth becomes awkward, but a continuous target still needs a sampler, a flow, a diffusion head, an implicit distribution head, or another way to choose a point in a continuous space.
The output object is the first serious design choice.
Acoustic Models Are Not Obsolete
The oldest family in the survey is also the easiest to underestimate.
Acoustic/vocoder systems predict mels, STFTs, waveform, or VITS-style latent representations. Nix-TTS, NaturalSpeech, StyleTTS, ProDiff, MB-iSTFT-VITS, OverFlow, MMS-TTS, Mega-TTS, Voicebox, StyleTTS 2, Mega-TTS 2, VITS2, Matcha-TTS, OpenVoice, E2 TTS, F5-TTS, ZipVoice, UniVoice, LEMAS-TTS, and Raon-OpenTTS all belong somewhere in this older-looking branch.
That does not mean the branch stopped changing.
The transition from older duration/mel recipes to flow-matching mel systems is real. Voicebox, Matcha-TTS, E2 TTS, F5-TTS, ZipVoice, LEMAS-TTS, and Raon-OpenTTS all keep a spectrogram-like or mel-like target while changing the generator. The acoustic interface stays familiar, but the model used to fill it changes.
This family has a practical advantage: the boundary is understandable. There is a text side, a clock, an acoustic frame sequence, and a waveform decoder. For paired TTS, that can be a strong engineering position. It gives a builder places to debug:
text normalization -> phonemes -> durations -> acoustic frames -> vocoder
The cost is that the model must solve timing directly. If the clock is wrong, the audio can skip words, repeat words, rush punctuation, or stretch phrases even when the vocoder is excellent.
This is why length modeling keeps reappearing in modern papers under different names. A duration predictor, a target token count, a speech-token clock, a duration-state token, a chunk schedule, and a text-aligned latent rate all do related work. They answer the same question:
Where does the next piece of text land in time?
That question does not disappear when the model becomes larger.
The Codec-LM Turn
VALL-E changed the common mental model for TTS. Instead of predicting mels and then vocoding, it used EnCodec tokens as the speech representation. Text-conditioned TTS became closer to:
text + prompt codec tokens -> language model -> new codec tokens -> codec decoder
That move is powerful. It lets TTS borrow the scaling habits of language models: large corpora, next-token prediction, prompt conditioning, and autoregressive sampling. VALL-E, VALL-E X, Parler-TTS, VoiceCraft, LLaSA, Koel-TTS, IndexTTS, LLMVoX, Voila, VoiceStar, Inworld TTS-1, Kyutai DSM TTS, Qwen3-TTS, MOSS-TTS, OmniVoice, and T5Gemma-TTS all live in or near this codec-token branch.
The hidden tradeoff is that the codec becomes part of the model.

Original generated overview figure. The map shows tokenizer and codec lineages as shared infrastructure feeding codec LMs, semantic/acoustic cascades, and continuous-latent systems.
A codec token is not a neutral unit. It has a sample rate, a frame rate, a codebook count, a codebook size, a decoder, a training corpus, and a failure profile. If two TTS models use different codecs, their metrics compare generators and codec interfaces at the same time.
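One way to make part of that contract concrete is to compute what a codec interface implies for the generator. The sketch below assumes a hypothetical `CodecContract` record; the two configurations are illustrative numbers, not measurements of any named codec.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecContract:
    """Hypothetical record of the parts of a codec the generator inherits."""
    frame_rate_hz: float
    n_codebooks: int
    codebook_size: int

    @property
    def tokens_per_second(self) -> float:
        return self.frame_rate_hz * self.n_codebooks

    @property
    def bits_per_second(self) -> float:
        # Each token carries log2(codebook_size) bits.
        return self.tokens_per_second * math.log2(self.codebook_size)


# Illustrative configurations only.
multi_codebook = CodecContract(frame_rate_hz=75.0, n_codebooks=8, codebook_size=1024)
single_codebook = CodecContract(frame_rate_hz=50.0, n_codebooks=1, codebook_size=65536)

for codec in (multi_codebook, single_codebook):
    print(codec.tokens_per_second, codec.bits_per_second)
```

Two generators trained on these two interfaces face different sequence lengths and different information budgets before any architecture choice is made.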
This is why "codec LM" is too broad as a final label. There are several different codec contracts:
- EnCodec-style RVQ systems use multiple residual codebooks per frame.
- DAC-like systems often inherit a different decoder and audio bandwidth profile.
- XCodec2-style systems push toward low-rate single-code or compact token interfaces.
- Mimi-style systems optimize for low-rate streaming speech tokens.
- Higgs, SNAC, MOSS-Audio-Tokenizer, Qwen tokenizers, and Voxtral Codec define their own interface contracts.
The generator is only one part of the system recipe. A strong codec can raise the ceiling. A weak or mismatched codec can impose artifacts that no generator can fully remove.
Flat Tokens Were Not Enough
Flat codec tokens make TTS look like text generation. But speech is not a text sequence with a speaker label attached.
The model has to carry:
- words
- speaker identity
- accent and pronunciation
- prosody and speaking rate
- emotion or style
- room tone, reverb, mic color, and channel texture
- long-form state
- fine acoustic detail
Putting all of that through one stream can work, but it creates pressure. The next wave of systems splits the job.
SPEAR-TTS uses semantic tokens and acoustic tokens as separate stages. NaturalSpeech 3 uses factorized codec streams. CosyVoice and CosyVoice 2 use supervised semantic speech tokenizers plus acoustic reconstruction paths. MaskGCT uses masked generation over semantic and acoustic representations. Voxtral TTS uses one semantic VQ token plus acoustic FSQ values per frame. VoXtream and VoXtream2 use Mimi-style speech tokens in streaming pipelines.
These systems are not all doing the same thing. But they share one idea: content and acoustic detail do not have to travel through the same channel.

Original generated overview figure. This is a conceptual map of where different information types tend to live across architecture families. It is not a scorecard; real systems mix patterns.
The split creates new debugging questions.
If a model says the right words but the wrong speaker, which stage lost identity? If it keeps the speaker but drops a phrase, did the semantic stream fail, or did the acoustic stage hide a semantic error behind plausible audio? If prompt denoising improves naturalness while lowering speaker similarity, did the system lose voice identity, or did it remove room tone that the similarity metric was rewarding?
The output waveform can sound plausible while the intermediate streams disagree. That is why cascaded systems need agreement diagnostics, not only end-point MOS, WER, and speaker similarity.
A useful probe is:
intended text
vs. intermediate semantic tokens or predicted transcript
vs. ASR on final waveform
vs. speaker and channel probes
The exact probe changes by model, but the principle holds. Once the system has separate carriers, measure whether they still agree.
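A minimal sketch of such a probe, assuming each stage can be reduced to a transcript (for example by running ASR on intermediate and final outputs). The function names are hypothetical, and the word error rate here is a plain word-level edit distance with no text normalization.

```python
from typing import Dict, List


def word_errors(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance: substitutions + insertions + deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    ref_words = ref.lower().split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return word_errors(ref_words, hyp.lower().split()) / len(ref_words)


def agreement_report(intended: str, semantic_transcript: str, final_asr: str) -> Dict[str, float]:
    """Compare each pair of carriers so a disagreement can be localized."""
    return {
        "text_vs_semantic": wer(intended, semantic_transcript),
        "semantic_vs_audio": wer(semantic_transcript, final_asr),
        "text_vs_audio": wer(intended, final_asr),
    }


report = agreement_report(
    intended="the quick brown fox jumps",
    semantic_transcript="the quick brown fox jumps",
    final_asr="the quick fox jumps",
)
print(report)
```

In this toy case the semantic stream matches the text and the error appears only after the acoustic stage, which is the kind of attribution an end-point WER alone cannot give.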
Prompt Audio Is An Overloaded Bus
Voice cloning makes this harder.
Prompt audio is rarely just "the speaker." It also contains microphone color, room tone, reverberation, speaking rate, emotion, accent, breath noise, recording quality, and background texture. A zero-shot TTS system can copy too much, too little, or the wrong part of that bundle.
This is visible across many families. VALL-E-style systems condition on prompt codec tokens. SPEAR-TTS and Make-A-Voice pass prompt tokens through semantic and acoustic stages. VoXtream2 combines Mimi prompt tokens with a separate speaker embedding. OmniVoice exposes denoising and voice-design controls in its release interface. VIBEVOICE separates acoustic and semantic views of generated history.
The builder question is not just:
Does the cloned voice sound similar?
It is:
Which signal did the prompt carry?
identity, prosody, style, accent, room, mic, noise, rate, or all of them?
A simple factorial prompt test can reveal the difference:
| Prompt pair | What it tests |
|---|---|
| Same speaker, different rooms | Whether room tone leaks into identity |
| Different speakers, same room | Whether channel color dominates similarity |
| Same speaker, different emotion | Whether style transfer is controlled or accidental |
| Clean prompt vs. reverberant prompt | Whether denoising changes identity metrics |
| Same words, different speaking rate | Whether prompt rate controls output rate |
This is especially important for evaluation. Speaker similarity metrics can reward prompt-channel artifacts. A system that strips reverb may sound cleaner but score worse if the metric treats the recording condition as part of the speaker match.
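The prompt pairs in the table can be selected mechanically from labeled prompts. The sketch below assumes each prompt has metadata for a few factors; the factor names, prompt IDs, and labels are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Prompt = Dict[str, str]
FACTORS = ("speaker", "room", "emotion", "rate")  # assumed metadata fields


def single_factor_pairs(prompts: List[Prompt]) -> Dict[str, List[Tuple[str, str]]]:
    """Group prompt pairs that differ in exactly one factor, keyed by that factor."""
    pairs: Dict[str, List[Tuple[str, str]]] = {factor: [] for factor in FACTORS}
    for a, b in combinations(prompts, 2):
        differing = [factor for factor in FACTORS if a[factor] != b[factor]]
        if len(differing) == 1:
            pairs[differing[0]].append((a["id"], b["id"]))
    return pairs


prompts = [
    {"id": "p1", "speaker": "A", "room": "studio", "emotion": "neutral", "rate": "normal"},
    {"id": "p2", "speaker": "A", "room": "hall", "emotion": "neutral", "rate": "normal"},
    {"id": "p3", "speaker": "B", "room": "studio", "emotion": "neutral", "rate": "normal"},
    {"id": "p4", "speaker": "A", "room": "studio", "emotion": "angry", "rate": "normal"},
    {"id": "p5", "speaker": "A", "room": "studio", "emotion": "neutral", "rate": "fast"},
]
pairs = single_factor_pairs(prompts)
print(pairs)
```

Pairs that differ in more than one factor are dropped on purpose, since a change in output could not be attributed to a single part of the prompt bundle.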
Continuous Latents Return
Discrete codec tokens were a major step, but they are not the end of the story.
Continuous-latent systems try a different tradeoff. Instead of forcing speech through discrete IDs, they generate continuous latent vectors from an audio autoencoder, VAE, or codec-like representation.
NaturalSpeech 2 uses continuous codec latents with diffusion. MegaTTS 3 uses WaveVAE latents with rectified flow. VIBEVOICE uses next-token diffusion over continuous acoustic latents for long-form podcast-like generation. Pocket TTS / CALM uses continuous VAE-GAN latents. TADA uses continuous text-aligned acoustic vectors. Audio-Omni uses a pretrained VAE/oobleck-style autoencoder with rectified flow.
The attraction is clear. A continuous target avoids some codebook-utilization and residual-depth problems. It can also make long-form low-rate state more natural, because the model is not forced to choose from a fixed acoustic vocabulary at every step.
The cost is also clear. A continuous latent is not generated by a normal softmax. The model must pay through some other mechanism:
- diffusion steps
- flow or ODE solver steps
- an implicit distribution head
- a continuous autoregressive sampler
- a latent decoder that may or may not be causal
Continuous latents remove one bottleneck and introduce another.
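To show where the solver cost comes from, here is a toy Euler sampler that counts function evaluations. It is a sketch under strong assumptions: `toy_velocity` is a hypothetical stand-in for a trained velocity network, and the latent is a two-dimensional list, not a real audio latent.

```python
from typing import Callable, List, Tuple

Velocity = Callable[[List[float], float], List[float]]


def euler_sample(velocity: Velocity, x0: List[float], n_steps: int) -> Tuple[List[float], int]:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    x = list(x0)
    dt = 1.0 / n_steps
    n_evals = 0
    for step in range(n_steps):
        v = velocity(x, step * dt)  # in a real model this is one network forward pass
        n_evals += 1
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, n_evals


def toy_velocity(x: List[float], t: float) -> List[float]:
    # Constant velocity: a perfectly straight path from the start point to the target.
    return [1.0, -2.0]


latent, n_evals = euler_sample(toy_velocity, x0=[0.0, 0.0], n_steps=4)
print(latent, n_evals)
```

With a real learned velocity field the path is not perfectly straight, so fewer steps usually trade quality for speed, and each extra step is another full pass through the network.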
The Decision Budget
Many papers advertise a low frame rate. That number matters, but it is not the whole compute story.
A 12.5 Hz codec sounds cheap until each frame carries 16 or 32 residual codebook decisions. A non-autoregressive model sounds cheap until it needs many flow or diffusion steps. A semantic/acoustic cascade sounds clean until the system runs a semantic model, an acoustic model, and a decoder. A streaming system sounds fast until chunk overlap, lookahead, and decoder buffering are counted.

Original generated overview figure. The same acoustic-detail work can be paid as frame rate, codebook depth, local residual loops, cascade passes, solver steps, or decoder cost.
A better comparison is a decision ledger.
For each system, record:
frame rate
codebooks or streams per frame
local residual-depth passes
semantic/acoustic cascade passes
diffusion or flow steps
decoder or vocoder latency
chunk overlap and lookahead
hardware and batching
Then compare WER, speaker similarity, naturalness, latency, and long-form drift against that ledger. The point is not to reduce everything to one scalar. The point is to avoid pretending that low Hz alone means low work.
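A ledger like this can be kept as a small record per system. The sketch below is one possible shape, with hypothetical field names and illustrative numbers; it multiplies the per-frame factors into a crude decisions-per-second count and leaves decoder latency and hardware out, because those are in different units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionLedger:
    """Hypothetical per-system ledger; defaults of 1 mean the cost is not paid."""
    name: str
    frame_rate_hz: float
    streams_per_frame: int = 1
    local_passes: int = 1
    cascade_passes: int = 1
    solver_steps: int = 1

    def decisions_per_second(self) -> float:
        return (
            self.frame_rate_hz
            * self.streams_per_frame
            * self.local_passes
            * self.cascade_passes
            * self.solver_steps
        )


# Illustrative configurations, not measurements of any named system.
ledgers = [
    DecisionLedger("flat low-rate tokens", frame_rate_hz=50.0),
    DecisionLedger("deep RVQ stack", frame_rate_hz=12.5, streams_per_frame=32),
    DecisionLedger("flow over latents", frame_rate_hz=25.0, solver_steps=16),
]
for ledger in ledgers:
    print(f"{ledger.name}: {ledger.decisions_per_second():.0f} decisions/s")
```

In this toy comparison the 12.5 Hz system does eight times the per-second work of the 50 Hz one, which is the point of keeping the ledger next to the headline frame rate.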
MOSS-TTS is the cleanest example. It operates at a low 12.5 frames per second, but one frame can include a deep RVQ stack. The design question becomes how to model residual depth: delay pattern, local transformer, or another allocation of current-frame and long-context state.
Voxtral TTS creates a different ledger: one semantic token plus many acoustic FSQ values, with flow matching used for the acoustic part. Continuous-latent systems move cost into sampler steps and decoder calibration. Flow-matching mel systems move cost into the number of function evaluations (NFE) and vocoding.
Work moves. It does not disappear.
Streaming Is A Whole Pipeline Property
Autoregression does not automatically mean streaming. A model can generate tokens left to right and still require the full text, a non-causal duration plan, a whole-utterance acoustic sampler, a non-causal decoder, or a large buffer before audio can start.
Streaming TTS is an end-to-end causality problem.
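One way to check that property is to write down the lookahead each stage needs and look for the first stage that needs the whole utterance. This is a sketch with hypothetical stage names and lookahead values; `None` marks a stage assumed to be non-causal.

```python
from typing import List, Optional, Tuple

# (stage name, lookahead in seconds); None means the stage needs the full utterance.
Stage = Tuple[str, Optional[float]]


def first_noncausal_stage(stages: List[Stage]) -> Optional[str]:
    for name, lookahead_s in stages:
        if lookahead_s is None:
            return name
    return None


def total_lookahead_s(stages: List[Stage]) -> Optional[float]:
    """Sum of bounded lookaheads, or None if any stage blocks streaming."""
    if first_noncausal_stage(stages) is not None:
        return None
    return sum(lookahead_s for _, lookahead_s in stages if lookahead_s is not None)


blocked = [
    ("normalization", 0.0),
    ("semantic planning", 0.25),
    ("acoustic units", 0.0),
    ("codec decoder", None),  # assumed non-causal decoder
]
chunk_safe = blocked[:-1] + [("codec decoder", 0.125)]

print(first_noncausal_stage(blocked), total_lookahead_s(blocked))
print(first_noncausal_stage(chunk_safe), total_lookahead_s(chunk_safe))
```

The generator is autoregressive in both pipelines; only the decoder changes, and that alone decides whether the system can stream.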

Original generated overview figure. A streaming claim should name the bounded lookahead and timing contribution of text arrival, planning, acoustic-unit generation, decoding, chunking, and first-audio delivery.
A useful streaming report separates:
text wait time
normalization and phoneme lookahead
duration or semantic planning
acoustic unit generation
codec or vocoder decoding
chunk scheduling
network/client buffering
That gives two different measurements:
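- TTFC or TTFA (time to first chunk, or time to first audio): how long until the first audible chunk is ready.
- RTF (real-time factor): steady-state wall-clock time per second of generated audio. Values below 1 mean generation is faster than real time.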
Those numbers are not interchangeable. A system can have good RTF but poor time-to-first-audio if it waits for a full sentence. Another system can produce quick first audio but accumulate prosody debt, boundary artifacts, or correction problems when late punctuation or new text changes the intended phrasing.
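Both numbers can be computed from the same chunk log. The sketch below assumes a hypothetical log of `(ready_time_s, audio_duration_s)` pairs and treats everything after the first chunk as steady state; the timings are made up for illustration.

```python
from typing import Dict, Sequence, Tuple

Chunk = Tuple[float, float]  # (wall-clock time the chunk was ready, seconds of audio in it)


def streaming_metrics(request_time_s: float, chunks: Sequence[Chunk]) -> Dict[str, float]:
    if len(chunks) < 2:
        raise ValueError("need at least two chunks to estimate steady-state RTF")
    first_ready_s = chunks[0][0]
    last_ready_s = chunks[-1][0]
    steady_audio_s = sum(duration for _, duration in chunks[1:])
    return {
        "ttfa_s": first_ready_s - request_time_s,
        "steady_rtf": (last_ready_s - first_ready_s) / steady_audio_s,
    }


# Starts early, then produces audio at half of real time.
eager = streaming_metrics(0.0, [(0.75, 0.5), (1.0, 0.5), (1.25, 0.5), (1.5, 0.5)])
# Waits for the full sentence, then produces audio very quickly.
sentence_wait = streaming_metrics(0.0, [(2.0, 0.5), (2.125, 0.5), (2.25, 0.5), (2.375, 0.5)])
print(eager)
print(sentence_wait)
```

The second system has the better RTF and the worse time to first audio, so reporting only one of the two would rank them in opposite orders.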
This is why systems such as LLMVoX, Qwen3-TTS, GPA, VoXtream, VoXtream2, Voxtral TTS, Kyutai DSM TTS, and Fish Audio S2 are interesting beyond their model labels. They try to make more of the pipeline causal or chunk-safe. The right comparison is not "AR vs NAR." It is where the non-causal boundary sits.
The Clock Is A Model Component
The subproblem that keeps reappearing is the clock.
Text arrives as symbols. Speech leaves as time. Somewhere in the system, text positions have to become acoustic time positions.
Older mel systems expose this directly through alignments, duration predictors, or length regulators. Codec LMs can hide it inside token generation, but then they can skip, repeat, or drift. Masked systems often need a target length. Streaming systems need partial clocks. Text-aligned latent systems make the clock explicit again.
This makes duration and progress tracking a first-class subsystem.
A good duration-stress evaluation would include:
- fast and slow style prompts
- heavy punctuation
- long clauses
- code switching
- numbers, acronyms, and names
- late-arriving text in streaming mode
- repeated phrases designed to trigger loops
Then measure:
- duration error
- skipped and repeated words
- pause placement
- speaking-rate drift
- alignment entropy
- local WER around punctuation and clause boundaries
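Two of those measurements are easy to approximate from a reference text and an ASR transcript of the output. This is a deliberately crude sketch: it compares word counts without alignment, so it cannot tell where in the utterance a skip or repeat happened, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def skipped_and_repeated(reference: str, asr: str) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Bag-of-words estimate of skipped words and extra copies of reference words."""
    ref = Counter(reference.lower().split())
    hyp = Counter(asr.lower().split())
    skipped = dict(ref - hyp)  # Counter subtraction keeps only positive counts
    repeated = {word: n for word, n in (hyp - ref).items() if word in ref}
    return skipped, repeated


def relative_duration_error(target_s: float, actual_s: float) -> float:
    if target_s <= 0:
        raise ValueError("target duration must be positive")
    return abs(actual_s - target_s) / target_s


skipped, repeated = skipped_and_repeated(
    reference="please call me back please",
    asr="please call call call back please",
)
print(skipped, repeated)
print(relative_duration_error(target_s=4.0, actual_s=5.0))
```

An aligned version would be needed for pause placement and for local WER around clause boundaries; the count-based version is only a cheap first filter for loops and dropped words.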
This is not only an evaluation concern. It changes API design. A system that has a native duration or speaking-rate control can expose it honestly. A system that only has sampler temperature should not pretend the same control exists.
The Codec Oracle Gap
When a TTS model uses a codec, tokenizer, vocoder, or audio autoencoder, the first evaluation row should be the oracle reconstruction path:
ground truth audio -> tokenizer / encoder -> decoder -> reconstructed audio
Then evaluate the generated path:
text -> generator -> generated units -> decoder -> generated audio
The difference between those rows is the gap the generator is responsible for. Without the oracle row, a TTS result can be a tokenizer result, a decoder result, a sample-rate result, or a data-filtering result in disguise.
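As a small worked example, the gap can be computed per metric from the two rows. The metric names and values below are hypothetical placeholders, not results from any paper.

```python
from typing import Dict


def oracle_gap(oracle: Dict[str, float], generated: Dict[str, float]) -> Dict[str, float]:
    """Per-metric difference (generated minus oracle) over metrics present in both rows."""
    shared = oracle.keys() & generated.keys()
    return {metric: generated[metric] - oracle[metric] for metric in sorted(shared)}


# Hypothetical rows: the oracle path sets the floor the decoder allows.
oracle_row = {"wer_percent": 2.0, "speaker_similarity": 0.75}
generated_row = {"wer_percent": 5.0, "speaker_similarity": 0.5}

gap = oracle_gap(oracle_row, generated_row)
print(gap)
```

Here 2 of the 5 WER points already exist on the reconstruction path, so only the remaining 3 can be attributed to the generator.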

Original generated overview figure. The codec-oracle path and component-swap matrix help attribute gains to the tokenizer, generator, or decoder instead of crediting the whole stack to one architecture label.
The same logic applies to cross-model claims. If one paper changes the tokenizer, generator, data, decoder, sample rate, prompt policy, and evaluation set, the result is a system recipe. It may be a good system recipe, but it is not clean evidence that one generator family beat another.
The cleaner artifact is a component-swap matrix:
| Hold constant | Swap | What it tells you |
|---|---|---|
| Same tokenizer and decoder | Generator | Whether AR, masked, flow, diffusion, or continuous heads help on the same units |
| Same generator and decoder | Tokenizer | Whether the representation improved the task |
| Same generated units | Decoder | Whether waveform quality changed below the generator |
| Same eval and data recipe | Inference policy | Whether sampling, guidance, chunking, or filtering caused the gain |
Few papers expose enough to run the whole matrix. That is exactly why the question is useful. It prevents architecture names from carrying more evidence than the experiment supports.
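A quick way to apply the matrix to a reported comparison is to diff the two recipes and see how many components changed. The component keys and recipe values below are hypothetical labels, not descriptions of real systems.

```python
from typing import Dict, List

COMPONENTS = ("tokenizer", "generator", "decoder", "data", "inference_policy")


def swapped_components(a: Dict[str, str], b: Dict[str, str]) -> List[str]:
    return [component for component in COMPONENTS if a[component] != b[component]]


def classify_comparison(a: Dict[str, str], b: Dict[str, str]) -> str:
    swapped = swapped_components(a, b)
    if not swapped:
        return "identical recipes"
    if len(swapped) == 1:
        return f"controlled swap: {swapped[0]}"
    return "system recipe comparison: " + ", ".join(swapped)


baseline = {
    "tokenizer": "codec-a", "generator": "ar-lm", "decoder": "codec-a-decoder",
    "data": "corpus-1", "inference_policy": "top-k",
}
generator_swap = dict(baseline, generator="masked")
new_system = dict(baseline, tokenizer="codec-b", generator="flow", data="corpus-2")

print(classify_comparison(baseline, generator_swap))
print(classify_comparison(baseline, new_system))
```

Only the first comparison supports a claim about the generator; the second is evidence about a whole recipe.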
How To Choose A Family
No family wins every setting.
For a controllable paired-TTS product, acoustic/vocoder or flow-mel systems are still attractive. They have clear timing machinery and a familiar waveform path. They are easier to debug when pronunciation, duration, or prosody is the main problem.
For zero-shot voice cloning with large-scale pretraining, codec LMs are natural. They turn speech generation into token modeling and make prompt audio easy to include. The risk is that the codec interface, not just the LM, defines quality and latency.
For expressive cloning, multilingual generation, style control, and dialogue, semantic/acoustic cascades become attractive. They give separate places for content and acoustic detail. The risk is disagreement between stages and harder evaluation.
For long-form speech, podcast generation, and systems where discrete codebook depth becomes awkward, continuous latents are a serious branch. They can carry compact state through time, but they need careful sampler and decoder instrumentation.
For streaming, ignore the family label until the pipeline is traced. The system needs bounded lookahead from text arrival to first audio. A non-causal stage anywhere in the path can block streaming.
Practical Comparison Checklist
When comparing two TTS papers, ask these questions before looking at MOS:
- What object does the generator predict?
- Where does the clock live?
- Is prompt audio an identity signal, a style signal, a channel signal, or all of them?
- Is the tokenizer reused, frozen, trained by the paper, or ambiguous?
- What is the codec-oracle reconstruction floor?
- How many acoustic decisions are made per second after codebooks, solver steps, cascades, and decoders are counted?
- Which parts of the pipeline are causal or chunk-safe?
- Are baselines matched for data, tokenizer, decoder, sample rate, and inference policy?
- What happens on long-form boundary tests?
- Which control knobs are native, and which are only prompt-implied?
This checklist is often more useful than another architecture diagram. It turns the paper from a name into an engineering contract.
Technical Takeaways
1. Interface contracts beat generator labels
The most useful first question is not whether a model is AR, NAR, diffusion, or flow. It is what crosses the generator boundary.
The generated object defines the rest of the stack: duration machinery for mels, codebook depth for codec tokens, agreement checks for semantic/acoustic cascades, and sampler or decoder debt for continuous latents. Two systems with the same backbone label can have different failure modes because their interfaces are different.
The builder diagnostic is a component ledger: prediction target, frame or latent rate, streams per frame, waveform path, and which module owns speaker, prosody, and duration.
2. Every TTS system needs a clock
Speech is text placed onto time. Systems differ in where they build that clock: duration predictors, alignment models, target token counts, AR token progress, masked target lengths, duration-state tokens, text-aligned latents, or chunk schedules.
This is not a minor inference detail. Clock errors become skipped words, repetitions, awkward pauses, speaking-rate drift, and streaming instability. The useful test is a duration-stress suite with punctuation, long clauses, rate control, repeated phrases, and partial text arrival.
3. Report the oracle gap and the decision budget
Codec and latent systems need two extra accounting rows.
First, report the codec-oracle reconstruction floor:
reference audio -> tokenizer / encoder -> decoder -> reconstructed audio
Second, report the decision budget:
frame rate * streams * local passes * solver steps + decoder cost
Those rows stop the post-hoc story from crediting the wrong component. A model may win because the generator is better, because the tokenizer is better, because the decoder is better, because the data is larger, or because the eval setup changed. The architecture claim should only be as strong as the controlled comparison.
The Shape Of The Stack
The modern TTS stack is less chaotic than the model list suggests.
Most systems are arranging the same responsibilities: text content, time, speaker identity, prosody, acoustic detail, channel texture, long-form state, and waveform reconstruction. The differences come from where each responsibility lives.
Mels keep the acoustic target visible. Codec LMs turn speech into language modeling over audio codes. Semantic/acoustic cascades split content from detail. Continuous latents trade discrete codebooks for samplers and latent decoders. Streaming systems force the whole chain to respect bounded lookahead.
That is the practical map. Choose the interface first, then judge the model.
Sources
- VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
- SPEAR-TTS: Text-to-Speech with Discrete Speech Tokens
- VALL-E X: Speak Foreign Languages with Your Own Voice
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers
- Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
- StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
- Matcha-TTS: A Fast TTS Architecture with Conditional Flow Matching
- NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
- CosyVoice: A Scalable Multilingual Zero-Shot Text-to-Speech Synthesizer based on Supervised Semantic Tokens
- MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
- F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
- CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
- MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
- VibeVoice: Expressive Podcast Generation with Next-Token Diffusion
- LEMAS-TTS: A Large-Scale Evaluation and Modeling Suite for Open Text-to-Speech
- Qwen3-TTS: Technical Report
- MOSS-TTS: Technical Report
- Voxtral TTS: Technical Report
- VoXtream2: Full-stream Text-to-Speech with Extremely Low Latency
- OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
- T5Gemma-TTS: Investigating T5Gemma for Japanese Text-to-Speech
- Audio-Omni: Audio-Language Models Can Be Audio Generators Too
- Raon-OpenTTS: Open Data, Open Model, Open TTS
- Related Diffio post: VibeVoice TTS: Next-Token Diffusion
- Related Diffio post: MOSS-TTS: Modeling The RVQ Stack