TTS Models: A Map Of The Modern Speech Stack


Text-to-speech used to be easy to sketch.

text -> acoustic model -> vocoder -> waveform

That sketch still describes many good systems. It also hides most of what has changed.

The newer TTS papers are not just swapping one Transformer for another. They are choosing different objects for the model to predict: mel frames, STFT frames, waveform latents, codec tokens, semantic tokens, residual codebook stacks, continuous VAE latents, text-aligned acoustic vectors, or some cascade of those pieces.

Once that object is chosen, the rest of the system inherits its contract. A model that predicts mels needs a clock and a vocoder. A model that predicts EnCodec-style tokens inherits the codec frame rate, codebook depth, and decoder. A model that splits semantic and acoustic streams has to keep the streams in agreement. A model that predicts continuous latents needs a sampler or distribution head instead of a softmax.

That is the cleanest way to organize the current TTS stack.

Architecture taxonomy map

Original generated overview figure. The figure compresses the survey into five architecture families across 2022-2026: acoustic/vocoder systems, early speech-token hybrids, discrete codec LMs, semantic/acoustic cascades, and continuous-latent generators.

The model names are noisy, but the interfaces repeat.

The Five Families

The survey behind this post covers 77 TTS and speech-generation systems from 2022 through 2026. The list is not a leaderboard. It is a map of recurring design contracts.

Family | Count | What the model usually predicts | Representative systems
Acoustic/vocoder and VITS-style generators | 22 | Mel frames, STFTs, waveform, or integrated VITS-style acoustic latents | Nix-TTS, NaturalSpeech, StyleTTS, ProDiff, MB-iSTFT-VITS, OverFlow, MMS-TTS, Mega-TTS, Voicebox, StyleTTS 2, Mega-TTS 2, VITS2, Matcha-TTS, OpenVoice, E2 TTS, M2SE-VTTS, F5-TTS, ZipVoice, UniVoice, ParaStyleTTS, LEMAS-TTS, Raon-OpenTTS
Early speech-token hybrids | 2 | Internal VQ-VAE or mel-speech tokens before modern neural-codec TTS stabilizes | TorToise, XTTS
Discrete codec LMs and masked codec models | 19 | One or more discrete acoustic-code streams decoded by a neural codec | VALL-E, VALL-E X, Parler-TTS, BASE TTS, VoiceCraft, Mini-Omni, SSR-Speech, LLaSA, Koel-TTS / MagpieTTS, IndexTTS, LLMVoX, Voila, VoiceStar, Inworld TTS-1, Kyutai DSM TTS, Qwen3-TTS, MOSS-TTS, OmniVoice, T5Gemma-TTS
Semantic/acoustic and factorized-codec cascades | 23 | Content, speaker/style, duration, global, or acoustic-detail streams before waveform reconstruction | SPEAR-TTS, Make-A-Voice, NaturalSpeech 3, Seed-TTS, CosyVoice, MaskGCT, Fish-Speech, CosyVoice 2, Metis, Spark-TTS, Kimi-Audio, CosyVoice 3, OpenAudio S1 / FishAudio S1, IndexTTS2, Marco-Voice, BatonVoice / BatonTTS, SoulX-Podcast, GLM-TTS, GPA, Fish Audio S2, Voxtral TTS, VoXtream, VoXtream2
Continuous-latent audio-autoencoder generators | 11 | Continuous codec or VAE latents, sampled by diffusion, flow, next-token diffusion, or a continuous AR head | NaturalSpeech 2, MegaTTS 3, SupertonicTTS, SLED-TTS, VIBEVOICE, Pocket TTS / CALM, LAM-TTS / LARoPE, VoxCPM, TADA, Ming-Omni-TTS, Audio-Omni

Those buckets are intentionally about interfaces, not about brand names. A "flow" model can predict mels, codec latents, or acoustic-detail streams. An "autoregressive" model can emit one token per frame, a stack of residual codebook decisions, or a continuous latent sampled by a local head. The family name only becomes useful when it says what object crosses the boundary.

The Object Being Predicted

Speech generation has many possible intermediate objects. Each one keeps some information cheap and makes other information expensive.

Speech interface ladder

Original generated overview figure. The ladder shows common prediction targets: text, a duration/alignment clock, mel or STFT frames, codec-token matrices, semantic/acoustic split streams, continuous latents, and waveform output.

A mel model predicts a dense acoustic surface. It usually needs a duration model or alignment mechanism to decide how many frames each text segment gets. It then hands those frames to a vocoder.
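In many mel systems the clock takes the form of a length regulator: each text-side vector is repeated for its predicted number of frames. The block below is a minimal pure-Python sketch of that idea, assuming integer per-phoneme durations have already been predicted. The function name and the toy values are illustrative and not taken from any specific system.

```python
from typing import List, Sequence


def length_regulate(
    phoneme_vectors: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each phoneme vector for its number of acoustic frames."""
    if len(phoneme_vectors) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[List[float]] = []
    for vector, n_frames in zip(phoneme_vectors, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # A duration of 0 drops the phoneme entirely: a skipped sound.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


# Three phonemes with hypothetical 2-dim encodings and predicted durations.
encodings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
frames = length_regulate(encodings, durations=[2, 0, 3])
print(len(frames))  # total frames equals the sum of the durations
```

The middle phoneme gets a duration of zero and never reaches the vocoder, which is how a clock error turns into a dropped sound regardless of vocoder quality.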

A codec LM predicts symbols from a neural audio codec. The waveform path is mostly delegated to the codec decoder. The generator can look like a normal language model, but the vocabulary is not text. It is the codec's opinion about speech.

A semantic/acoustic cascade splits the job. One stream carries words or phonetic content. Another stream carries speaker, prosody, and acoustic detail. This makes the model more controllable when it works, but the cascade can also create hidden disagreement between stages.

A continuous-latent model avoids discrete code IDs for the main acoustic target. That can be attractive when codebook depth becomes awkward, but a continuous target still needs a sampler, a flow, a diffusion head, an implicit distribution head, or another way to choose a point in a continuous space.

The output object is the first serious design choice.

Acoustic Models Are Not Obsolete

The oldest family in the survey is also the easiest to underestimate.

Acoustic/vocoder systems predict mels, STFTs, waveform, or VITS-style latent representations. Nix-TTS, NaturalSpeech, StyleTTS, ProDiff, MB-iSTFT-VITS, OverFlow, MMS-TTS, Mega-TTS, Voicebox, StyleTTS 2, Mega-TTS 2, VITS2, Matcha-TTS, OpenVoice, E2 TTS, F5-TTS, ZipVoice, UniVoice, LEMAS-TTS, and Raon-OpenTTS all belong somewhere in this older-looking branch.

That does not mean the branch stopped changing.

The transition from older duration/mel recipes to flow-matching mel systems is real. Voicebox, Matcha-TTS, E2 TTS, F5-TTS, ZipVoice, LEMAS-TTS, and Raon-OpenTTS all keep a spectrogram-like or mel-like target while changing the generator. The acoustic interface stays familiar, but the model used to fill it changes.

This family has a practical advantage: the boundary is understandable. There is a text side, a clock, an acoustic frame sequence, and a waveform decoder. For paired TTS, that can be a strong engineering position. It gives a builder places to debug:

text normalization -> phonemes -> durations -> acoustic frames -> vocoder

The cost is that the model must solve timing directly. If the clock is wrong, the audio can skip words, repeat words, rush punctuation, or stretch phrases even when the vocoder is excellent.

This is why length modeling keeps reappearing in modern papers under different names. A duration predictor, a target token count, a speech-token clock, a duration-state token, a chunk schedule, and a text-aligned latent rate all do related work. They answer the same question:

Where does the next piece of text land in time?

That question does not disappear when the model becomes larger.

The Codec-LM Turn

VALL-E changed the common mental model for TTS. Instead of predicting mels and then vocoding, it used EnCodec tokens as the speech representation. Text-conditioned TTS became closer to:

text + prompt codec tokens -> language model -> new codec tokens -> codec decoder

That move is powerful. It lets TTS borrow the scaling habits of language models: large corpora, next-token prediction, prompt conditioning, and autoregressive sampling. VALL-E, VALL-E X, Parler-TTS, VoiceCraft, LLaSA, Koel-TTS, IndexTTS, LLMVoX, Voila, VoiceStar, Inworld TTS-1, Kyutai DSM TTS, Qwen3-TTS, MOSS-TTS, OmniVoice, and T5Gemma-TTS all live in or near this codec-token branch.

The hidden tradeoff is that the codec becomes part of the model.

Codec lineage map

Original generated overview figure. The map shows tokenizer and codec lineages as shared infrastructure feeding codec LMs, semantic/acoustic cascades, and continuous-latent systems.

A codec token is not a neutral unit. It has a sample rate, a frame rate, a codebook count, a codebook size, a decoder, a training corpus, and a failure profile. If two TTS models use different codecs, their metrics compare generators and codec interfaces at the same time.
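The contract in the paragraph above can be written down as a small record, which makes the token rate and nominal bitrate easy to compare. This is a sketch only: the class name, field names, and both example configurations are hypothetical settings chosen to show the arithmetic, not a description of any named codec.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecContract:
    """Interface facts a generator inherits from its codec (illustrative)."""
    sample_rate_hz: int
    frame_rate_hz: float
    n_codebooks: int
    codebook_size: int

    @property
    def tokens_per_second(self) -> float:
        return self.frame_rate_hz * self.n_codebooks

    @property
    def bits_per_second(self) -> float:
        # Each token selects one of codebook_size entries.
        return self.tokens_per_second * math.log2(self.codebook_size)

    @property
    def samples_per_frame(self) -> float:
        return self.sample_rate_hz / self.frame_rate_hz


# Hypothetical settings, chosen only to show the arithmetic.
rvq_codec = CodecContract(24_000, 75.0, 8, 1024)
low_rate_codec = CodecContract(24_000, 12.5, 32, 1024)

print(rvq_codec.tokens_per_second, rvq_codec.bits_per_second)
print(low_rate_codec.tokens_per_second, low_rate_codec.bits_per_second)
```

In this toy comparison the second configuration runs at one sixth of the frame rate but still asks the generator for 400 tokens per second, so frame rate alone does not describe the interface.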

This is why "codec LM" is too broad as a final label. There are several different codec contracts:

  • EnCodec-style RVQ systems use multiple residual codebooks per frame.
  • DAC-like systems often inherit a different decoder and audio bandwidth profile.
  • XCodec2-style systems push toward low-rate single-code or compact token interfaces.
  • Mimi-style systems optimize for low-rate streaming speech tokens.
  • Higgs, SNAC, MOSS-Audio-Tokenizer, Qwen tokenizers, and Voxtral Codec define their own interface contracts.

The generator is only one part of the system recipe. A strong codec can raise the ceiling. A weak or mismatched codec can impose artifacts that no generator can fully remove.

Flat Tokens Were Not Enough

Flat codec tokens make TTS look like text generation. But speech is not a text sequence with a speaker label attached.

The model has to carry:

  • words
  • speaker identity
  • accent and pronunciation
  • prosody and speaking rate
  • emotion or style
  • room tone, reverb, mic color, and channel texture
  • long-form state
  • fine acoustic detail

Putting all of that through one stream can work, but it creates pressure. The next wave of systems splits the job.

SPEAR-TTS uses semantic tokens and acoustic tokens as separate stages. NaturalSpeech 3 uses factorized codec streams. CosyVoice and CosyVoice 2 use supervised semantic speech tokenizers plus acoustic reconstruction paths. MaskGCT uses masked generation over semantic and acoustic representations. Voxtral TTS uses one semantic VQ token plus acoustic FSQ values per frame. VoXtream and VoXtream2 use Mimi-style speech tokens in streaming pipelines.

These systems are not all doing the same thing. But they share one idea: content and acoustic detail do not have to travel through the same channel.

Information carrier map

Original generated overview figure. This is a conceptual map of where different information types tend to live across architecture families. It is not a scorecard; real systems mix patterns.

The split creates new debugging questions.

If a model says the right words but the wrong speaker, which stage lost identity? If it keeps the speaker but drops a phrase, did the semantic stream fail, or did the acoustic stage hide a semantic error behind plausible audio? If prompt denoising improves naturalness while lowering speaker similarity, did the system lose voice identity, or did it remove room tone that the similarity metric was rewarding?

The output waveform can sound plausible while the intermediate streams disagree. That is why cascaded systems need agreement diagnostics, not only end-point MOS, WER, and speaker similarity.

A useful probe is:

intended text
  vs. intermediate semantic tokens or predicted transcript
  vs. ASR on final waveform
  vs. speaker and channel probes

The exact probe changes by model, but the principle holds. Once the system has separate carriers, measure whether they still agree.
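One way to sketch the agreement probe is to compare transcripts at each stage with a word-level edit distance. The code below assumes the intermediate and final transcripts have already been produced by some decoder or ASR system, which is not called here. All function names and the example strings are illustrative.

```python
from typing import Dict, List


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance between two transcripts."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            substitution = previous[j - 1] + cost
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]


def agreement_report(intended: str, semantic: str, final_asr: str) -> Dict[str, int]:
    """Count word errors between each pair of stages."""
    return {
        "intended_vs_semantic": word_errors(intended, semantic),
        "semantic_vs_final": word_errors(semantic, final_asr),
        "intended_vs_final": word_errors(intended, final_asr),
    }


report = agreement_report(
    intended="the quick brown fox jumps",
    semantic="the quick brown fox jumps",  # semantic stage kept every word
    final_asr="the quick fox jumps",       # acoustic stage dropped "brown"
)
print(report)
```

Here the semantic stage agrees with the text and the error appears only after the acoustic stage, which localizes the failure in a way that end-point WER alone would not.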

Prompt Audio Is An Overloaded Bus

Voice cloning makes this harder.

Prompt audio is rarely just "the speaker." It also contains microphone color, room tone, reverberation, speaking rate, emotion, accent, breath noise, recording quality, and background texture. A zero-shot TTS system can copy too much, too little, or the wrong part of that bundle.

This is visible across many families. VALL-E-style systems condition on prompt codec tokens. SPEAR-TTS and Make-A-Voice pass prompt tokens through semantic and acoustic stages. VoXtream2 combines Mimi prompt tokens with a separate speaker embedding. OmniVoice exposes denoising and voice-design controls in its release interface. VIBEVOICE separates acoustic and semantic views of generated history.

The builder question is not just:

Does the cloned voice sound similar?

It is:

Which signal did the prompt carry?
identity, prosody, style, accent, room, mic, noise, rate, or all of them?

A simple factorial prompt test can reveal the difference:

Prompt pair | What it tests
Same speaker, different rooms | Whether room tone leaks into identity
Different speakers, same room | Whether channel color dominates similarity
Same speaker, different emotion | Whether style transfer is controlled or accidental
Clean prompt vs. reverberant prompt | Whether denoising changes identity metrics
Same words, different speaking rate | Whether prompt rate controls output rate

This is especially important for evaluation. Speaker similarity metrics can reward prompt-channel artifacts. A system that strips reverb may sound cleaner but score worse if the metric treats the recording condition as part of the speaker match.
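The factorial prompt test depends on pairs that differ in exactly one factor. A minimal sketch of selecting such pairs from prompt metadata follows. The factor names, the metadata schema, and the prompt IDs are all hypothetical, and a real test set would need recorded audio behind every row.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Prompt = Dict[str, str]

FACTORS = ("speaker", "room", "emotion", "rate")


def single_factor_pairs(prompts: List[Prompt]) -> Dict[str, List[Tuple[str, str]]]:
    """Group prompt pairs by the one factor in which they differ."""
    pairs: Dict[str, List[Tuple[str, str]]] = {factor: [] for factor in FACTORS}
    for a, b in combinations(prompts, 2):
        differing = [f for f in FACTORS if a[f] != b[f]]
        if len(differing) == 1:  # every other factor is held constant
            pairs[differing[0]].append((a["id"], b["id"]))
    return pairs


# Hypothetical prompt metadata for illustration.
prompts = [
    {"id": "p1", "speaker": "A", "room": "studio", "emotion": "neutral", "rate": "normal"},
    {"id": "p2", "speaker": "A", "room": "hall", "emotion": "neutral", "rate": "normal"},
    {"id": "p3", "speaker": "B", "room": "studio", "emotion": "neutral", "rate": "normal"},
    {"id": "p4", "speaker": "A", "room": "studio", "emotion": "angry", "rate": "normal"},
]
pairs = single_factor_pairs(prompts)
print(pairs)
```

Pairs that differ in two or more factors are discarded, because a change in the similarity score could not then be attributed to a single cause.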

Continuous Latents Return

Discrete codec tokens were a major step, but they are not the end of the story.

Continuous-latent systems try a different tradeoff. Instead of forcing speech through discrete IDs, they generate continuous latent vectors from an audio autoencoder, VAE, or codec-like representation.

NaturalSpeech 2 uses continuous codec latents with diffusion. MegaTTS 3 uses WaveVAE latents with rectified flow. VIBEVOICE uses next-token diffusion over continuous acoustic latents for long-form podcast-like generation. Pocket TTS / CALM uses continuous VAE-GAN latents. TADA uses continuous text-aligned acoustic vectors. Audio-Omni uses a pretrained VAE/oobleck-style autoencoder with rectified flow.

The attraction is clear. A continuous target avoids some codebook-utilization and residual-depth problems. It can also make long-form low-rate state more natural, because the model is not forced to choose from a fixed acoustic vocabulary at every step.

The cost is also clear. A continuous latent is not generated by a normal softmax. The model must pay through some other mechanism:

  • diffusion steps
  • flow or ODE solver steps
  • an implicit distribution head
  • a continuous autoregressive sampler
  • a latent decoder that may or may not be causal

Continuous latents remove one bottleneck and introduce another.
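To make the sampler cost concrete, here is a minimal fixed-step Euler integrator of the kind a flow-based generator might use at inference. Each step costs one call to the velocity function, so the step count is the number of function evaluations (NFE). The velocity function below is a toy constant drift standing in for a trained network, and all names are illustrative.

```python
from typing import Callable, List, Tuple

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_sample(velocity: VelocityFn, x0: Vector, steps: int) -> Tuple[Vector, int]:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    if steps < 1:
        raise ValueError("steps must be >= 1")
    x = list(x0)
    dt = 1.0 / steps
    nfe = 0
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)  # in a real system this is one network forward pass
        nfe += 1
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, nfe


def toy_velocity(x: Vector, t: float) -> Vector:
    """Toy stand-in for a trained network: a constant drift."""
    return [1.0, -2.0]


latent, nfe = euler_sample(toy_velocity, x0=[0.0, 0.0], steps=4)
print(latent, nfe)
```

With a learned, curved velocity field, fewer steps usually means a coarser approximation, which is where the quality versus latency trade for this family tends to show up.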

The Decision Budget

Many papers advertise a low frame rate. That number matters, but it is not the whole compute story.

A 12.5 Hz codec sounds cheap until each frame carries 16 or 32 residual codebook decisions. A non-autoregressive model sounds cheap until it needs many flow or diffusion steps. A semantic/acoustic cascade sounds clean until the system runs a semantic model, an acoustic model, and a decoder. A streaming system sounds fast until chunk overlap, lookahead, and decoder buffering are counted.

Token budget illusion

Original generated overview figure. The same acoustic-detail work can be paid as frame rate, codebook depth, local residual loops, cascade passes, solver steps, or decoder cost.

A better comparison is a decision ledger.

For each system, record:

frame rate
codebooks or streams per frame
local residual-depth passes
semantic/acoustic cascade passes
diffusion or flow steps
decoder or vocoder latency
chunk overlap and lookahead
hardware and batching

Then compare WER, speaker similarity, naturalness, latency, and long-form drift against that ledger. The point is not to reduce everything to one scalar. The point is to avoid pretending that low Hz alone means low work.
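A ledger like this can be kept as a small record per system. The sketch below multiplies frame rate, streams per frame, local passes, and solver steps into a rough decisions-per-second figure and keeps decoder latency as a separate field. The field names and the three example rows are hypothetical and are not measurements of any published system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionLedger:
    """Rough per-second accounting of generator work (illustrative fields)."""
    name: str
    frame_rate_hz: float
    streams_per_frame: int = 1       # codebooks or parallel streams
    local_passes: int = 1            # residual-depth passes per stream
    solver_steps: int = 1            # diffusion or flow steps
    decoder_latency_ms: float = 0.0  # reported alongside, not multiplied in

    def decisions_per_second(self) -> float:
        return (self.frame_rate_hz * self.streams_per_frame
                * self.local_passes * self.solver_steps)


# Hypothetical rows, chosen only to show how the totals move.
deep_rvq = DecisionLedger("low-rate deep RVQ", frame_rate_hz=12.5, streams_per_frame=32)
flat_tokens = DecisionLedger("flat single codebook", frame_rate_hz=50.0)
flow_latent = DecisionLedger("continuous latent + flow", frame_rate_hz=25.0, solver_steps=8)

for row in (deep_rvq, flat_tokens, flow_latent):
    print(row.name, row.decisions_per_second())
```

In these toy rows the system with the lowest frame rate makes the most decisions per second, which is the illusion the ledger is meant to expose.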

MOSS-TTS is the cleanest example. It operates at a low 12.5 frames per second, but one frame can include a deep RVQ stack. The design question becomes how to model residual depth: delay pattern, local transformer, or another allocation of current-frame and long-context state.
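As an illustration of one of those options, the sketch below applies a simple delay pattern of the kind used in some codec language models: codebook k is shifted right by k steps, so all codebooks can be predicted in parallel at each step while finer codebooks still see the coarser codes of the same frame. This is a generic sketch under that assumption, not a description of MOSS-TTS internals. The padding ID and the code values are hypothetical.

```python
from typing import List

PAD = -1  # hypothetical padding id for positions with no code yet


def apply_delay_pattern(codes: List[List[int]]) -> List[List[int]]:
    """Shift codebook k right by k steps. codes[k][t] is codebook k at frame t."""
    n_codebooks = len(codes)
    n_frames = len(codes[0])
    if any(len(row) != n_frames for row in codes):
        raise ValueError("all codebooks must have the same number of frames")
    n_steps = n_frames + n_codebooks - 1
    delayed = [[PAD] * n_steps for _ in range(n_codebooks)]
    for k, row in enumerate(codes):
        for t, code in enumerate(row):
            delayed[k][t + k] = code
    return delayed


codes = [
    [11, 12, 13],  # codebook 0 (coarsest)
    [21, 22, 23],  # codebook 1
    [31, 32, 33],  # codebook 2
]
delayed = apply_delay_pattern(codes)
for row in delayed:
    print(row)
```

Three frames with three codebooks take five sequence steps here instead of the nine a fully flattened stream would need, at the price of a short delay before the finest codebook of each frame is available.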

Voxtral TTS creates a different ledger: one semantic token plus many acoustic FSQ values, with flow matching used for the acoustic part. Continuous-latent systems move cost into sampler steps and decoder calibration. Flow-matching mel systems move cost into the number of function evaluations (NFE) and vocoding.

Work moves. It does not disappear.

Streaming Is A Whole Pipeline Property

Autoregression does not automatically mean streaming. A model can generate tokens left to right and still require the full text, a non-causal duration plan, a whole-utterance acoustic sampler, a non-causal decoder, or a large buffer before audio can start.

Streaming TTS is an end-to-end causality problem.
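One way to trace that causality is to list every stage with the amount of future input it needs before it can emit anything. The sketch below treats a whole-utterance requirement as infinite lookahead and reports the first stage that blocks streaming. Expressing every stage's lookahead in seconds is a simplifying assumption, and the stage names and numbers are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    lookahead_s: float  # future input needed before output; math.inf = whole utterance


def first_blocking_stage(stages: List[Stage]) -> Optional[str]:
    """Return the first stage that needs the whole utterance, or None."""
    for stage in stages:
        if math.isinf(stage.lookahead_s):
            return stage.name
    return None


def total_lookahead_s(stages: List[Stage]) -> float:
    return sum(stage.lookahead_s for stage in stages)


# Hypothetical pipelines: both generate tokens left to right.
ar_with_offline_decoder = [
    Stage("text frontend", 0.125),
    Stage("AR token generator", 0.0),
    Stage("non-causal decoder", math.inf),
]
chunk_safe = [
    Stage("text frontend", 0.125),
    Stage("AR token generator", 0.0),
    Stage("causal decoder", 0.25),
]

print(first_blocking_stage(ar_with_offline_decoder))
print(first_blocking_stage(chunk_safe), total_lookahead_s(chunk_safe))
```

Both pipelines use the same autoregressive generator, but only the second has a bounded total lookahead, which is the sense in which streaming is a property of the whole path.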

Streaming causality stack

Original generated overview figure. A streaming claim should name the bounded lookahead and timing contribution of text arrival, planning, acoustic-unit generation, decoding, chunking, and first-audio delivery.

A useful streaming report separates:

text wait time
normalization and phoneme lookahead
duration or semantic planning
acoustic unit generation
codec or vocoder decoding
chunk scheduling
network/client buffering

That gives two different measurements:

  • TTFC (time to first chunk) or TTFA (time to first audio): how long until the first audible chunk is ready.
  • RTF (real-time factor): steady-state wall-clock time per second of generated audio.

Those numbers are not interchangeable. A system can have good RTF but poor time-to-first-audio if it waits for a full sentence. Another system can produce quick first audio but accumulate prosody debt, boundary artifacts, or correction problems when late punctuation or new text changes the intended phrasing.
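Both measurements can be computed from a log of when each audio chunk became ready. The sketch below assumes such a log exists as (ready time, audio seconds) pairs. The steady-state real-time factor is approximated by excluding the first chunk, which is one reasonable convention among several. The names and timings are illustrative.

```python
from typing import List, Tuple

# Each chunk: (wall-clock time it became ready in seconds, seconds of audio it holds)
Chunk = Tuple[float, float]


def time_to_first_audio(request_time: float, chunks: List[Chunk]) -> float:
    """Delay between the request and the first ready chunk."""
    if not chunks:
        raise ValueError("need at least one chunk")
    return chunks[0][0] - request_time


def steady_state_rtf(chunks: List[Chunk]) -> float:
    """Wall-clock seconds per second of audio, measured after the first chunk."""
    if len(chunks) < 2:
        raise ValueError("need at least two chunks")
    wall = chunks[-1][0] - chunks[0][0]
    audio = sum(duration for _, duration in chunks[1:])
    return wall / audio


# Hypothetical logs for two systems, request sent at t = 0.0.
sentence_level = [(2.0, 4.0), (3.0, 4.0), (4.0, 4.0)]
small_chunks = [(0.25, 0.5), (0.5, 0.5), (0.75, 0.5)]

for name, log in (("sentence_level", sentence_level), ("small_chunks", small_chunks)):
    print(name, time_to_first_audio(0.0, log), steady_state_rtf(log))
```

In this toy log the sentence-level system has the better real-time factor and the worse time to first audio, which is why the two numbers need to be reported together.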

This is why systems such as LLMVoX, Qwen3-TTS, GPA, VoXtream, VoXtream2, Voxtral TTS, Kyutai DSM TTS, and Fish Audio S2 are interesting beyond their model labels. They try to make more of the pipeline causal or chunk-safe. The right comparison is not "AR vs NAR." It is where the non-causal boundary sits.

The Clock Is A Model Component

The subproblem that keeps reappearing is the clock.

Text arrives as symbols. Speech leaves as time. Somewhere in the system, text positions have to become acoustic time positions.

Older mel systems expose this directly through alignments, duration predictors, or length regulators. Codec LMs can hide it inside token generation, but then they can skip, repeat, or drift. Masked systems often need a target length. Streaming systems need partial clocks. Text-aligned latent systems make the clock explicit again.

This makes duration and progress tracking a first-class subsystem.

A good duration-stress evaluation would include:

  • fast and slow style prompts
  • heavy punctuation
  • long clauses
  • code switching
  • numbers, acronyms, and names
  • late-arriving text in streaming mode
  • repeated phrases designed to trigger loops

Then measure:

  • duration error
  • skipped and repeated words
  • pause placement
  • speaking-rate drift
  • alignment entropy
  • local WER around punctuation and clause boundaries
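The repeated-words measurement can be approximated by counting n-grams from the intended text that appear more often in an ASR transcript of the output than they should. This is a rough sketch that assumes the transcript is already available. It only detects repetition, not skips, and the function names are illustrative.

```python
from collections import Counter
from typing import Dict, List, Tuple


def ngram_counts(text: str, n: int) -> Counter:
    words: List[str] = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def extra_repeats(intended: str, transcript: str, n: int = 3) -> Dict[Tuple[str, ...], int]:
    """Intended n-grams that were spoken more times than the text contains them."""
    wanted = ngram_counts(intended, n)
    heard = ngram_counts(transcript, n)
    return {
        gram: heard[gram] - count
        for gram, count in wanted.items()
        if heard[gram] > count
    }


# A phrase that legitimately repeats once, and an output that loops on it.
intended = "call me back call me back now"
transcript = "call me back call me back call me back now"
loops = extra_repeats(intended, transcript)
print(loops)
```

Because counts are compared against the intended text, a phrase that is supposed to repeat is not flagged until the output repeats it more often than written.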

This is not only an evaluation concern. It changes API design. A system that has a native duration or speaking-rate control can expose it honestly. A system that only has sampler temperature should not pretend the same control exists.

The Codec Oracle Gap

When a TTS model uses a codec, tokenizer, vocoder, or audio autoencoder, the first evaluation row should be the oracle reconstruction path:

ground truth audio -> tokenizer / encoder -> decoder -> reconstructed audio

Then evaluate the generated path:

text -> generator -> generated units -> decoder -> generated audio

The difference between those rows is the gap the generator is responsible for. Without the oracle row, a TTS result can be a tokenizer result, a decoder result, a sample-rate result, or a data-filtering result in disguise.
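A minimal sketch of that bookkeeping subtracts the oracle row from the generated row, metric by metric. The metric names and all numbers below are placeholders for illustration and are not measurements of any system.

```python
from typing import Dict


def oracle_gap(oracle: Dict[str, float], generated: Dict[str, float]) -> Dict[str, float]:
    """Per-metric difference between the generated row and the oracle row."""
    shared = sorted(set(oracle) & set(generated))
    return {metric: generated[metric] - oracle[metric] for metric in shared}


# Placeholder numbers for illustration only.
oracle_row = {"wer_percent": 2.5, "speaker_sim": 0.75}
generated_row = {"wer_percent": 4.0, "speaker_sim": 0.5}

gap = oracle_gap(oracle_row, generated_row)
print(gap)
```

The oracle row is the floor set by the tokenizer and decoder, so only the remaining difference should be credited to, or blamed on, the generator.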

Component-swap baseline

Original generated overview figure. The codec-oracle path and component-swap matrix help attribute gains to the tokenizer, generator, or decoder instead of crediting the whole stack to one architecture label.

The same logic applies to cross-model claims. If one paper changes the tokenizer, generator, data, decoder, sample rate, prompt policy, and evaluation set, the result is a system recipe. It may be a good system recipe, but it is not clean evidence that one generator family beat another.

The cleaner artifact is a component-swap matrix:

Hold constant | Swap | What it tells you
Same tokenizer and decoder | Generator | Whether AR, masked, flow, diffusion, or continuous heads help on the same units
Same generator and decoder | Tokenizer | Whether the representation improved the task
Same generated units | Decoder | Whether waveform quality changed below the generator
Same eval and data recipe | Inference policy | Whether sampling, guidance, chunking, or filtering caused the gain

Few papers expose enough to run the whole matrix. That is exactly why the question is useful. It prevents architecture names from carrying more evidence than the experiment supports.

How To Choose A Family

No family wins every setting.

For a controllable paired-TTS product, acoustic/vocoder or flow-mel systems are still attractive. They have clear timing machinery and a familiar waveform path. They are easier to debug when pronunciation, duration, or prosody is the main problem.

For zero-shot voice cloning with large-scale pretraining, codec LMs are natural. They turn speech generation into token modeling and make prompt audio easy to include. The risk is that the codec interface, not just the LM, defines quality and latency.

For expressive cloning, multilingual generation, style control, and dialogue, semantic/acoustic cascades become attractive. They give separate places for content and acoustic detail. The risk is disagreement between stages and harder evaluation.

For long-form speech, podcast generation, and systems where discrete codebook depth becomes awkward, continuous latents are a serious branch. They can carry compact state through time, but they need careful sampler and decoder instrumentation.

For streaming, ignore the family label until the pipeline is traced. The system needs bounded lookahead from text arrival to first audio. A non-causal stage anywhere in the path can block streaming.

Practical Comparison Checklist

When comparing two TTS papers, ask these questions before looking at MOS:

  1. What object does the generator predict?
  2. Where does the clock live?
  3. Is prompt audio an identity signal, a style signal, a channel signal, or all of them?
  4. Is the tokenizer reused, frozen, trained by the paper, or ambiguous?
  5. What is the codec-oracle reconstruction floor?
  6. How many acoustic decisions are made per second after codebooks, solver steps, cascades, and decoders are counted?
  7. Which parts of the pipeline are causal or chunk-safe?
  8. Are baselines matched for data, tokenizer, decoder, sample rate, and inference policy?
  9. What happens on long-form boundary tests?
  10. Which control knobs are native, and which are only prompt-implied?

This checklist is often more useful than another architecture diagram. It turns the paper from a name into an engineering contract.

Technical Takeaways

1. Interface contracts beat generator labels

The most useful first question is not whether a model is AR, NAR, diffusion, or flow. It is what crosses the generator boundary.

The generated object defines the rest of the stack: duration machinery for mels, codebook depth for codec tokens, agreement checks for semantic/acoustic cascades, and sampler or decoder debt for continuous latents. Two systems with the same backbone label can have different failure modes because their interfaces are different.

The builder diagnostic is a component ledger: prediction target, frame or latent rate, streams per frame, waveform path, and which module owns speaker, prosody, and duration.

2. Every TTS system needs a clock

Speech is text placed onto time. Systems differ in where they build that clock: duration predictors, alignment models, target token counts, AR token progress, masked target lengths, duration-state tokens, text-aligned latents, or chunk schedules.

This is not a minor inference detail. Clock errors become skipped words, repetitions, awkward pauses, speaking-rate drift, and streaming instability. The useful test is a duration-stress suite with punctuation, long clauses, rate control, repeated phrases, and partial text arrival.

3. Report the oracle gap and the decision budget

Codec and latent systems need two extra accounting rows.

First, report the codec-oracle reconstruction floor:

reference audio -> tokenizer / encoder -> decoder -> reconstructed audio

Second, report the decision budget:

frame rate * streams * local passes * solver steps + decoder cost

Those rows stop the post-hoc story from crediting the wrong component. A model may win because the generator is better, because the tokenizer is better, because the decoder is better, because the data is larger, or because the eval setup changed. The architecture claim should only be as strong as the controlled comparison.

The Shape Of The Stack

The modern TTS stack is less chaotic than the model list suggests.

Most systems are arranging the same responsibilities: text content, time, speaker identity, prosody, acoustic detail, channel texture, long-form state, and waveform reconstruction. The differences come from where each responsibility lives.

Mels keep the acoustic target visible. Codec LMs turn speech into language modeling over audio codes. Semantic/acoustic cascades split content from detail. Continuous latents trade discrete codebooks for samplers and latent decoders. Streaming systems force the whole chain to respect bounded lookahead.

That is the practical map. Choose the interface first, then judge the model.

Sources