VibeVoice TTS: Next-Token Diffusion

Suppose we only want a text-to-speech system to speak one sentence.

It needs to know the words. It needs a speaker voice. It needs to turn those things into audio. This is already hard, but the shape of the problem is still pretty small.

Now ask for something more like a podcast. There may be several speakers. The audio may last many minutes. The system has to remember whose turn it is, what has already been said, how each speaker sounds, and whether the generated audio should keep the room tone and microphone color of the prompt.

This is where VibeVoice becomes interesting. Its main idea is not simply "diffusion for speech." It is a way of making a long speech generator keep state.

In the outer loop, VibeVoice still behaves like an autoregressive language model. But instead of choosing the next word from a vocabulary, it asks a diffusion head to sample the next continuous acoustic latent.

ordinary language model:
past text -> transformer -> next discrete token

VibeVoice:
past text + past speech -> transformer -> diffusion head -> next speech latent

That is the short version. The useful version is to look at what has to flow through the loop.

[Figure: VibeVoice architecture diagram]

Source: cropped from the VibeVoice paper, Figure 1, p. 2. The diagram shows voice prompts and text entering the VibeVoice core. The core conditions a diffusion head, the diffusion head predicts acoustic VAE latents, and an acoustic decoder turns those latents into speech.

A Normal Autoregressive Loop

A normal language model is easy to draw.

tokens so far -> model -> next token

Then the next token becomes part of the history.

tokens so far + next token -> model -> next token after that

If we unroll the loop, it looks like a chain. Each step receives a message from the past and passes a new message into the future.

state_0 -> state_1 -> state_2 -> state_3 -> ...

This is the right first mental model for VibeVoice. The system is still moving left to right through time. It is still using a causal model. The important difference is the thing being predicted at each step.

Text tokens are discrete. The model can choose one item from a vocabulary. Speech latents are continuous. There is no single vocabulary item to pick. So VibeVoice uses diffusion as the local prediction head.

history -> LLM hidden state -> diffusion head -> next acoustic latent

The diffusion head is not denoising the whole podcast at once. It is more like replacing the usual softmax head at the end of a language model. The LLM decides what the next piece of speech should be conditioned on, and the diffusion head draws a plausible continuous acoustic latent for that condition.
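To make the head swap concrete, here is a minimal sketch that contrasts the two kinds of output. Both functions are hypothetical stand-ins written for this post, not VibeVoice code: `softmax_head` picks one index from a vocabulary, while `continuous_head` returns a real-valued vector. The continuous head here just adds Gaussian noise to the hidden state; a real diffusion head would run iterative denoising instead.

```python
import math
import random

def softmax_head(logits, rng):
    """Discrete head: weight each vocabulary index by exp(logit) and pick one."""
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def continuous_head(hidden, rng, noise_scale=0.1):
    """Toy stand-in for a diffusion head: sample a real-valued vector
    near the hidden state. There is no vocabulary to choose from."""
    return [h + noise_scale * rng.gauss(0.0, 1.0) for h in hidden]

rng = random.Random(0)
token_id = softmax_head([2.0, 0.5, -1.0], rng)          # an integer index
latent = continuous_head([0.3, -0.7, 1.2, 0.0], rng)    # a list of floats
print(token_id, latent)
```

The only point of the comparison is the return type: one head answers "which item?", the other answers "which point in a continuous space?".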

Unrolling VibeVoice

The loop is easier to understand if we unroll one step.

voice prompt + script + previous generated speech
        |
        v
   VibeVoice core
        |
        v
   diffusion head
        |
        v
 next acoustic latent
        |
        v
 acoustic decoder -> waveform
        |
        v
 semantic encoder -> back into future history

There are three different kinds of information here.

First, the text script says what should be spoken and which speaker should be speaking.

Second, the voice prompt says what the speaker sounds like. But a voice prompt does not contain only identity. It also contains prosody, room tone, reverb, microphone response, background texture, and other recording details.

Third, the generated audio becomes part of the future. VibeVoice does not simply emit audio and forget it. It re-encodes what it generated and feeds that back into the next step.

This feedback path is where much of the architecture starts to make sense. VibeVoice is not just conditioned on the intended script. It is also conditioned on a representation of what it actually produced.
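The unrolled step above can be sketched as a loop. Every function below is a hypothetical toy stub chosen only to show the data flow (the names `core_step`, `diffusion_head`, `acoustic_decoder`, and `semantic_encoder` are assumptions, and the arithmetic inside them is arbitrary). What matters is that each step stores both the acoustic latent it produced and a semantic re-encoding of the decoded audio, and that the next step reads that history.

```python
import random

DIM = 4            # toy latent width, not the real model's
SAMPLES_PER_FRAME = 8  # toy decoder upsampling, not the real 3200x

def core_step(script_tokens, voice_prompt, speech_history):
    """Stub for the causal core: mix script, prompt, and fed-back speech state."""
    step = len(speech_history)
    token = script_tokens[min(step, len(script_tokens) - 1)]
    text_signal = len(token) / 10.0
    prompt_signal = sum(voice_prompt) / len(voice_prompt)
    last_semantic = speech_history[-1]["semantic"] if speech_history else 0.0
    return [text_signal + prompt_signal + 0.1 * last_semantic for _ in range(DIM)]

def diffusion_head(hidden, rng):
    """Stub for the diffusion head: sample a continuous latent near the condition."""
    return [h + 0.05 * rng.gauss(0.0, 1.0) for h in hidden]

def acoustic_decoder(latent):
    """Stub decoder: expand one latent frame into a short waveform chunk."""
    return [x for x in latent for _ in range(SAMPLES_PER_FRAME)]

def semantic_encoder(waveform_chunk):
    """Stub observer: summarize what was actually generated."""
    return sum(waveform_chunk) / len(waveform_chunk)

def generate(script_tokens, voice_prompt, n_frames, seed=0):
    rng = random.Random(seed)
    speech_history, waveform = [], []
    for _ in range(n_frames):
        hidden = core_step(script_tokens, voice_prompt, speech_history)
        latent = diffusion_head(hidden, rng)
        chunk = acoustic_decoder(latent)
        # Feedback path: re-encode the generated audio before the next step.
        speech_history.append({"acoustic": latent, "semantic": semantic_encoder(chunk)})
        waveform.extend(chunk)
    return speech_history, waveform

history, waveform = generate(["hello", "there"], [0.1, 0.2, 0.3], n_frames=5)
print(len(history), len(waveform))
```

Removing the `semantic_encoder` call from this sketch gives the shape of an acoustic-only loop, which is one of the variants the paper ablates.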

The Tiny Acoustic State

Long audio creates a sequence-length problem.

Raw 24 kHz audio has 24,000 samples per second. That is much too large to carry directly through a language-model context. Neural audio codecs make the sequence shorter, but many still operate at tens or hundreds of frames per second, often with multiple codebooks per frame.

VibeVoice makes a much stronger compression bet. Its acoustic tokenizer runs at 7.5 frames per second. The paper describes this as 3200x downsampling from 24 kHz audio.

That makes the long-form setting much more plausible. Thirty minutes at 7.5 frames per second is about 13,500 acoustic positions. Ninety minutes is about 40,500 acoustic positions, before adding text, prompts, speaker markers, and implementation details.
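The arithmetic is easy to check. The sketch below reproduces the numbers in this section; the only added assumption is that "64K context" means 65,536 positions, which is used to show how much room is left for text, prompts, and markers.

```python
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 7.5
CONTEXT_64K = 65_536  # assumption: "64K" read as 2**16 positions

def acoustic_positions(minutes, frame_rate_hz=FRAME_RATE_HZ):
    """Number of acoustic frames needed to cover `minutes` of audio."""
    return int(minutes * 60 * frame_rate_hz)

downsampling = SAMPLE_RATE_HZ / FRAME_RATE_HZ
print("downsampling factor:", downsampling)

for minutes in (30, 90):
    frames = acoustic_positions(minutes)
    print(minutes, "min ->", frames, "frames,", CONTEXT_64K - frames, "positions left in 64K")
```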

But a small state creates a new problem.

If the acoustic stream is extremely compact, it cannot cheaply carry everything. It has to preserve enough sound for the decoder. It may also be asked to carry speaker identity, prosody, transcript state, turn-taking, and recording condition.

That is too much responsibility for one narrow channel.

This is the first important way to understand VibeVoice: the 7.5 Hz tokenizer is not just compression. It is a choice about where information is allowed to travel.

Why VibeVoice Uses Two Speech Views

VibeVoice separates two views of speech.

The acoustic tokenizer is trained to preserve sound. It is a continuous sigma-VAE-style tokenizer. This is the path the diffusion head predicts.

The semantic tokenizer is trained with an ASR proxy task. It is meant to capture content and phonetic structure. This is not the thing VibeVoice directly generates. It is feedback that helps the future steps know what the generated audio contained.

So the model has something like this:

acoustic view:  "what did it sound like?"
semantic view:  "what was said?"

This split matters because the ablations show a real tension.

| Variant | What it asks the representation to do | What the paper reports |
| --- | --- | --- |
| Acoustic-only | Preserve speaker and sound | Good speaker similarity, much worse multi-speaker WER |
| Coupled tokenizer | Use one shared latent for reconstruction and ASR | Better WER than acoustic-only, but speaker similarity drops hard |
| Hybrid tokenizer | Keep separate acoustic and semantic paths | Better balance of intelligibility and speaker identity |

On the short VIBEVOICE-Eval set, the acoustic-only 1.5B model at 16K context has overall WER-W 6.22 and SIM-O 0.68. The final hybrid variant at the same size/context has WER-W 1.84 and SIM-O 0.64. The coupled tokenizer improves content relative to acoustic-only, but its overall SIM-O drops to 0.45.

It is tempting to turn this into a simple lesson: two tokenizers beat one.

That is not quite right.

The better lesson is that semantic pressure has to enter somewhere, and that choice changes what the acoustic channel is protected from.

| System | Where semantic pressure enters | What protects acoustic detail |
| --- | --- | --- |
| VibeVoice | A frozen semantic encoder feeds back generated audio history | The diffusion target stays an acoustic sigma-VAE latent |
| SemaVoice | WavLM alignment shapes the VAE latent during tokenizer training | The same latent still has to reconstruct waveform |
| Voxtral | An ASR-distilled semantic VQ token is the autoregressive target | A separate acoustic FSQ bundle is predicted by a flow head |
| MOSS-TTS | Audio-to-text LLM loss trains RVQ tokens to be more semantic | A larger variable-bitrate RVQ stack and reconstruction losses |

These systems are not all making the same bet. VibeVoice puts semantic pressure in a feedback path. SemaVoice pushes it into tokenizer training. Voxtral makes the semantic token the autoregressive object and predicts acoustic detail beside it. MOSS-TTS tries to make codec tokens more useful for text generation and reconstruction together.

So the design question is not "semantics or no semantics." It is:

Where should the model carry content?
Where should it carry sound?
What happens when those jobs share the same bottleneck?

The Semantic Encoder Is An Observer In The Loop

Now return to the feedback path.

After VibeVoice generates a speech segment, it runs a semantic encoder over the generated waveform. Then it feeds that semantic representation into future generation.

This makes the semantic encoder a kind of observer inside the generation loop.

If it hears the generated audio correctly, it can help the model stay aligned with the script. If it hears the audio incorrectly, future generation may be conditioned on a mistaken internal state.

This is different from evaluating the final audio with ASR after the fact. The semantic encoder is part of the loop that creates the next audio.

generate speech -> encode generated speech -> generate next speech

That suggests a diagnostic that would be very useful but is not in the paper. For each generated segment, log:

intended script span
expected speaker
external ASR transcript of generated audio
semantic-feedback representation or nearest text proxy
speaker embedding / diarization assignment
next-step WER or local edit distance

Then plot those errors over time.
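One way to hold that log is a small record per segment plus a word-level edit distance. This is a sketch of the proposed diagnostic, not something from the paper: the `SegmentTrace` class and its field names are hypothetical, and the ASR transcript, feedback text proxy, and speaker assignment are assumed to come from external tools.

```python
from dataclasses import dataclass

def edit_distance(ref, hyp):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

@dataclass
class SegmentTrace:
    index: int
    intended_text: str        # script span for this segment
    expected_speaker: str
    asr_text: str             # external ASR transcript of the generated audio
    feedback_text_proxy: str  # nearest text proxy for the semantic feedback
    assigned_speaker: str     # speaker embedding / diarization assignment

    def asr_wer(self):
        return word_error_rate(self.intended_text, self.asr_text)

    def feedback_wer(self):
        return word_error_rate(self.intended_text, self.feedback_text_proxy)

    def speaker_ok(self):
        return self.expected_speaker == self.assigned_speaker

trace = SegmentTrace(0, "hello there world", "A", "hello world", "hello there world", "A")
print(trace.asr_wer(), trace.feedback_wer(), trace.speaker_ok())
```

Plotting `asr_wer` and `feedback_wer` against `index` is what would show whether the observer drifts before the audio does.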

If semantic-feedback errors appear before content collapse, the system may need a better observer, a confidence gate, a second semantic teacher, or a fallback to explicit text state. If the feedback is mostly tracking where the script is, then explicit progress state might recover part of the gain without asking the semantic encoder to infer everything from audio.

A useful ablation would compare:

  1. VibeVoice hybrid feedback as published.
  2. Acoustic-only generation plus explicit script progress and speaker-turn state.
  3. Acoustic-only generation plus oracle script progress on the eval set.
  4. Hybrid feedback with delayed, dropped, or shuffled semantic feedback.
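The perturbations in the fourth variant are simple to express on a per-step feedback sequence. The helpers below are a hypothetical sketch: they assume the semantic feedback can be treated as a list with one entry per generation step, and that `None` can stand in for "no feedback at this step".

```python
import random

def delay_feedback(feedback, k, pad=None):
    """Shift feedback k steps later; the first k steps see `pad` instead."""
    if k <= 0:
        return list(feedback)
    n = len(feedback)
    return [pad] * min(k, n) + list(feedback[: max(n - k, 0)])

def drop_feedback(feedback, p, rng, pad=None):
    """Independently replace each step's feedback with `pad` with probability p."""
    return [pad if rng.random() < p else f for f in feedback]

def shuffle_feedback(feedback, rng):
    """Keep the same feedback values but break their alignment with time."""
    out = list(feedback)
    rng.shuffle(out)
    return out

rng = random.Random(0)
steps = ["s0", "s1", "s2", "s3"]
print(delay_feedback(steps, 1))
print(drop_feedback(steps, 0.5, rng))
print(shuffle_feedback(steps, rng))
```

If quality barely changes under shuffling, the feedback is probably not carrying much time-aligned state. If it collapses under a one-step delay, the loop depends on very fresh observations.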

The point is not that the paper is wrong. The point is that the semantic encoder has two possible roles. It may be capturing rich generated-audio state. It may also be compensating for weak script-progress tracking. Those are different mechanisms, and a builder would debug them differently.

What The Diffusion Head Does

Once the loop is clear, the diffusion part is less mysterious.

The LLM produces a hidden state for the next step. The diffusion head uses that hidden state as conditioning. During inference, it starts from noise and denoises toward the next acoustic latent.

LLM hidden state + noise -> denoising steps -> acoustic latent

Then the acoustic decoder turns that latent into waveform.
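Here is a toy version of that denoising loop, under loud assumptions. It uses a deterministic DDIM-style update rather than whatever sampler the paper uses, a linear noise schedule invented for the example, and an oracle `toy_noise_predictor` that already knows the target (`tanh` of the hidden state) instead of a trained network. It only shows the mechanics: start from noise, repeatedly estimate the noise given the condition, and step toward a clean latent.

```python
import math
import random

def alpha_bar(t, num_steps, floor=0.01):
    """Toy cumulative signal level: 1.0 at t=0, `floor` at t=num_steps."""
    return 1.0 - (1.0 - floor) * t / num_steps

def toy_noise_predictor(x_t, t, hidden, num_steps):
    """Oracle stand-in for a trained head: it knows the target latent and
    returns the noise that would explain x_t at noise level t."""
    ab = alpha_bar(t, num_steps)
    target = [math.tanh(h) for h in hidden]  # hypothetical conditioning rule
    return [(x - math.sqrt(ab) * m) / math.sqrt(1.0 - ab) for x, m in zip(x_t, target)]

def sample_latent(hidden, num_steps=10, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in hidden]  # start from noise
    for t in range(num_steps, 0, -1):
        ab_t, ab_prev = alpha_bar(t, num_steps), alpha_bar(t - 1, num_steps)
        eps = toy_noise_predictor(x, t, hidden, num_steps)
        # Estimate the clean latent, then re-noise it to the previous level.
        x0 = [(xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t) for xi, e in zip(x, eps)]
        x = [math.sqrt(ab_prev) * x0i + math.sqrt(1.0 - ab_prev) * e for x0i, e in zip(x0, eps)]
    return x

hidden = [0.3, -0.7, 1.2, 0.0]
latent = sample_latent(hidden, num_steps=10)
print(latent)
```

Because the predictor is an oracle, this sampler lands on the target regardless of step count. With a learned predictor, the number of steps and the guidance strength change the result, which is what the later section on denoising is about.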

This is why "next-token diffusion" is a good name. The model is doing next-token prediction in a causal loop, but the token is a continuous acoustic latent and the prediction head is a small diffusion model.

The low frame rate is what makes this plausible. A 10-step diffusion head would be much more expensive if it ran at a high codec frame rate. At 7.5 Hz, the paper reports real-time factor below 1.0 for the tested configurations on a single NVIDIA A6000, including the 7B model with 10 denoising steps at RTF 0.97.
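A rough count shows why the frame rate matters. The sketch below assumes one diffusion-head evaluation per denoising step per frame, and uses a 50 Hz codec as a hypothetical comparison point. It also spells out the usual reading of real-time factor: compute time divided by audio duration.

```python
def head_calls_per_audio_second(frame_rate_hz, denoise_steps):
    """Diffusion-head evaluations needed for one second of audio,
    assuming one evaluation per step per frame."""
    return frame_rate_hz * denoise_steps

def compute_seconds(audio_seconds, rtf):
    """Wall-clock generation time implied by a real-time factor."""
    return audio_seconds * rtf

print(head_calls_per_audio_second(7.5, 10))   # VibeVoice-style frame rate
print(head_calls_per_audio_second(50.0, 10))  # hypothetical 50 Hz codec
print(compute_seconds(60.0, 0.97))            # one minute of audio at RTF 0.97
```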

So diffusion is not magic by itself. It works here because it is paired with a very compact latent stream and a causal model that carries the long-range state.

When Cleaner Audio Is Less Similar

There is another easy trap in VibeVoice: thinking that more denoising must mean better speech.

A voice prompt contains at least two things.

speaker identity
recording atmosphere

Speaker identity is the voice we usually care about. Recording atmosphere is everything around it: room tone, reverb, microphone color, compression, and background texture.

For some applications, atmosphere is noise. For podcast-style generation, it may be part of what makes the generated speech sound like it belongs with the prompt.

The VibeVoice paper reports that WER and SIM-O prefer different regions of the CFG-scale and denoising-step sweep. Figure 6 gives the intuition. With more denoising steps, the sample becomes cleaner, but it can also lose the prompt's environmental texture.

[Figure: VibeVoice denoising spectrogram comparison]

Source: cropped from the VibeVoice paper, Figure 6, p. 22. The spectrograms compare the original voice prompt with samples generated using 10 and 50 diffusion steps. The 50-step sample removes more environmental texture from the prompt.

This changes how we should interpret speaker similarity.

If SIM-O drops after stronger denoising, maybe the model lost the speaker. But maybe it removed the room, microphone, or background texture that the metric was treating as part of the match.

A better evaluation would split these apart:

  • vocal identity after channel normalization
  • channel/style match against the prompt
  • prosody and rhythm similarity
  • cleanliness or perceived audio quality
  • local speaker drift over time

The small experiment is straightforward. Use the same speaker with clean, noisy, and reverberant prompts. Sweep diffusion steps and CFG scale. Compute speaker similarity before and after channel normalization. Then ask human raters to score identity, cleanliness, and prompt-channel match separately.

If the metric changes after normalization, speaker identity was only part of the story. The sampler was also changing the recording match.
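The bookkeeping for that experiment could look like the sketch below. The grid values are placeholders (10 and 50 steps echo Figure 6; the 20-step setting and the CFG scales are invented for illustration), and the speaker embeddings are assumed to come from an external speaker-verification model before and after some channel-normalization step that is not specified here.

```python
import itertools
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def sweep_grid(prompt_conditions, step_counts, cfg_scales):
    """One run config per (prompt condition, denoising steps, CFG scale)."""
    return [
        {"prompt": p, "steps": s, "cfg": c}
        for p, s, c in itertools.product(prompt_conditions, step_counts, cfg_scales)
    ]

def normalization_shift(sim_raw, sim_normalized):
    """How much the similarity score moves once channel effects are normalized."""
    return sim_normalized - sim_raw

grid = sweep_grid(["clean", "noisy", "reverberant"], [10, 20, 50], [1.0, 1.5, 3.0])
print(len(grid), grid[0])
```

A large `normalization_shift` that varies with `steps` would point at the sampler changing the recording match rather than the voice.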

What The Results Support

Treat the results as an architecture check, not as a universal leaderboard claim. In the reported setting, a low-rate continuous acoustic stream, semantic feedback, and a local diffusion head are enough to support long-form, multi-speaker podcast-style speech.

The headline numbers fit that story. Table 2 reports VibeVoice-7B at WER-W 0.66 and SIM-O 0.75 on the 0-12 minute VIBEVOICE-Eval subset, and WER-W 1.24 and SIM-O 0.75 on the 12-30 minute subset. Table 5 shows why the hybrid representation matters: acoustic-only keeps speaker similarity but degrades multi-speaker WER, while the coupled tokenizer improves content at a large speaker-similarity cost.

There are still important limits.

The training data and VIBEVOICE-Eval are internal. The paper describes the annotation pipeline, but the corpus and benchmark cannot be fully reproduced from the public PDF alone.

The scale comparison is also not perfectly clean. The 1.5B model trains through the 64K context phase. The 7B model skips that final phase because of resource limits. Parameter count, context length, data, and runtime budget are tangled together.

The duration claim should be handled carefully. The arXiv paper discusses up to 90 minutes in a 64K-context setting. The Microsoft Research publication page, checked May 23, 2026, phrases the capability as up to 30 minutes. The densest public evaluation table is for 1-30 minute samples.

Public availability also changed after release. The GitHub README, checked May 23, 2026, says the original VibeVoice-TTS code was removed after release and lists the TTS 1.5B quick try as disabled. That does not change the architecture, but it matters for reproducibility.

Technical Takeaways

1. Semantic pressure placement is the real design axis

VibeVoice does not prove that separate acoustic and semantic tokenizers are always better. It shows that one shared 7.5 Hz latent trained for both waveform reconstruction and ASR lost too much speaker identity.

The broader design question is where semantic pressure should enter: tokenizer training, generated target, side-channel feedback, or explicit script/turn state. The concrete diagnostic is a fixed-cadence semantic-pressure sweep that measures long-form drift, not just WER/SIM averages.

2. The semantic encoder is a closed-loop observer

VibeVoice's semantic encoder encodes generated audio and feeds future generation. That makes it an observer inside the loop. If it observes correctly, it can stabilize the script. If it observes incorrectly, it can amplify its own errors.

The next diagnostic is an observer-residual trace: intended text, external ASR, semantic-feedback proxy, speaker assignment, and next-step local WER over time. That trace would help separate rich audio-state feedback from simple script-progress tracking.

3. Speaker similarity mixes identity with prompt atmosphere

The denoising ablation is a metric warning. More denoising can make speech cleaner while removing room tone, reverb, mic color, and other prompt-channel cues. If SIM-O drops, the model may not have lost the speaker; it may have lost the recording condition that the metric treated as part of the match.

Evaluate identity, channel/style match, prosody, cleanliness, and local drift separately. The paper's CFG/DDPM ablation and Figure 6 support this as a hypothesis, but the channel-normalized evaluation is still an experiment to run.

The Shape Of The Idea

VibeVoice is a useful example of a broader pattern: autoregressive models do not have to emit only discrete tokens.

If the representation is discrete, a softmax head is natural. If the representation is continuous, a generative head becomes attractive. VibeVoice uses a causal LLM to carry long-range state and a diffusion head to sample the next continuous acoustic latent.

The most durable lesson is about division of labor. Long-form TTS needs compact representations, but compact representations force hard choices about what each channel carries. VibeVoice gives acoustic detail, semantic feedback, dialogue history, and continuous sampling separate places in the system.

Whether those are the best places is still open. But it is exactly the right question to ask.
