MOSS-TTS: Modeling The RVQ Stack
MOSS-TTS turns each speech frame into an ordered RVQ stack. The interesting question is how much compute and memory each residual depth band deserves.
A normal text model has an easy output shape.
history -> Transformer -> next token
Text-to-speech can look similar if the audio has already been turned into tokens. But the word "token" hides a lot of engineering.
In MOSS-TTS, one 80 ms slice of 24 kHz audio is not represented by one token. It is represented by a stack of residual vector-quantizer codes. The tokenizer can emit up to 32 RVQ layers at 12.5 frames per second. Each layer refines what the earlier layers did not capture.
So the generator's next step is not really:
history -> next audio token
It is closer to:
history -> next frame -> 32 ordered residual decisions
That is the part of MOSS-TTS worth slowing down for. The report's headline recipe is simple: discrete audio tokens, autoregressive modeling, and large-scale pretraining. The interesting question is where the model pays for the token stack.
MOSS-TTS gives two answers.
MossTTSDelay:
shift RVQ streams across time
predict all channels from one backbone state
optimize for simpler long-context serving
MossTTSLocal:
keep the frame aligned
run a small local autoregressive loop over the RVQ stack
optimize for stronger speaker preservation
The surprising result is that the smaller 1.7B Local Transformer beats the larger 8B delay-pattern model on most directly comparable voice-cloning metrics, especially speaker similarity. That does not prove that residual-depth causality is the only reason. It does show that the shape of the audio token block is not a minor implementation detail.
The Tokenizer Is The Interface
MOSS-TTS is built on MOSS-Audio-Tokenizer. It compresses 24 kHz audio to 12.5 frames per second. One frame covers 80 ms of waveform.
Each frame can have up to 32 RVQ layers. Each layer uses a 1024-entry codebook, so each codebook index carries 10 bits. At 12.5 frames per second, one RVQ layer contributes:
12.5 frames/s * 10 bits = 125 bits/s
That makes the bitrate ladder easy to reason about.
| RVQ layers | Bitrate |
|---|---|
| 1 | 0.125 kbps |
| 8 | 1.0 kbps |
| 16 | 2.0 kbps |
| 32 | 4.0 kbps |
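The ladder above follows directly from the frame rate and codebook size. A minimal sketch of that arithmetic, where the function name `bitrate_bps` is made up for illustration:

```python
import math

FRAME_RATE_HZ = 12.5   # tokenizer frame rate stated in the text
CODEBOOK_SIZE = 1024   # entries per RVQ codebook stated in the text


def bitrate_bps(num_layers: int) -> float:
    """Bits per second when the first `num_layers` RVQ layers are kept."""
    bits_per_code = math.log2(CODEBOOK_SIZE)  # 10 bits for 1024 entries
    return FRAME_RATE_HZ * bits_per_code * num_layers


for layers in (1, 8, 16, 32):
    print(f"{layers:>2} layers -> {bitrate_bps(layers) / 1000:.3f} kbps")
```

Each extra layer adds the same 125 bits/s, so the ladder is linear in depth.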
The low frame rate matters for long context. An audio-only 64k-token window at 12.5 Hz covers about 85 minutes before adding text, prompt audio, speaker markers, and implementation overhead.
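As a rough check on that figure, the sketch below converts a context window into minutes of audio. It assumes one sequence position per audio frame, and it treats "64k" as either 64,000 or 65,536 positions because the text does not say which. The `reserved_tokens` parameter is a hypothetical stand-in for text, prompt audio, and markers.

```python
FRAME_RATE_HZ = 12.5  # one sequence position per frame (assumption for this sketch)


def audio_minutes(window_tokens: int, reserved_tokens: int = 0) -> float:
    """Minutes of audio that fit in a window after reserving some positions."""
    if reserved_tokens > window_tokens:
        raise ValueError("reserved_tokens cannot exceed window_tokens")
    return (window_tokens - reserved_tokens) / FRAME_RATE_HZ / 60.0


print(round(audio_minutes(64_000), 1))   # decimal reading of "64k"
print(round(audio_minutes(65_536), 1))   # binary reading of "64k"
```

Either reading lands in the mid-to-high 80s of minutes, so the "about 85 minutes" figure holds as an upper bound before overhead.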
But the tokenizer is not a tiny codec bolted onto a language model. The report describes a 1.6B-parameter causal Transformer tokenizer, with a causal encoder, RVQ, causal decoder, adversarial discriminators, and a 0.5B decoder-only LLM semantic head. The semantic head predicts text from quantizer outputs for ASR, multi-speaker ASR, and audio captioning. The acoustic side trains with multi-scale mel reconstruction, commitment and codebook losses, adversarial losses, and feature matching.

Source: cropped from MOSS-TTS Technical Report, Figure 1, p. 5. The diagram shows the causal Transformer encoder, RVQ-32 quantizer, causal decoder, semantic LLM head, and discriminator losses used to train MOSS-Audio-Tokenizer.
This changes the mental model. MOSS-TTS is not just asking a Transformer to learn speech from scratch. It is asking the generator to speak through a very specific interface:
low frame rate
many residual layers
semantic pressure inside the tokenizer
autoregressive generation over the resulting token streams
That interface decides what the generator has to model.
One Speech Frame Is A Stack
Residual vector quantization is ordered.
The first codebook approximates the signal. The second codebook quantizes what is left after the first approximation. The third codebook works on the next residual, and so on.
audio frame
-> codebook 1: coarse approximation
-> codebook 2: residual after codebook 1
-> codebook 3: residual after codebook 2
...
-> codebook 32: fine residual detail
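The chain above can be sketched as a toy residual quantizer. This is not the MOSS-Audio-Tokenizer. The random codebooks, their shrinking scale, and the zero entry in each codebook are assumptions chosen so the example stays small and the error never grows with depth.

```python
import random


def nearest(codebook, vec):
    """Index of the codebook entry closest to `vec` (squared Euclidean distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize layer by layer; each layer codes the residual left so far."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def rvq_decode(codes, codebooks, dim):
    """Sum the selected entries; a prefix of `codes` gives a coarser result."""
    out = [0.0] * dim
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def make_toy_codebooks(num_layers=4, size=8, dim=2, seed=0):
    """Hypothetical random codebooks with shrinking scale plus a zero entry."""
    rng = random.Random(seed)
    books = []
    for layer in range(num_layers):
        scale = 0.5 ** layer
        entries = [[rng.uniform(-scale, scale) for _ in range(dim)]
                   for _ in range(size - 1)]
        books.append([[0.0] * dim] + entries)
    return books


books = make_toy_codebooks()
codes, residual = rvq_encode([0.3, -0.7], books)
print(codes, rvq_decode(codes[:2], books, 2), rvq_decode(codes, books, 2))
```

Decoding a prefix of the codes is what truncating the stack to a lower bitrate means: the later indices only make sense relative to the earlier ones.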
Those codebooks are not exchangeable labels. Later layers depend on the earlier layers in the tokenizer's construction, even if a generator can choose to model them with parallel heads.
In speech, that choice is especially important. Coarse layers can carry much of the content and broad acoustic shape. Later layers can refine timbre, speaker texture, channel color, and other details that speaker-similarity metrics may reward.
So MOSS-TTS has a core factorization problem:
Should the model predict all residual layers in parallel?
Should it generate them in order inside each frame?
Should it use a hybrid?
The paper tests two practical operating points.

Source: cropped from MOSS-TTS Technical Report, Figure 2, p. 8. The left panel shows the delay-pattern architecture. The right panel shows the local transformer architecture.
Delay Turns Depth Into A Pipeline
The delay-pattern model is the simpler long-context design.
The tokenizer produces an audio token matrix:
RVQ layer 1: a_1,1 a_1,2 a_1,3 ...
RVQ layer 2: a_2,1 a_2,2 a_2,3 ...
RVQ layer 3: a_3,1 a_3,2 a_3,3 ...
...
RVQ layer 32: a_32,1 a_32,2 a_32,3 ...
One brute-force autoregressive model could flatten this into T * 32 audio positions. That would make the audio sequence 32 times longer.
MossTTSDelay uses a delay pattern instead. RVQ layer j is shifted forward by j - 1 frames.
layer 1:  frame1 frame2 frame3 frame4 ...
layer 2:  pad    frame1 frame2 frame3 ...
layer 3:  pad    pad    frame1 frame2 ...
...
layer 32: (31 pads) frame1 ...
At each delayed time step, the model sums codebook-specific embeddings into one backbone input vector. The Transformer produces one hidden state. Output heads predict the text-or-pad channel and all 32 audio channels.
This keeps the audio sequence length at roughly:
T + 31
instead of:
T * 32
That is why the delay pattern is attractive for long outputs and optimized serving. It turns residual depth into a diagonal pipeline.
But one delay step is not one completed audio frame. Frame 1 becomes complete only after all delayed RVQ layers for that frame have emerged. At 12.5 frames per second, 31 frame offsets are about 2.48 seconds of token time.
That does not mean the delay model cannot condition later residual layers on earlier ones at all. The lower layers eventually enter delayed history. The point is narrower: current-frame residual structure is mediated through time shifts and a summed backbone state, rather than exposed as an explicit within-frame chain.
This is the tradeoff:
shorter sequence and simpler head path
vs.
less direct modeling of the residual stack inside the current frame
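The delay layout described in this section can be sketched as a pure index shift. The `PAD` placeholder and the function names are assumptions for illustration; the report's actual padding token and tensor layout are not reproduced here.

```python
PAD = None  # placeholder for empty positions; the real pad token is not specified here


def apply_delay(codes):
    """Shift layer j (0-based) right by j steps.

    `codes[j][t]` is the code of RVQ layer j at frame t.
    Every output row has T + L - 1 steps.
    """
    num_layers = len(codes)
    num_frames = len(codes[0])
    total_steps = num_frames + num_layers - 1
    delayed = []
    for j, layer in enumerate(codes):
        tail = total_steps - j - num_frames
        delayed.append([PAD] * j + list(layer) + [PAD] * tail)
    return delayed


def undo_delay(delayed, num_frames):
    """Recover the frame-aligned matrix from the delayed layout."""
    return [row[j:j + num_frames] for j, row in enumerate(delayed)]


def first_complete_step(frame_index, num_layers):
    """0-based delayed step at which every layer of a frame has been emitted."""
    return frame_index + num_layers - 1


codes = [[f"a{j + 1},{t + 1}" for t in range(5)] for j in range(32)]
delayed = apply_delay(codes)
print(len(delayed[0]))                       # T + 31 steps for T = 5
print(first_complete_step(0, 32) / 12.5)     # seconds of token time before frame 1 is whole
```

The shift is lossless, so `undo_delay` recovers the aligned matrix exactly. The cost shows up only as the wait before the first frame is decodable.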
Local Makes The Current Stack Explicit
MossTTSLocal makes a different bet.
It keeps the frame aligned. The temporal backbone produces one global latent for the next aligned step. Then a lightweight Local Transformer expands that latent into the full token block:
backbone state
-> text-or-pad channel
-> RVQ layer 1
-> RVQ layer 2, conditioned on layer 1
-> RVQ layer 3, conditioned on layers 1-2
...
-> RVQ layer 32, conditioned on layers 1-31
The training objective is the important part:
p(y_j,t | text, previous frames, previous channels in the current frame)
This matches the natural residual order of RVQ more closely than a set of parallel heads. The model sees the lower-depth decisions before producing the higher-depth decisions for the same frame.
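The conditioning order above can be sketched as a per-frame loop. The `predict_channel` callable is hypothetical and stands in for the Local Transformer plus sampling. The toy predictor is only a deterministic placeholder, not a model.

```python
import random

NUM_CHANNELS = 33  # one text-or-pad channel plus 32 RVQ layers


def generate_block(backbone_state, predict_channel):
    """Emit one aligned block channel by channel.

    `predict_channel(backbone_state, channel_index, previous)` sees only the
    channels already produced for the current frame.
    """
    block = []
    for channel_index in range(NUM_CHANNELS):
        token = predict_channel(backbone_state, channel_index, tuple(block))
        block.append(token)
    return block


def toy_predictor(backbone_state, channel_index, previous):
    """Stand-in predictor: a deterministic function of its inputs."""
    rng = random.Random(f"{backbone_state}|{channel_index}|{previous}")
    return rng.randrange(1024)


block = generate_block("h_t", toy_predictor)
print(len(block), block[:4])
```

The key property is that channel k is produced after channels 0..k-1 of the same frame are fixed, which a set of parallel heads cannot guarantee.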
The cost is a small autoregressive loop inside each frame. The local module has to emit 33 channels per aligned step. That is more complex than projecting one backbone state through many heads.
It also does not fix every compression boundary. Both Delay and Local still fold past RVQ channels back into a summed embedding before the temporal backbone processes them. Local exposes the current output stack. The long-range history is still compressed into one vector per frame.
That distinction matters because it keeps the interpretation honest. Local is not "full uncompressed RVQ history." It is "explicit residual-depth generation for the current frame."
What The Architecture Comparison Supports
The strongest evidence for the local design is not a standalone score. It is the scale inversion.
On Seed-TTS-eval, the report compares the 8B delay model and the 1.7B Local Transformer in Clone and Continuation modes. Local wins all four speaker-similarity cells. In Clone mode, it also has lower English WER and Chinese CER. In Continuation mode, the larger delay model has slightly better WER/CER, but Local still has higher speaker similarity.
The most striking table is the CV3-Eval multilingual voice cloning subset. Across the listed languages and both Clone and Continuation modes, the Local Transformer has lower CER/WER than the delay model in every directly comparable cell: Chinese, English, Japanese, Korean, German, Spanish, French, Italian, and Russian.
Here is the compact interpretation:
| Result pattern | What it suggests |
|---|---|
| 1.7B Local beats 8B Delay on speaker similarity | The output factorization matters for timbre preservation. |
| Local wins multilingual cloning error cells | The gain is not limited to one language pair. |
| Delay is used for duration, pronunciation, and ultra-long evaluations | The paper does not prove Local is better for every capability. |
| Local has extra local capacity and no delay offset | Residual-depth causality is plausible, not fully isolated. |
The last row is important. The paper does not run the clean ablation that would prove natural RVQ order is the active ingredient.
A useful reproduction would hold the tokenizer, temporal backbone, data, and loss weights fixed, then compare:
parallel heads
natural-order causal local head
permuted-order causal local head
same-parameter non-causal local head
delay pattern with boundary warmup
The measurements should include per-layer negative log likelihood, WER/CER, speaker similarity, time to first audio, real-time factor, and rolling speaker similarity from the beginning of each sample. If a same-parameter non-causal local head matches Local, the residual-order story weakens. If natural RVQ order wins after controlling for capacity and boundary effects, the design lesson becomes much stronger.
The supported claim is narrower:
MOSS-TTS provides strong evidence that the way a generator models RVQ depth
matters. The comparison still mixes residual order, local-head capacity,
delay-boundary effects, and prompt/metric interactions.
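Of the measurements proposed for that reproduction, per-layer negative log likelihood is the easiest to sketch. The input format below is an assumption: one list per RVQ layer, holding the probability the model assigned to the reference token at each frame.

```python
import math


def per_layer_nll(true_token_probs):
    """Mean negative log likelihood per RVQ layer, in nats.

    `true_token_probs[j]` holds the probabilities assigned to the reference
    token of layer j, one entry per frame.
    """
    return [
        -sum(math.log(p) for p in probs) / len(probs)
        for probs in true_token_probs
    ]


# A uniform guess over 1024 entries costs log(1024) nats per code.
uniform = [[1 / 1024] * 10]
print(per_layer_nll(uniform)[0], math.log(1024))
```

Plotting this curve by layer for each head variant would show whether a gain is concentrated in the early layers or in the late ones.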
The Comparison To VibeVoice And Voxtral
This is also a useful way to compare recent TTS systems without collapsing everything into "discrete vs continuous."
The shared question is:
What does one autoregressive step own?
| System | One step owns | Local generative work | Future context |
|---|---|---|---|
| VibeVoice | A continuous acoustic latent at 7.5 Hz | Diffusion head samples the next acoustic latent | Generated audio is re-encoded through a semantic feedback path |
| Voxtral TTS | One semantic VQ token at 12.5 Hz | Flow head predicts 36 acoustic FSQ values for the same frame | Semantic token plus quantized acoustic values |
| MOSS-TTS Delay | A delayed 33-channel RVQ/text block | Parallel heads from one backbone state | Delayed RVQ streams summed into future backbone inputs |
| MOSS-TTS Local | One aligned frame expanded into 33 channels | Local autoregressive loop over text/pad and RVQ layers | Past RVQ layers summed into future backbone inputs |
VibeVoice keeps the causal language-model loop but predicts continuous acoustic latents with a diffusion head. Voxtral predicts a semantic token autoregressively and handles acoustic detail through a flow-matching head over finite-scalar-quantized coordinates. MOSS keeps the target fully discrete and asks how to model an ordered RVQ stack.
So the real design variable is not simply:
discrete tokens vs diffusion
It is:
frame rate
AR target
local prediction head
write-back state
cacheable serving state
metric failure surface
The write-back boundary, meaning what each step feeds back into context, decides which errors accumulate. VibeVoice has a semantic observer inside the loop. Voxtral writes quantized semantic and acoustic frame tokens back into context. MOSS writes a full RVQ stack back through summed embeddings.
Those are different contracts.
The Bitrate Ladder Is A Probe
Variable bitrate is easy to treat as a codec feature. In MOSS-TTS, it is more interesting than that.
The tokenizer is trained with random quantizer dropout. The TTS generator also uses progressive sequence dropout, and its loss weights are not flat across the RVQ stack. The first three RVQ layers get weight 3. The next three get weight 2. The remaining layers get weight 1.
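The weight schedule is small enough to write down. In the sketch below, the 3/2/1 weights come from the text, while combining the layers as a weighted mean is an assumption. The text does not say how the report normalizes the sum or weights the text channel.

```python
NUM_RVQ_LAYERS = 32


def layer_weight(layer: int) -> float:
    """Loss weight for a 1-based RVQ layer index, following the text."""
    if not 1 <= layer <= NUM_RVQ_LAYERS:
        raise ValueError("layer must be in 1..32")
    if layer <= 3:
        return 3.0
    if layer <= 6:
        return 2.0
    return 1.0


def weighted_audio_loss(per_layer_losses):
    """Weighted mean over the 32 RVQ layers (the normalization is an assumption)."""
    if len(per_layer_losses) != NUM_RVQ_LAYERS:
        raise ValueError("expected one loss per RVQ layer")
    weights = [layer_weight(j) for j in range(1, NUM_RVQ_LAYERS + 1)]
    total = sum(w * loss for w, loss in zip(weights, per_layer_losses))
    return total / sum(weights)


weights = [layer_weight(j) for j in range(1, 33)]
print(sum(weights[:6]) / sum(weights))  # share of total weight held by layers 1..6
```

Under this schedule the first six layers hold 15 of 41 weight units, more than a third of the audio loss for under a fifth of the layers.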
That makes the early layers a contract. They are the part of the stack that the system most clearly asks to be reliable under truncation, dropout, and generation loss.
This matters because the tokenizer is also semantically supervised. A 0.5B decoder-only LLM predicts text from the quantizer output for ASR, multi-speaker ASR, and audio captioning. In the tokenizer objective, the semantic loss weight is 20, while the reconstruction loss weight is 15.
So MOSS is making a risky and useful bet:
make one RVQ acoustic stack text-readable
but leave enough residual depth for speaker, prosody, and channel detail
That is different from Voxtral, which gives semantics a dedicated VQ token and pushes acoustic detail into a separate flow-predicted FSQ bundle. It is also different from VibeVoice, where a coupled reconstruction/ASR latent hurt speaker similarity in the paper's ablation and the final system uses a separate semantic feedback path.
The right diagnostic is not just a bitrate-quality curve. It is a layer map.
For each prefix 1..K, measure:
ASR / text recoverability
speaker identity
channel or room match
prosody and rhythm
reconstruction quality
generated-token NLL by layer
Then probe residual shells:
layers 1..3
layers 4..6
layers 7..16
layers 17..32
If text saturates early but speaker identity keeps improving in later shells, then the early RVQ layers are acting as a semantic/acoustic scaffold and the later layers are an acoustic escape hatch. If ASR needs many layers, semantic pressure is spread across the whole stack. If speaker identity drops when only late layers are corrupted, the model should spend more local modeling capacity or anchoring on those layers.
This is why the 0.125-4 kbps ladder is a useful scientific instrument. It can show what the representation is actually doing, not just how nice the reconstruction sounds.
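The prefix-and-shell map described above can be sketched as a small harness. The probes are hypothetical callables, for example an ASR error probe or a speaker-similarity probe. The prefix depths are chosen here only to line up with the shell boundaries.

```python
SHELLS = [(1, 3), (4, 6), (7, 16), (17, 32)]  # 1-based inclusive bands from the text


def shell_of(layer):
    """Return the (lo, hi) shell that contains a 1-based RVQ layer index."""
    for lo, hi in SHELLS:
        if lo <= layer <= hi:
            return (lo, hi)
    raise ValueError("layer must be in 1..32")


def prefix_map(frames, probes, prefixes=(1, 3, 6, 16, 32)):
    """Score every probe on every truncated stack.

    `frames` is a list of per-frame code lists (layer 1 first).
    `probes` maps a name to a hypothetical callable(frames) -> float.
    """
    results = {}
    for k in prefixes:
        truncated = [frame[:k] for frame in frames]
        results[k] = {name: probe(truncated) for name, probe in probes.items()}
    return results


frames = [list(range(32)) for _ in range(4)]
demo = prefix_map(frames, {"depth_seen": lambda fr: float(len(fr[0]))})
print(demo)
```

A real run would replace the demo probe with decode-then-measure functions and read the table row by row to see where each property saturates.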
The Data Pipeline Makes Controls Native
The architecture is only half of the recipe. MOSS-TTS also spends a lot of engineering on data.
The report starts from open-domain recordings: podcasts, audiobooks, broadcast/news, film/drama, commentary, and online content. Those are not directly useful as TTS supervision. They can contain multiple speakers, background music, noise, bad transcripts, missing transcripts, and segments that are too long, too short, or mismatched to text.
The pipeline turns that into training pairs through four broad stages, followed by targeted data synthesis.
raw audio
-> denoise, standardize, normalize
-> diarize and consolidate single-speaker segments
-> ASR, rule filters, LLM transcript cleanup
-> acoustic, language, and duration consistency filters
-> targeted data synthesis for controls
The filtering is strict in ways that matter for generation. A refined transcript must contain only one speaker tag. The audio language detected by Whisper large-v3 must match the language predicted from the transcript. Character rate has to fall inside a language-specific interval, which catches transcripts that are too short or too long for the audio.
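The character-rate check is the easiest of these filters to sketch. The bounds below are invented for illustration, and counting non-whitespace characters is an assumption. The text does not give the report's per-language intervals or its exact counting rule.

```python
def char_rate_ok(text: str, audio_seconds: float,
                 min_cps: float, max_cps: float) -> bool:
    """True when characters per second falls inside [min_cps, max_cps]."""
    if audio_seconds <= 0:
        return False
    chars = sum(1 for ch in text if not ch.isspace())
    rate = chars / audio_seconds
    return min_cps <= rate <= max_cps


# Hypothetical bounds for an English-like language.
print(char_rate_ok("hello world", 1.0, min_cps=5, max_cps=20))   # plausible pair
print(char_rate_ok("hi", 10.0, min_cps=5, max_cps=20))           # transcript too short
```

A pair that fails this check usually signals a truncated transcript, a long silence, or a segment boundary that cut the audio and text differently.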
Then the pipeline adds data that the natural corpus would not supply well.
For voice cloning, each target segment is paired with a same-speaker prompt crop. The prompt crop is selected from other segments by WavLM-Large speaker embedding similarity, with random crops capped at 30 seconds.
For noisy user input, text is augmented with punctuation noise, whitespace artifacts, punctuation dropout, and sparse dirty-character injection.
For pronunciation control, text is replaced partly or fully with pinyin for Chinese and IPA for English, while the audio remains unchanged.
For short inputs, the corpus is supplemented with dictionary-style single-word or single-character utterances.
Duration control is the cleanest example of control as serialization. Every training asset is serialized twice:
duration-conditioned variant:
tokens = target audio-token count
free-duration variant:
tokens = None
Because the tokenizer runs at 12.5 frames per second, duration can be expressed as a token count:
target seconds = target token count / 12.5
Table 5 reports overall absolute relative duration error of about 0.7% for both Chinese and English, without a separate duration-control fine-tuning stage.
That is a useful design lesson. The duration feature is not a late control module. It is a field in the pretraining format, repeated across the corpus until explicit and implicit duration become normal modes of the same model.
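The token-count view of duration reduces to a few lines of arithmetic. In this sketch the rounding rule and the exact form of the error metric are assumptions, and the report may compute its duration error differently.

```python
FRAME_RATE_HZ = 12.5  # tokenizer frame rate stated in the text


def tokens_for_seconds(seconds: float) -> int:
    """Target audio-token count for a requested duration (rounding is an assumption)."""
    return round(seconds * FRAME_RATE_HZ)


def seconds_for_tokens(tokens: int) -> float:
    """Duration implied by an audio-token count."""
    return tokens / FRAME_RATE_HZ


def abs_relative_duration_error(target_tokens: int, generated_tokens: int) -> float:
    """Absolute relative error between requested and generated length."""
    return abs(generated_tokens - target_tokens) / target_tokens


target = tokens_for_seconds(80.0)
print(target, seconds_for_tokens(target), abs_relative_duration_error(target, target + 7))
```

In this example a request for 80 seconds becomes 1000 tokens, and a 7-token overshoot is a 0.7% error, which is just over half a second of audio.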
The Long-Form Bottleneck Is Drift
MOSS-TTS reports an internal ultra-long evaluation set for Chinese and English. The setup covers six text-length buckets per language, 10 prompts per bucket, and both Clone and Continuation modes. The report uses MOSS-Transcribe-Diarize for CER/WER and averages speaker similarity over non-overlapping 3-second windows.
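The windowed similarity measure can be sketched as follows. The `embed` and `similarity` callables are hypothetical stand-ins for a speaker-embedding model and a similarity function. The sample rate default and the choice to drop a trailing partial window are assumptions, not details from the report.

```python
def windowed_similarity(samples, reference_embedding, embed, similarity,
                        sample_rate=24000, window_seconds=3.0):
    """Average similarity over non-overlapping windows, plus the per-window scores."""
    window = int(sample_rate * window_seconds)
    scores = []
    # Step by a full window so windows never overlap; a short tail is dropped.
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        scores.append(similarity(embed(chunk), reference_embedding))
    if not scores:
        raise ValueError("audio shorter than one window")
    return sum(scores) / len(scores), scores


# Toy stand-ins so the sketch runs without any audio model.
mean, per_window = windowed_similarity(
    list(range(10)), 0,
    embed=lambda chunk: sum(chunk),
    similarity=lambda a, b: a - b,
    sample_rate=1, window_seconds=3,
)
print(mean, per_window)
```

Keeping the per-window scores, not only the mean, is what makes a drift trajectory over elapsed time possible.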
The headline is that the system can operate over very long generations. The useful detail is where it weakens.

Source: cropped from MOSS-TTS Technical Report, Figure 6, p. 21. The plots show speaker-similarity trajectories over elapsed generation time for Chinese and English, in Clone and Continuation modes.
Chinese stays relatively usable in the long buckets. In the 10000+ bucket, Continuation reports CER 1.86 and SIM 63.0. Clone reports CER 3.41 and SIM 60.1.
English degrades harder at the longest horizon. In the 50000+ bucket, Clone reports WER 17.49 and SIM 44.4. Continuation reports WER 29.52 and SIM 51.2.
The report's own interpretation is that cumulative speaker drift becomes the dominant bottleneck, especially in English. The stronger MOSS-specific question is where that drift lives in the RVQ stack.
The stack gives several possible failure sites:
| Possible site | What might drift |
|---|---|
| Early RVQ layers | coarse phonetic and speaker scaffold |
| Middle layers | vocal identity, articulation, timbre |
| Late layers | channel texture, room tone, fine acoustic detail |
| Cross-layer relation | consistency between coarse and fine residuals |
That suggests a more useful long-form diagnostic than one waveform-level SIM curve. During long generation, decode RVQ prefixes over time and run the same ASR/SIM/prosody probes used for the bitrate ladder. Then do token transplants: keep generated lower layers and replace upper layers from a prompt-derived prototype, then invert the intervention.
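A token transplant of that kind can be sketched as a band swap between two aligned code matrices. How the donor frames are built from a prompt-derived prototype is left open here. The sketch only assumes the donor has one frame per generated frame.

```python
def transplant_band(generated, donor, lo, hi):
    """Replace layers lo..hi (1-based, inclusive) of each frame with donor codes.

    `generated` and `donor` are lists of per-frame code lists, layer 1 first.
    The inputs are not modified.
    """
    if len(generated) != len(donor):
        raise ValueError("generated and donor need the same number of frames")
    out = []
    for gen_frame, donor_frame in zip(generated, donor):
        frame = list(gen_frame)
        frame[lo - 1:hi] = donor_frame[lo - 1:hi]
        out.append(frame)
    return out


generated = [[("gen", j) for j in range(1, 33)] for _ in range(3)]
donor = [[("donor", j) for j in range(1, 33)] for _ in range(3)]

upper_swapped = transplant_band(generated, donor, 17, 32)  # keep generated lower layers
lower_swapped = transplant_band(generated, donor, 1, 16)   # the inverted intervention
print(upper_swapped[0][15], upper_swapped[0][16])
```

Decoding both variants and rerunning the ASR and speaker probes would show which band carries the drift.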
If replacing one depth band restores SIM without hurting WER, the fix is not "make the context longer." It is a layer-specific memory or anchoring problem.
This also suggests a serving and training policy: full residual detail may not need to stay in long-range memory forever. A model could keep all 32 layers for recent audio, but carry only the first K layers for older history, then use a local acoustic module to close the fine detail near the present.
That is a different interpretation of long context:
first question: how many tokens fit?
better question: which depth bands deserve memory at which time scales?
MOSS-TTS does not test that policy. Its variable-bitrate tokenizer makes the policy testable.
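A minimal sketch of that untested policy, assuming history is stored as per-frame code lists with the newest frame last. The parameter names `recent_frames` and `coarse_layers` are hypothetical. The sketch counts retained codes, which is not the same thing as KV-cache cost.

```python
def truncate_history(frames, recent_frames, coarse_layers):
    """Keep all layers for the newest frames and only a coarse prefix for older ones."""
    n = len(frames)
    return [
        list(frame) if n - 1 - i < recent_frames else list(frame[:coarse_layers])
        for i, frame in enumerate(frames)
    ]


def stored_codes(total_frames, recent_frames, coarse_layers, full_layers=32):
    """Number of codes retained under the policy."""
    recent = min(total_frames, recent_frames)
    old = total_frames - recent
    return recent * full_layers + old * coarse_layers


history = [list(range(32)) for _ in range(1000)]
kept = truncate_history(history, recent_frames=100, coarse_layers=8)
print(len(kept[0]), len(kept[-1]), stored_codes(1000, 100, 8))
```

Because both MOSS generators sum a frame's layers into one backbone input, dropping old layers would change what each history vector contains rather than how many positions the backbone sees.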
What The Results Prove
The MOSS-TTS report supports several concrete claims.
First, MOSS-Audio-Tokenizer is a strong low-rate speech tokenizer in the reported objective comparisons. At 1000 bps, it reports English/Chinese SIM 0.88/0.81, STOI 0.94/0.91, PESQ-NB 3.38/2.96, and PESQ-WB 2.87/2.43. At 2000 bps, it reports SIM 0.95/0.89 and PESQ-WB 3.41/2.96. At 4000 bps, it reports SIM 0.97/0.93 and PESQ-WB 3.69/3.30.
Second, the two generator architectures expose a real tradeoff. Delay is the simpler single-backbone long-context/control design. Local is smaller but stronger on the reported cloning metrics.
Third, the variable-bitrate stack is not just a compression knob. The report's dropout, early-layer loss weights, and semantic tokenizer objective make RVQ depth a natural place to probe where text, identity, and acoustic detail live.
Fourth, token-count duration control can be learned as part of pretraining when duration is made a normal serialized field. That is useful, but it is a supporting mechanism rather than the deepest lesson in the report.
What This Does Not Prove
The report is useful, but it is not a full reproduction package.
The TTS data is not public at paper scale. Several evaluation sets are internal. Some components, such as MOSS-Transcribe-Diarize and LLM-based transcript filtering, are not fully specified in the report.
Baseline comparisons are uneven because many numbers are copied from other technical reports. Prompt format, ASR backend, sample rate, speaker-similarity model, decoding settings, and model access can all differ.
The Local Transformer advantage is not fully isolated. Local changes residual ordering, local capacity, delay behavior, and time-to-first-audio behavior at once. Residual-depth modeling is an evidence-backed interpretation of the results, not an isolated causal result.
Speaker similarity also needs care. The MOSS pipeline selects voice-cloning prompt crops with WavLM-Large speaker-embedding similarity, and evaluation also uses embedding-based SIM. That can be the right metric for timbre preservation, but it can also reward prompt-channel artifacts or miss prosody and local drift. The safest evaluation splits identity, channel/style match, prosody, prompt overlap, and lexical correctness.
The cloning data construction has a related open question. Speaker labels are recording-local, and prompt crops are selected from other same-speaker segments inside the same recording. That may teach abstract speaker identity, but it may also teach same-session transfer: speaker, room, microphone, style, and channel all bundled together. A sharper evaluation would split same-speaker same-recording, same-speaker different-recording, same-recording different-speaker, and channel-augmented prompt pairs.
Finally, current public availability has moved beyond the March 2026 report. Checked May 25, 2026, the GitHub repository presents a broader MOSS-TTS family: MOSS-TTS, MOSS-TTSD, MOSS-VoiceGenerator, MOSS-SoundEffect, MOSS-TTS-Realtime, and MOSS-TTS-Nano. The Hugging Face collection lists the main 8B MOSS-TTS model, Local Transformer, Realtime, Nano 100M, tokenizer, GGUF, and ONNX-related artifacts. That is good for access, but the paper's core evidence still applies most directly to the MOSS-TTS and MOSS-TTS-Local-Transformer comparison it reports.
Technical Takeaways
1. The bitrate ladder is a layer microscope
MOSS's variable-bitrate RVQ stack should be used to locate information, not just to report reconstruction quality. The first six layers are privileged by dropout and loss weighting, while the tokenizer's audio-to-text semantic head pushes the quantized representation to be text-readable.
The useful experiment is a prefix-and-shell map: decode or probe layers 1..K, then measure ASR, speaker identity, channel match, prosody, reconstruction, and per-layer NLL. If content saturates in early layers while speaker and channel keep improving later, the stack has a semantic scaffold plus acoustic escape hatch. If content needs late layers, semantic pressure is spread across the whole residual ladder.
2. RVQ depth is a state budget, not just an output order
The Local Transformer result and the long-context question are two versions of the same design problem: how much state should each RVQ depth band get, and at what time scale?
Inside one frame, the temporal backbone should carry frame-level intent while a local output module resolves the 33-channel block where text/pad, coarse residuals, and fine residuals interact. Across a long generation, old history may not need all 32 layers forever. A model could keep full residual detail near the present, keep only coarse or semantic layers farther back, and let the local module close fine acoustic detail for the next frame.
That is sharper than "RVQ depth matters." The output-head test is to compare parallel heads, boundary-first heads, same-capacity non-causal heads, permuted-order local heads, and natural-order local heads under the same backbone. The memory test is history-depth dropout: keep 32 layers for recent audio, keep only the first K layers past a time horizon, and measure WER, SIM, drift, KV cost, and latency.
The Shape Of The Idea
MOSS-TTS is a useful counterpoint to newer continuous-head TTS systems. It does not need a diffusion or flow head to make speech generation scale. It leans hard into the discrete-token language-model recipe.
But "discrete tokens" is not the whole explanation. The tokenizer creates a structured object: low-rate frames with ordered residual depth. Once that object exists, the generator has to decide how to carry it through time, how to complete one frame, how to expose it to future context, and how to serve it without waiting too long for a decodable frame.
That is why the 1.7B Local Transformer result matters. It points at a design variable that is easy to miss: the right state budget across the current frame and the long-range history can compete with more backbone scale when the representation has strong internal structure.
The practical implication is to debug MOSS-like systems by depth instead of whole-waveform metrics alone. Keep the tokenizer fixed, control output-head capacity, shuffle the RVQ order, plot per-layer loss, and track speaker and lexical drift by depth over time. Those checks separate output-head capacity from the stronger claim that residual order itself is carrying the gain.
MOSS-TTS makes the audio tokenizer visible as part of the generator's state contract. The RVQ stack is not a passive compression artifact. It is the object the model has to allocate compute and memory around.
Sources
- MOSS-TTS Technical Report: arXiv:2603.18090.
- MOSS-TTS GitHub repository: OpenMOSS/MOSS-TTS, checked May 25, 2026.
- MOSS-TTS Hugging Face collection: OpenMOSS-Team/moss-tts, checked May 25, 2026.
- MOSS-TTS Hugging Face Space: OpenMOSS-Team/MOSS-TTS, checked May 25, 2026.
- MOSS-Audio-Tokenizer Hugging Face model card: OpenMOSS-Team/MOSS-Audio-Tokenizer, checked May 25, 2026.
- Related Diffio post: VibeVoice TTS: Next-Token Diffusion.