Diffio AI Blog

TTS Models: Sampled Acoustic Representations Set the Generator Constraints

A map of modern text-to-speech systems by the representation their generators sample: codec tokens, mel frames, semantic-acoustic cascades, continuous latents, and hybrid frames.

Latent Audio Encoders: Reconstruction Sets the Ceiling, Not the Ranking

A continuous-latent audio autoencoder compresses waveform into a sequence of real-valued vectors that a downstream generator, a flow-matching DiT or an autoregressive LM with a per-token diffusion head, learns to predict from text or context; the decoder then turns predicted latents back into audio. That deployment

SAME Makes Generation Difficulty an Autoencoder Training Loss

SAME (Semantically-Aligned Music autoEncoder) is Stability AI's autoencoder for 44.1 kHz stereo music and general audio; it defines the frozen latent space in which Stable Audio 3, the company's text-to-audio generator family, generates. It compresses the waveform 4096x along the time axis into

Flow Matching: Regress the Field the Sampler Integrates

Flow matching trains a continuous normalizing flow: a generative model that parameterizes a time-dependent vector field v_theta(x, t) with a neural network and produces samples by drawing noise x_0 from N(0, I) and integrating the ODE dx/dt = v_theta(x, t) from t = 0

Mega-ASR: Group-Relative RL Needs a Reward That Ranks Failed Transcripts

Mega-ASR is a robustness recipe for Qwen3-ASR-1.7B, an audio-LLM ASR system in which an acoustic encoder feeds an aligner (an adapter that maps encoder outputs into the LLM's embedding space) and the LLM decodes the transcript. The recipe has four parts: Voices-in-

GRPO and DAPO: Within-Group Reward Dispersion Gates the Learning Signal

Group Relative Policy Optimization (GRPO) is the RL algorithm introduced in DeepSeek's DeepSeekMath paper and now the default recipe for post-training LLMs against verifiable rewards. It keeps PPO's clipped policy-gradient surrogate and deletes the learned value-function critic; nothing replaces the critic except the

MOSS-TTS Shows Local RVQ Conditioning Can Beat Delay Modeling

MOSS-TTS is a fully discrete text-to-speech system: text and prompt audio are serialized into language-model inputs, the model predicts audio codec tokens autoregressively, and a neural codec decoder turns those tokens back into waveform. In the Diffio map of modern TTS systems, that places it in

VibeVoice: Frame Rate Is the Context Budget for Long-Form TTS

VibeVoice is Microsoft Research's open-weights system for zero-shot, multi-speaker, long-form speech synthesis: give it a short voice sample per speaker and a script, and a Qwen2.5 backbone generates podcast-style conversation, turn-taking and breaths included, in one streaming left-to-right pass.

Exposure Bias: Corrupt the Context, Keep the Target Correct

Exposure bias is the train/inference mismatch built into almost every autoregressive generative model. Training by teacher forcing evaluates each next-step prediction against a ground-truth history; inference feeds the model its own sampled outputs as history. Every per-step error therefore lands somewhere the training loss never measured,

Classifier-Free Guidance: The Guidance Scale Is an Exponent on an Implicit Classifier

Classifier-free guidance (CFG) trains a conditional generative model with condition dropout, so one network can predict both with and without the conditioning signal. At sampling time the two predictions are combined by extrapolating along their difference, scaled by a guidance weight. Nearly every modern conditional generator exposes that weight:
