Audio - Diffio AI Blog

Audio

Latent Audio Encoders: Reconstruction Sets the Ceiling, Not the Ranking

A continuous-latent audio autoencoder compresses a waveform into a sequence of real-valued vectors that a downstream generator (a flow-matching DiT, or an autoregressive LM with a per-token diffusion head) learns to predict from text or context; the decoder then turns the predicted latents back into audio. That deployment

Audio

SAME Makes Generation Difficulty an Autoencoder Training Loss

SAME (Semantically-Aligned Music autoEncoder) is Stability AI's autoencoder for 44.1 kHz stereo music and general audio; its frozen latent space is the one in which Stable Audio 3, the company's text-to-audio generator family, generates. It compresses the waveform 4096x along the time axis into

Audio

Mega-ASR: Group-Relative RL Needs a Reward That Ranks Failed Transcripts

Mega-ASR is a robustness recipe for Qwen3-ASR-1.7B, an audio-LLM ASR system in which an acoustic encoder feeds an aligner (an adapter that maps encoder outputs into the LLM's embedding space) and the LLM decodes the transcript. The recipe has four parts: Voices-in-