Why SAME Makes Audio Slower Before It Makes It Smaller
A minute of 44.1 kHz stereo audio is 5,292,000 sample values.
SAME hands Stable Audio 3 about 646 vectors instead.
Not text tokens. Not MIDI. Not a spectrogram. Continuous 256-dimensional audio latents, ticking at about 10.76 frames per second.
one-minute stereo waveform: [2, 2646000]
one-minute SAME latent: [256, 646]
That sounds like a magic trick, but it is more specific than "SAME compresses audio by 4096x." The first thing SAME does is make audio slow. It trades a fast sample clock for a slow latent clock. Only then does it ask whether each slower frame can carry enough information for a decoder and, more importantly, for a diffusion model that has to predict those frames.
This post is about that trick, and about the two loops the resulting latent has to survive. The first is the plain round trip:
audio -> SAME encoder -> latent -> SAME decoder -> audio
That is the normal autoencoder test. It is necessary, but it is not the test Stable Audio 3 really cares about. The one that matters is generation:
text, duration, mask -> diffusion transformer -> predicted SAME latent
predicted SAME latent -> SAME decoder -> audio
The second loop is nastier. The decoder no longer sees perfect latents produced by its own encoder. It sees latents hallucinated by another model. The SAME paper is interesting because it does not treat that as an afterthought. It builds the autoencoder around a latent space that is short enough to model and structured enough to predict.
Start With The Clock
The source waveform runs at 44,100 samples per second per channel. SAME wants a much slower object.
It gets there in two steps:
256-sample patching * TRB stride 16 = 4096x temporal downsampling
At 44.1 kHz:
44,100 / 4096 = 10.7666 latent frames per second
So one SAME frame covers about 92.88 ms of stereo audio.
The post fixture checks the accounting:
uv run python blog/assets/same-audio-encoder/fixtures/rate_width_tradeoff_fixture.py
It prints:
SAME 4096x/256: 10.7666 Hz, 92.88 ms/frame, 646 frames/min, 2756.25 scalars/s
comparison 1024x/64: 43.0664 Hz, 23.22 ms/frame, 2584 frames/min, 2756.25 scalars/s
sequence_length_reduction_vs_1024x64: 4.00x
scalar_rate_ratio_vs_1024x64: 1.00x
That comparison is not claiming that every 1024x/64 autoencoder is equivalent. It is just useful arithmetic. SAME can keep the same scalar rate as a 1024x/64 representation while making the time axis 4x shorter and each frame 4x wider.
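The fixture script itself is not reproduced in this post, but the arithmetic it reports is easy to re-derive. Here is a minimal sketch, assuming a 44.1 kHz sample rate. The function name `latent_clock` and its dictionary keys are made up for illustration and are not taken from the fixture.

```python
SAMPLE_RATE = 44_100  # samples per second per channel


def latent_clock(downsampling: int, width: int, sample_rate: int = SAMPLE_RATE) -> dict:
    """Return frame rate, frame duration, and scalar rate for a latent layout."""
    frames_per_second = sample_rate / downsampling
    return {
        "frames_per_second": frames_per_second,
        "ms_per_frame": 1000.0 / frames_per_second,
        "frames_per_minute": round(frames_per_second * 60),
        "scalars_per_second": frames_per_second * width,
    }


same = latent_clock(downsampling=4096, width=256)
baseline = latent_clock(downsampling=1024, width=64)

for name, clock in (("SAME 4096x/256", same), ("comparison 1024x/64", baseline)):
    print(
        f"{name}: {clock['frames_per_second']:.4f} Hz, "
        f"{clock['ms_per_frame']:.2f} ms/frame, "
        f"{clock['frames_per_minute']} frames/min, "
        f"{clock['scalars_per_second']:.2f} scalars/s"
    )

# Sequence-length reduction and scalar-rate ratio against the 1024x/64 layout.
length_reduction = baseline["frames_per_second"] / same["frames_per_second"]
scalar_rate_ratio = same["scalars_per_second"] / baseline["scalars_per_second"]
print(f"sequence_length_reduction: {length_reduction:.2f}x")
print(f"scalar_rate_ratio: {scalar_rate_ratio:.2f}x")
```

The two ratios are the whole argument in miniature: the time axis shrinks by 4x while the number of scalars per second stays put.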
That is why the title says slower before smaller. The key move is not only throwing information away. It is changing the shape of the problem a transformer sees:
more channels per frame
many fewer time positions
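For a transformer, time positions are expensive. With full self-attention, compute grows roughly quadratically with sequence length, so a 4x shorter sequence means about 16x fewer attention pairs. Masks, duration conditioning, inpainting spans, and long-range structure all operate on that same sequence. SAME is betting that music can be made easier to generate if the sequence is short and each position is wide.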
The rest of the paper is about making that bet work.
The First 256x Is Just A Reshape
SAME starts with a cheap move.
The stereo waveform has shape:
(B, 2, T)
SAME groups non-overlapping chunks of P=256 samples per channel. The patch vector concatenates left and right channel samples:
(B, 2, T) -> (B, 2P, T/P)
No learned filters yet. No bottleneck yet. The model has simply packed short stereo spans into larger vectors. Decoding applies the inverse reshape.
This is worth saying because it prevents a common misread of the architecture. The first 256x reduction is not where SAME decides what music means. It is the setup that turns waveform samples into a sequence of patch embeddings a transformer can resample.
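To see how little is happening in this step, here is a pure-Python sketch of the pack and unpack. It is not SAME's code. The helper names `patchify` and `unpatchify` are hypothetical, the batch dimension is dropped, frames are stored as a list of [2P] vectors (the transpose of the (2P, T/P) layout above), and the left-then-right ordering inside each frame is an assumption based on the word "concatenates".

```python
def patchify(wave, patch):
    """Pack [channels][T] samples into T/patch frames of channels*patch values.

    Each frame holds the left-channel chunk followed by the right-channel chunk.
    """
    length = len(wave[0])
    if any(len(channel) != length for channel in wave):
        raise ValueError("all channels must have the same length")
    if length % patch != 0:
        raise ValueError("T must be a multiple of the patch size")
    frames = []
    for start in range(0, length, patch):
        frame = []
        for channel in wave:
            frame.extend(channel[start:start + patch])
        frames.append(frame)
    return frames


def unpatchify(frames, channels, patch):
    """Inverse of patchify: rebuild [channels][T] from the packed frames."""
    wave = [[] for _ in range(channels)]
    for frame in frames:
        for index in range(channels):
            wave[index].extend(frame[index * patch:(index + 1) * patch])
    return wave


# Toy stereo signal: 8 samples per channel, patch size 4 (SAME uses P=256).
left = [0, 1, 2, 3, 4, 5, 6, 7]
right = [10, 11, 12, 13, 14, 15, 16, 17]
frames = patchify([left, right], patch=4)
restored = unpatchify(frames, channels=2, patch=4)
print(frames)
print(restored == [left, right])
```

Nothing is lost and nothing is learned: the round trip is exact, which is why this stage cannot be where the compression decisions are made.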
The TRB Query Token Decides What Survives
The learned temporal reduction happens in the Transformer Resampling Block, or TRB.
In encoder mode, SAME splits the patch sequence into groups of S embeddings. For SAME-L and SAME-S, S=16. It appends one learned output embedding to each group, runs transformer layers over the interleaved sequence, and keeps only the output embeddings.

Source: cropped from SAME Figure 2, page 3. The paper illustrates encoder-mode TRB interleaving for stride S=2: input embeddings are grouped with query/output tokens, processed by transformer layers, and only output embeddings are extracted.
The toy version looks like this:
x0 x1 ... x15 q -> transformer -> y
where q is a learned query/output token and y is the one latent frame that survives the group.
This is different from ordinary pooling:
pooling:
apply a fixed reduction rule
strided convolution:
learn local filters at fixed offsets
TRB:
insert a learned query token and let attention decide what to retain
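The encoder-mode bookkeeping can be sketched without any model at all. In this toy version the transformer is a stub that returns its input unchanged, so the "latents" are just the untouched query tokens. In the real block, attention would have written a summary of each group into those positions. All function names here are hypothetical.

```python
def interleave_queries(embeddings, stride, query):
    """Append one query/output token after every group of `stride` embeddings."""
    if len(embeddings) % stride != 0:
        raise ValueError("sequence length must be a multiple of the stride")
    sequence = []
    for start in range(0, len(embeddings), stride):
        sequence.extend(embeddings[start:start + stride])
        sequence.append(query)
    return sequence


def stub_transformer(sequence):
    """Placeholder for the transformer layers: returns the sequence unchanged."""
    return list(sequence)


def keep_outputs(sequence, stride):
    """Keep only the output positions, one per group of stride + 1 tokens."""
    return sequence[stride::stride + 1]


# 32 patch embeddings with stride 16 become 2 latent frames.
patches = [f"x{i}" for i in range(32)]
sequence = interleave_queries(patches, stride=16, query="q")
latents = keep_outputs(stub_transformer(sequence), stride=16)
print(len(sequence), latents)
```

The slicing makes the contract visible: everything except every 17th position is thrown away after the transformer runs, so whatever the latent keeps has to have been moved into the query slot by attention.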
In decoder mode the direction reverses. A latent frame is paired with S learned output embeddings. The transformer fills those outputs in, the latent frame is discarded, and the higher-rate patch sequence is unpatched back to waveform.
Every later loss in the paper is downstream of this retention decision. If the query token fails to keep pitch, stereo evidence, transients, or timbre, the decoder and auxiliary losses can only work around that missing information.
The paper does not prove that TRB is better than every alternative resampler. It does give a clean mechanism: SAME's 10.76 Hz frames are learned summary frames, not just pooled waveform chunks.
The Weird Part: Tidier Latents Got Worse
At this point it is tempting to say:
patching + TRB + normalization = good latent
The ablation table says no.
SAME does not use the mainline VAE bottleneck in its final configuration. It uses soft-normalization. The encoder output passes through a learnable per-channel affine transform, is divided by a running standard deviation, and gets a KL-like statistic penalty that pushes time and channel statistics toward zero mean and unit variance. The decoder reverses the normalization. Training also adds Gaussian latent noise so the decoder can tolerate imperfect latents.
That sounds exactly like the kind of cleanup a downstream generator should like.
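To make "KL-like statistic penalty" concrete, here is one common form such a penalty can take: the KL divergence between a Gaussian with the measured mean and variance and a standard normal. This is an assumption for illustration, since the post does not give the paper's exact formula. The helper names and the toy numbers are also hypothetical.

```python
import math


def stats(values):
    """Return the mean and (population) variance of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def gaussian_kl_to_standard_normal(mean, variance, eps=1e-8):
    """KL(N(mean, variance) || N(0, 1)) for a scalar Gaussian."""
    variance = max(variance, eps)
    return 0.5 * (mean ** 2 + variance - 1.0 - math.log(variance))


def soft_normalize(values, running_std, scale=1.0, shift=0.0):
    """Per-channel affine transform, then division by a running std."""
    return [(scale * v + shift) / running_std for v in values]


# One toy latent channel over four time steps: mean 5, variance 5.
raw = [2.0, 4.0, 6.0, 8.0]
penalty_before = gaussian_kl_to_standard_normal(*stats(raw))

# With a suitable shift and running std, the statistics land on (0, 1).
normalized = soft_normalize(raw, running_std=math.sqrt(5.0), shift=-5.0)
penalty_after = gaussian_kl_to_standard_normal(*stats(normalized))
print(penalty_before, penalty_after)
```

The penalty only sees first and second moments. A latent can score near zero here and still have a distribution that is awkward for a generator to model.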
But in SAME Table 2, soft-normalization alone is worse for generation.
| Config | Downsampling | Width | Bottleneck | Extra pressure | MELlog1p (↓) | FAD-CLAP (↓) | MuQEval (↑) |
|---|---|---|---|---|---|---|---|
| E | 1024x | 64 | VAE | none | 0.098 | 0.724 | 3.194 |
| A | 4096x | 256 | VAE | none | 0.108 | 0.651 | 3.252 |
| B | 4096x | 256 | soft-norm | none | 0.108 | 1.061 | 2.783 |
| C | 4096x | 256 | soft-norm | flow alignment | 0.103 | 0.593 | 3.340 |
| D | 4096x | 256 | soft-norm | flow + semantic + contrastive | 0.109 | 0.576 | 3.870 |
Source: SAME Table 2, page 8. Lower is better for MELlog1p and FAD-CLAP. Higher is better for MuQEval.
Row B is the story. The soft-normalized 4096x latent reconstructs as well as the 4096x VAE row by MELlog1p (0.108 in both). But the downstream diffusion transformer does much worse: FAD-CLAP rises from 0.651 to 1.061, and MuQEval falls from 3.252 to 2.783.
So the latent got numerically tidier, and generation got worse.
That is a much better lesson than "soft-normalization beats VAE." It does not. At least in this ablation, soft-normalization becomes useful only when it is given a generative job.
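The size of those gaps is easier to judge as relative changes. A small sketch using only the numbers in the table above. The dictionary layout and helper names are mine, not the paper's.

```python
# Values copied from the table: (MELlog1p, FAD-CLAP, MuQEval) per config.
TABLE_2 = {
    "E": {"mel": 0.098, "fad_clap": 0.724, "muqeval": 3.194},
    "A": {"mel": 0.108, "fad_clap": 0.651, "muqeval": 3.252},
    "B": {"mel": 0.108, "fad_clap": 1.061, "muqeval": 2.783},
    "C": {"mel": 0.103, "fad_clap": 0.593, "muqeval": 3.340},
    "D": {"mel": 0.109, "fad_clap": 0.576, "muqeval": 3.870},
}


def relative_change(new, old):
    """Fractional change from old to new (0.10 means 10% higher)."""
    return (new - old) / old


def compare(row, baseline):
    """Relative change of every metric in `row` against `baseline`."""
    return {
        metric: relative_change(TABLE_2[row][metric], TABLE_2[baseline][metric])
        for metric in TABLE_2[row]
    }


b_vs_a = compare("B", "A")  # soft-norm alone vs VAE
c_vs_a = compare("C", "A")  # soft-norm + flow alignment vs VAE
d_vs_c = compare("D", "C")  # adding semantic + contrastive pressure

for label, delta in (("B vs A", b_vs_a), ("C vs A", c_vs_a), ("D vs C", d_vs_c)):
    print(label, {metric: f"{value:+.1%}" for metric, value in delta.items()})
```

For B against A, reconstruction is unchanged while FAD-CLAP is about 63% higher and MuQEval about 14% lower. That is the signature of a latent that decodes fine and generates badly.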
SAME Trains A Small Generator Inside The Autoencoder
The generative job is L_diff.
During autoencoder training, SAME also trains a small unconditional diffusion transformer on the latent space. It has four layers and a hidden width of 768, and it is trained with a flow-matching objective.
During warmup, the small diffusion model sees detached latents. After warmup, gradients flow back into the encoder.
z -> tiny flow model -> can this latent distribution be predicted?
This little model is not Stable Audio 3. It is a training-time probe. It asks whether the current latent space is easy enough for a flow model to learn. If not, the encoder gets pressure to reshape the latent distribution.
That is why row C matters. Adding flow alignment recovers and surpasses the VAE generation metrics: FAD-CLAP goes from 0.651 (row A) to 0.593, and MuQEval from 3.252 to 3.340. The autoencoder is no longer optimizing only this path:
x -> E(x) -> D(E(x))
It is also optimizing for a downstream model that will have to produce z without seeing x.
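For readers who have not met flow matching, here is what one training example looks like under a common linear-path formulation. This is a sketch under assumptions: the post only says the probe uses "a flow-matching objective", so the interpolation path, the velocity target, and every name below are illustrative and not the paper's exact recipe. The warmup detach is also left out, because it needs an autograd framework.

```python
import random


def flow_matching_example(latent, noise, t):
    """Build one example for a linear-path flow-matching objective.

    Returns the interpolated point z_t and the velocity target (noise - latent).
    """
    z_t = [(1.0 - t) * z + t * n for z, n in zip(latent, noise)]
    velocity = [n - z for z, n in zip(latent, noise)]
    return z_t, velocity


def mse(prediction, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(prediction, target)) / len(target)


rng = random.Random(0)
latent = [0.5, -1.0, 2.0]  # stand-in for one flattened SAME latent
noise = [rng.gauss(0.0, 1.0) for _ in latent]
t = rng.random()

z_t, target = flow_matching_example(latent, noise, t)

# A probe that always predicts zero velocity pays the full loss;
# a probe that predicts the target exactly pays nothing.
zero_probe_loss = mse([0.0] * len(target), target)
oracle_loss = mse(target, target)
print(zero_probe_loss, oracle_loss)
```

The probe's loss depends on how the latents are distributed, so letting its gradient reach the encoder gives the encoder a reason to produce latents whose velocity field is easier to fit.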
A good follow-up experiment would not only report final FAD-CLAP and MuQEval. It would also plot:
flow-probe validation loss
probe convergence speed
downstream DiT convergence speed
decoded noised-latent quality
latent residual spectra
If SAME really made the latent easier to model, that should show up before the final sample-quality score.
"Semantic" Means Chroma, Stereo, And Text-Audio Alignment
The last row in Table 2 adds semantic and contrastive pressure.
Here "semantic" is not a vague promise that the latent understands music. The paper names the signals.
SAME trains lightweight regressors from the latent to:
- chroma features, which capture pitch-class structure;
- interaural level difference, or ILD, which captures stereo image.
It also trains a contrastive critic over:
SAME latent sequence
wavelet audio features
T5Gemma text embedding
The critic learns whether those three views came from the same input. Negatives come from rotating audio and text components within a batch, with masking and volume augmentation to reduce shortcuts.
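One plausible reading of "rotating within a batch" is a cyclic shift of one view while the others stay put. The sketch below builds matched and mismatched triples that way. The shift scheme and all names are assumptions, and the masking and volume augmentation are not shown.

```python
def rotate(items, shift):
    """Cyclically shift a list so item i is replaced by item i + shift."""
    shift %= len(items)
    return items[shift:] + items[:shift]


def build_triples(latents, audio_feats, text_embs, shift=1):
    """Return (positives, negatives) for a three-view contrastive critic.

    Positives keep all three views aligned. Negatives rotate either the
    audio features or the text embeddings so one view no longer matches.
    """
    positives = list(zip(latents, audio_feats, text_embs))
    negatives = (
        list(zip(latents, rotate(audio_feats, shift), text_embs))
        + list(zip(latents, audio_feats, rotate(text_embs, shift)))
    )
    return positives, negatives


# Labels stand in for tensors so the pairing is easy to read.
latents = ["z0", "z1", "z2"]
audio_feats = ["a0", "a1", "a2"]
text_embs = ["t0", "t1", "t2"]

positives, negatives = build_triples(latents, audio_feats, text_embs)
print(positives[0], negatives[0], negatives[3])
```

Because the negatives reuse real items from the same batch, the critic cannot separate them from positives by spotting "fake-looking" features. It has to check whether the views actually agree.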
So the semantic contract is something like:
keep pitch-class structure recoverable
keep stereo image recoverable
keep audio-text alignment visible
still reconstruct waveform detail
Row D improves FAD-CLAP (0.593 to 0.576) and MuQEval (3.340 to 3.870) over row C, while MELlog1p slips slightly (0.103 to 0.109). That is the right shape of result if the latent is being pushed to preserve music-relevant directions rather than only local reconstruction detail.
It also says how to test failures. Average latent error is not enough. For a predicted latent:
r = z_hat - z
Ask where r points. Does it align with chroma directions? ILD directions? Onset-sensitive directions? Particular channel groups? Decode z + r and compare it with norm-matched random perturbations. A small error in a harmless latent direction may decode as nothing. A small error in a pitch or stereo direction may be musically obvious.
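The measurement half of that test needs no model. A minimal sketch: compute the residual, check its alignment with a probe direction, and build a random control with the same norm. The `chroma_direction` vector is a made-up stand-in for a direction you would get from a trained regressor, and the decode step is left out because it needs the real SAME decoder.

```python
import math
import random


def norm(vector):
    """Euclidean (L2) norm."""
    return math.sqrt(sum(x * x for x in vector))


def cosine(a, b):
    """Cosine similarity, or 0.0 if either vector is all zeros."""
    denominator = norm(a) * norm(b)
    if denominator == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / denominator


def norm_matched_noise(residual, rng):
    """Random Gaussian direction rescaled to the residual's L2 norm."""
    noise = [rng.gauss(0.0, 1.0) for _ in residual]
    scale = norm(residual) / norm(noise)
    return [scale * x for x in noise]


z = [1.0, 0.0, 2.0, -1.0]      # oracle latent (toy values)
z_hat = [1.0, 0.6, 2.0, -1.8]  # predicted latent (toy values)
r = [p - q for p, q in zip(z_hat, z)]  # r = z_hat - z

chroma_direction = [0.0, 1.0, 0.0, 0.0]  # hypothetical probe direction
alignment = cosine(r, chroma_direction)

control = norm_matched_noise(r, random.Random(0))
print(alignment, norm(r), norm(control))
```

The control matters because it fixes the error budget. If decoding z + r sounds worse than decoding z + control at the same norm, the damage comes from where the error points and not from how large it is.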
Stable Audio 3 Freezes The Ear
Stable Audio 3 makes the interface role explicit.
The SA3 paper freezes SAME and trains a diffusion transformer over SAME latents. The small SA3 model uses SAME-S; medium and large use SAME-L. The transformer projects SAME's 256-D frames into its model width, prepends memory embeddings, predicts in latent space, then projects back to 256 channels before SAME decodes the result.
In public code, the autoencoder pretransform sits beside the DiT. The wrapper freezes the autoencoder and handles scale conversion around encode/decode. Requested duration becomes a latent length through the downsampling ratio. Inpainting also happens on the SAME clock: reference audio is encoded, the mask is downsampled, and the model receives masked SAME latents plus a mask.
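The clock conversion is simple enough to sketch. This is not the public wrapper's code: the function names are hypothetical, rounding the duration up is an assumption, and so is the rule that a latent frame counts as masked if any sample inside it is masked.

```python
import math

SAMPLE_RATE = 44_100
DOWNSAMPLING = 4096


def latent_length(seconds, sample_rate=SAMPLE_RATE, ratio=DOWNSAMPLING):
    """Number of latent frames needed to cover `seconds` of audio (rounded up)."""
    return math.ceil(seconds * sample_rate / ratio)


def downsample_mask(sample_mask, ratio=DOWNSAMPLING):
    """Collapse a per-sample boolean mask to one flag per latent frame.

    A frame is flagged if any sample inside it is flagged (an assumption).
    """
    return [
        any(sample_mask[start:start + ratio])
        for start in range(0, len(sample_mask), ratio)
    ]


print(latent_length(60))  # one minute of audio
print(latent_length(1))   # one second still needs whole frames

# Toy mask with ratio 4: a single flagged sample marks its whole frame.
toy_mask = [False, False, False, False, True, False, False, False]
print(downsample_mask(toy_mask, ratio=4))
```

Either rounding choice has a cost. With about 92.88 ms per frame, an edit boundary requested at sample precision can only land on a frame edge, so mask boundaries are worth testing explicitly.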
That means SAME is not only a high-compression audio autoencoder. It is the coordinate system that SA3 learns to write in.
The SAME-S / SAME-L split reinforces that. SAME-L is the large quality model: 852M parameters, 1536 transformer width, 12 blocks, sliding-window attention. SAME-S is the deployment model: 108M parameters, 768 width, 6 blocks, chunked attention with midpoint shift.
SAME-S is distilled from SAME-L. It aligns student and teacher latents and also uses cross-decoded paths:
D_S(z_S) student encoder -> student decoder
D_T(z_S) student encoder -> teacher decoder
D_S(z_T) teacher encoder -> student decoder
That is compatibility training. The small model is not merely trying to sound similar. It is trying to speak the same latent language.
What To Test If You Build On This
A normal reconstruction test is still necessary:
z = same.encode(audio)
audio_hat = same.decode(z)
But the generator-facing test is the real one:
z_hat = model(text, duration, mask)
audio_hat = same.decode(z_hat)
For SAME-like latents, I would test four things.
First, log the clock. For every candidate autoencoder, write down:
frames per second
channels per frame
scalars per second
frames per minute
Do not compare "4096x" against "1024x" without also comparing width. SAME's interesting move is the rate/width trade.
Second, test the Table 2 failure mode. A normalized latent is not automatically easy to generate. Track probe loss, downstream generator loss, and decoded noised-latent quality.
Third, test residual directions. Decode structured prediction errors, not only random noise.
Fourth, test SAME-S/SAME-L as an interface:
| Encoder | Decoder | What it tests |
|---|---|---|
| E_L | D_L | large-model oracle |
| E_S | D_S | small-model oracle |
| E_S | D_L | whether small latents decode through the teacher |
| E_L | D_S | whether the small decoder understands teacher latents |
If the cross paths fail, the student can look fine in direct reconstruction while still breaking cached latents, edit masks, or downstream generators.
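A harness for the four pairings can be very small. In this sketch the encoders and decoders are toy lambdas so the code runs on its own. In practice you would pass in the real SAME-S and SAME-L encode and decode callables and a perceptual or spectral distance. None of the names below come from the public code.

```python
from itertools import product


def cross_decode_report(audio, encoders, decoders, distance):
    """Score the round trip for every encoder/decoder pairing."""
    report = {}
    for (enc_name, encode), (dec_name, decode) in product(
        encoders.items(), decoders.items()
    ):
        report[(enc_name, dec_name)] = distance(audio, decode(encode(audio)))
    return report


def mean_abs_error(reference, estimate):
    """Mean absolute difference between two equal-length signals."""
    return sum(abs(a - b) for a, b in zip(reference, estimate)) / len(reference)


# Stand-in models: the "latent" is just the signal scaled by 2.
encoders = {
    "E_L": lambda x: [2.0 * v for v in x],
    "E_S": lambda x: [2.0 * v for v in x],
}
decoders = {
    "D_L": lambda z: [v / 2.0 for v in z],
    "D_S": lambda z: [v / 2.0 for v in z],
}

report = cross_decode_report([0.1, -0.2, 0.3], encoders, decoders, mean_abs_error)
for pair, score in report.items():
    print(pair, score)
```

Reporting all four cells side by side is the point. A student that only matches the diagonal has learned to sound similar, not to share the latent space.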
Technical Takeaways
SAME's Main Trick Is The Slow Clock
SAME's 4096x/256-D latent is a rate/width trade. At 44.1 kHz it runs at about 10.76 frames per second: roughly 646 frames per minute. Compared with a 1024x/64 representation, it can keep the same scalar rate while giving a transformer 4x fewer time positions.
Builder action: log frame rate, width, scalar rate, and frames per minute before arguing that one audio latent is "more compressed" or "easier to model."
Tidy Latents Are Not Necessarily Predictable Latents
SAME Table 2 is the key evidence. Soft-normalization alone reconstructs about as well as the 4096x VAE row, but downstream generation metrics get worse. Adding the flow-alignment probe recovers those metrics; adding semantic and contrastive pressure improves them further.
Builder action: treat normalization as a starting condition, not proof that a latent is generator-friendly.
SAME Is The Frozen Language Stable Audio 3 Writes In
Stable Audio 3 freezes SAME and predicts SAME latents. Duration, inpainting, model tiers, decoding, and small/large compatibility all operate on this 10.76 Hz latent clock.
Builder action: test predicted-latent decode, residual directions, mask boundaries, and SAME-S/SAME-L cross-decode paths. A clean oracle decode(encode(audio)) loop is not enough.
Sources
- SAME: A Semantically-Aligned Music Autoencoder
- Stable Audio 3
- Stable Audio 3 research post
- Stable Audio 3 public code: model wrapper
- Stable Audio 3 public code: pretransform wrapper
- Stable Audio 3 public code: inpainting path