[Transformers Integration] Understanding Voxtral Realtime architecture for porting
Hi Mistral team 👋
I’m interested in contributing a Hugging Face Transformers integration for Voxtral Mini 4B Realtime 2602. After reading the vLLM implementation, here’s my current understanding (see attached GIF):
- Audio → mel → Whisper conv + pooling (~80 ms / token) → causal / sliding-window audio encoder → adapter → audio_embeds
- LLM input is an element-wise sum (not concat): audio_embeds + text_embeds (plus delay/time conditioning); see the sketch below
- [STREAMING_PAD] fills the left-pad + initial delay window (e.g., ~480 ms ≈ 6 tokens at 80 ms/token)
- Decoding is AR text-only: only generated text tokens are fed back; audio continues streaming step by step
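For concreteness, here is a minimal sketch of the fusion step as I currently picture it (all shapes and names below are my own placeholders, not the actual Voxtral implementation):

```python
import torch

# Illustrative numbers only -- placeholders, not the real Voxtral config.
d_model = 3072                      # hypothetical decoder hidden size
frame_ms = 80                       # ~80 ms of audio per audio token
delay_ms = 480                      # initial delay window
n_pad = delay_ms // frame_ms        # -> 6 [STREAMING_PAD] positions

seq_len = 32
audio_embeds = torch.randn(1, seq_len, d_model)  # causal audio encoder + adapter output
text_embeds = torch.randn(1, seq_len, d_model)   # [STREAMING_PAD] / generated-text embeddings
time_cond = torch.randn(1, seq_len, d_model)     # delay / time conditioning (placeholder)

# Element-wise sum along the hidden dimension, not concatenation along the sequence.
decoder_inputs = audio_embeds + text_embeds + time_cond
print(decoder_inputs.shape, f"(first {n_pad} text positions are [STREAMING_PAD])")
```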
Questions
- Is the summary above correct?
- Is the causal audio encoder documented anywhere beyond the vLLM code?
- For Transformers, would a custom streaming wrapper (step-wise max_new_tokens=1 + incremental audio chunks; roughly as sketched after this list) be acceptable, or is there a preferred integration pattern?
- Are [STREAMING_PAD] + delay/time embedding baked into the weights/config, or mostly tokenizer-level handling?
- Any plans for a technical paper?
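To make the streaming-wrapper question concrete, here is roughly what I have in mind, with a stub in place of the real model (illustrative only, not an actual Transformers API):

```python
import torch

class StubRealtimeModel:
    """Placeholder standing in for a hypothetical Transformers realtime model class."""
    def step(self, audio_chunk, prev_token_id, cache):
        # A real model would: encode the new audio chunk causally, fuse it with the
        # embedding of the previously generated text token, run one decoder step with
        # the KV cache, and return (next-token logits, updated cache).
        return torch.randn(1, 1000), cache

def stream_generate(model, audio_chunks, bos_id=1, eos_id=2, max_steps=100):
    """Step-wise loop: one incoming audio chunk per step, at most one text token out."""
    token_id, cache, out = bos_id, None, []
    for step, chunk in enumerate(audio_chunks):
        logits, cache = model.step(chunk, token_id, cache)
        token_id = int(logits.argmax(-1))   # greedy; effectively max_new_tokens=1 per step
        if token_id == eos_id or step >= max_steps:
            break
        out.append(token_id)
    return out

chunks = [torch.randn(1, 8, 128) for _ in range(10)]  # ten dummy ~80 ms mel chunks
print(stream_generate(StubRealtimeModel(), chunks))
```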
Happy to start a PR once the approach is aligned — thanks!
cc @patrickvonplaten, @pandora-s, @iliasslasri, @juliendenize, @sebag90, @sanchit-gandhi
Technical paper will come out as well. Your animation above looks correct and is very nice. Will try to help bring the transformers PR over the line.
Very nice animation @Seungyoun !
- Note that the encoder is a causal audio encoder trained from scratch (whereas Whisper is bi-directional), so new modelling code is required
- The time-delay is embedded using a sin/cos embedding, then projected via an MLP and used to modulate the residual stream in the text decoder (roughly as sketched after this list)
- For both points above, vLLM is the source of truth
- The paper will be out shortly to motivate these decisions
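Purely to illustrate the shape of the time-delay conditioning, a rough sketch (not the exact Voxtral formulation; vLLM remains the reference):

```python
import math
import torch
import torch.nn as nn

def sincos_embedding(t, dim):
    """Standard sinusoidal embedding of a scalar time/delay value."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = t * freqs
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

class TimeConditioning(nn.Module):
    """Sketch: sin/cos embed the delay, project with an MLP, and use the result to
    modulate the decoder residual stream (a simple additive shift is shown here;
    the exact modulation in Voxtral may differ -- check vLLM)."""
    def __init__(self, d_model, d_time=256):
        super().__init__()
        self.d_time = d_time
        self.mlp = nn.Sequential(
            nn.Linear(d_time, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, hidden, delay_ms):
        cond = self.mlp(sincos_embedding(torch.tensor(float(delay_ms)), self.d_time))
        return hidden + cond  # broadcast over (batch, seq, d_model)

hidden = torch.randn(1, 16, 1024)
print(TimeConditioning(1024)(hidden, delay_ms=480.0).shape)
```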
@sanchit-gandhi
Could you please open-source the inference code as a standalone GitHub repository (rather than only inside the vLLM repository)?
In addition, we would appreciate it if the training code could be open-sourced as well.
We would be extremely grateful if you could provide the training code.
That's not planned at the moment.
We're happy to answer some questions.
Transformers code will come out soon and should help with training: https://github.com/huggingface/transformers/pull/43769
Since the overall model architecture cannot be fully understood from the Voxtral-related inference code inside vLLM alone, I have adopted the following approach for LoRA fine-tuning. (The weights load successfully via vLLM, but their effectiveness has not yet been verified.)
Training layers: only train LoRA weights for the language_model module
Specific layers:
- o_proj (output projection)
- down_proj (MLP down projection)
- gate_up_proj (split into gate_proj + up_proj)
- qkv_proj (split into q_proj + k_proj + v_proj)
Method:
- Custom LoRA wrappers (bypasses PEFT, directly wraps vLLM parallel layers); a sketch follows this list
- Filters out whisper_encoder and audio_language_adapter (not supported by vLLM)
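For reference, a stripped-down version of the kind of LoRA wrapper I mean; the real one wraps vLLM's parallel linear layers, while this sketch uses a plain nn.Linear just to stay self-contained:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = base(x) + (alpha / r) * B(A(x)).
    In my setup the wrapped layers are vLLM parallel linears inside language_model
    (qkv_proj, o_proj, gate_up_proj, down_proj); a plain nn.Linear is used here
    only to keep the sketch self-contained."""
    def __init__(self, base: nn.Linear, r=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze the base weights
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)      # so the wrapper starts as the base layer
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

wrapped = LoRALinear(nn.Linear(1024, 1024))
print(wrapped(torch.randn(2, 1024)).shape)
```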
Could you help check if my approach makes sense before I verify its effectiveness? I have a feeling it’s gonna be another wasted effort, haha.
