[Transformers Integration] Understanding Voxtral Realtime architecture for porting
Hi Mistral team 👋
I’m interested in contributing a Hugging Face Transformers integration for Voxtral Mini 4B Realtime 2602. After reading the vLLM implementation, here’s my current understanding (see attached GIF):
- Audio → mel → Whisper conv + pooling (~80 ms / token) → causal / sliding-window audio encoder → adapter → audio_embeds
- LLM input is an element-wise sum (not concat): audio_embeds + text_embeds (plus delay/time conditioning); see the sketch below
- [STREAMING_PAD] fills the left-pad + initial delay window (e.g., ~480 ms ≈ 6 tokens at 80 ms/token)
- Decoding is AR text-only: only generated text tokens are fed back; audio continues streaming step by step
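For concreteness, here is a minimal sketch of the fusion step as I currently picture it (all shapes and names below are my own placeholders, not the actual Voxtral implementation):

```python
import torch

# Illustrative numbers only -- placeholders, not the real Voxtral config.
d_model = 3072                      # hypothetical decoder hidden size
frame_ms = 80                       # ~80 ms of audio per audio token
delay_ms = 480                      # initial delay window
n_pad = delay_ms // frame_ms        # -> 6 [STREAMING_PAD] positions

seq_len = 32
audio_embeds = torch.randn(1, seq_len, d_model)  # causal audio encoder + adapter output
text_embeds = torch.randn(1, seq_len, d_model)   # [STREAMING_PAD] / generated-text embeddings
time_cond = torch.randn(1, seq_len, d_model)     # delay / time conditioning (placeholder)

# Element-wise sum along the hidden dimension, not concatenation along the sequence.
decoder_inputs = audio_embeds + text_embeds + time_cond
print(decoder_inputs.shape, f"(first {n_pad} text positions are [STREAMING_PAD])")
```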
Questions
- Is the summary above correct?
- Is the causal audio encoder documented anywhere beyond the vLLM code?
- For Transformers, would a custom streaming wrapper (step-wise max_new_tokens=1 + incremental audio chunks; roughly as sketched after this list) be acceptable, or is there a preferred integration pattern?
- Are [STREAMING_PAD] + delay/time embedding baked into the weights/config, or mostly tokenizer-level handling?
- Any plans for a technical paper?
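To make the streaming-wrapper question concrete, here is roughly what I have in mind, with a stub in place of the real model (illustrative only, not an actual Transformers API):

```python
import torch

class StubRealtimeModel:
    """Placeholder standing in for a hypothetical Transformers realtime model class."""
    def step(self, audio_chunk, prev_token_id, cache):
        # A real model would: encode the new audio chunk causally, fuse it with the
        # embedding of the previously generated text token, run one decoder step with
        # the KV cache, and return (next-token logits, updated cache).
        return torch.randn(1, 1000), cache

def stream_generate(model, audio_chunks, bos_id=1, eos_id=2, max_steps=100):
    """Step-wise loop: one incoming audio chunk per step, at most one text token out."""
    token_id, cache, out = bos_id, None, []
    for step, chunk in enumerate(audio_chunks):
        logits, cache = model.step(chunk, token_id, cache)
        token_id = int(logits.argmax(-1))   # greedy; effectively max_new_tokens=1 per step
        if token_id == eos_id or step >= max_steps:
            break
        out.append(token_id)
    return out

chunks = [torch.randn(1, 8, 128) for _ in range(10)]  # ten dummy ~80 ms mel chunks
print(stream_generate(StubRealtimeModel(), chunks))
```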
Happy to start a PR once the approach is aligned — thanks!
cc @patrickvonplaten, @pandora-s, @iliasslasri, @juliendenize, @sebag90, @sanchit-gandhi
Technical paper will come out as well. Your animation above looks correct and is very nice. Will try to help bring the transformers PR over the line.
Very nice animation @Seungyoun !
- Note that the encoder is a causal audio encoder trained from scratch (whereas Whisper is bi-directional), so new modelling code is required
- The time-delay is embedded using a sin/cos embedding, then projected via an MLP and used to modulate the residual stream in the text decoder (roughly as sketched after this list)
- For both points above, vLLM is the source of truth
- The paper will be out shortly to motivate these decisions
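Purely to illustrate the shape of the time-delay conditioning, a rough sketch (not the exact Voxtral formulation; vLLM remains the reference):

```python
import math
import torch
import torch.nn as nn

def sincos_embedding(t, dim):
    """Standard sinusoidal embedding of a scalar time/delay value."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = t * freqs
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

class TimeConditioning(nn.Module):
    """Sketch: sin/cos embed the delay, project with an MLP, and use the result to
    modulate the decoder residual stream (a simple additive shift is shown here;
    the exact modulation in Voxtral may differ -- check vLLM)."""
    def __init__(self, d_model, d_time=256):
        super().__init__()
        self.d_time = d_time
        self.mlp = nn.Sequential(
            nn.Linear(d_time, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, hidden, delay_ms):
        cond = self.mlp(sincos_embedding(torch.tensor(float(delay_ms)), self.d_time))
        return hidden + cond  # broadcast over (batch, seq, d_model)

hidden = torch.randn(1, 16, 1024)
print(TimeConditioning(1024)(hidden, delay_ms=480.0).shape)
```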
@sanchit-gandhi
Could you please open-source the inference code as a standalone GitHub repository (rather than only inside the vLLM repository)?
In addition, we would appreciate it if the training code could be open-sourced as well.
We would be extremely grateful if you could provide the training code.
That's not planned at the moment.
We're happy to answer some questions.
Transformers code will come out soon and should help with training: https://github.com/huggingface/transformers/pull/43769
Since the overall model architecture cannot be fully understood from the Voxtral-related inference code inside vLLM alone, I have adopted the following approach for LoRA fine-tuning. (The weights load successfully via vLLM, but their effectiveness has not yet been verified.)
Training layers: only train LoRA weights for the language_model module
Specific layers:
- o_proj (output projection)
- down_proj (MLP down projection)
- gate_up_proj (split into gate_proj + up_proj)
- qkv_proj (split into q_proj + k_proj + v_proj)
Method:
- Custom LoRA wrappers (bypasses PEFT, directly wraps vLLM parallel layers); a sketch follows this list
- Filters out whisper_encoder and audio_language_adapter (not supported by vLLM)
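For reference, a stripped-down version of the kind of LoRA wrapper I mean; the real one wraps vLLM's parallel linear layers, while this sketch uses a plain nn.Linear just to stay self-contained:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = base(x) + (alpha / r) * B(A(x)).
    In my setup the wrapped layers are vLLM parallel linears inside language_model
    (qkv_proj, o_proj, gate_up_proj, down_proj); a plain nn.Linear is used here
    only to keep the sketch self-contained."""
    def __init__(self, base: nn.Linear, r=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze the base weights
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)      # so the wrapper starts as the base layer
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

wrapped = LoRALinear(nn.Linear(1024, 1024))
print(wrapped(torch.randn(2, 1024)).shape)
```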
Could you help check if my approach makes sense before I verify its effectiveness? I have a feeling it’s gonna be another wasted effort, haha.
