TrevorJS committed ef4c966 (verified, parent: c5952a2): Upload README.md with huggingface_hub

Files changed (1): README.md (+68 −0)
---
license: apache-2.0
language:
- en
- fr
- de
- es
- it
- pt
- nl
- pl
- ru
- ja
- ko
- zh
- ar
- hi
tags:
- speech
- asr
- voxtral
- gguf
- q4
- webgpu
- wasm
- streaming
library_name: burn
pipeline_tag: automatic-speech-recognition
base_model: mistralai/Voxtral-Mini-4B-Realtime-2602
---

# Voxtral Mini 4B Realtime — Q4 GGUF

Q4_0 quantized GGUF weights for [Voxtral Mini 4B Realtime](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602), converted for use with [voxtral-mini-realtime-rs](https://github.com/TrevorS/voxtral-mini-realtime-rs).

## Files

| File | Size | Description |
|------|------|-------------|
| `voxtral-q4.gguf` | 2.51 GB | Q4_0 quantized model weights (GGUF v3) |
| `tekken.json` | 14.9 MB | Tekken tokenizer (131,072 vocab) |
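
A GGUF file starts with a fixed little-endian header: the 4-byte magic `GGUF`, a uint32 format version, then uint64 tensor and metadata-KV counts. As a quick sanity check that a downloaded file really matches the "GGUF v3" label in the table, here is a minimal sketch (the helper name is illustrative, not part of the repo):

```python
import struct

def read_gguf_header(data: bytes):
    """Parse the fixed GGUF header: magic, version, tensor count, KV count."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return version, n_tensors, n_kv

# Example against a synthetic header (version 3, 291 tensors, 24 KV pairs);
# for the real check, pass the first 24 bytes of voxtral-q4.gguf.
hdr = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(hdr))  # (3, 291, 24)
```

If the version printed is not 3, the file is from a different GGUF revision than the one this repo targets.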

## Usage

### Native CLI

```bash
cargo run --features "wgpu,cli,hub" --bin voxtral-transcribe -- \
  --audio input.wav \
  --gguf models/voxtral-q4.gguf \
  --tokenizer models/voxtral/tekken.json
```

### Browser (WASM + WebGPU)

The Q4 GGUF is designed to run entirely client-side in a browser tab via WebGPU. See the [GitHub repo](https://github.com/TrevorS/voxtral-mini-realtime-rs) for the full WASM build and dev server setup.

## Quantization Details

- **Method**: Q4_0 (4-bit quantization, block size 32, 18 bytes per block)
- **Original model**: [mistralai/Voxtral-Mini-4B-Realtime-2602](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) (~16 GB F32)
- **Quantized size**: ~2.5 GB (fits in browser memory)
- **Inference**: Custom WGSL shader for fused GPU dequantize + matmul
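
The figures above check out arithmetically: a Q4_0 block packs 32 weights into an f16 scale (2 bytes) plus 32 nibbles (16 bytes), i.e. 0.5625 bytes per weight. A back-of-envelope sketch (the ~4B parameter count is the model's nominal size, not an exact tensor count):

```python
# Q4_0 layout: per 32-weight block, one f16 scale (2 bytes)
# plus 32 packed 4-bit values (16 bytes) -> 18 bytes total.
BLOCK_SIZE = 32
BYTES_PER_BLOCK = 2 + BLOCK_SIZE // 2  # 18

bytes_per_weight = BYTES_PER_BLOCK / BLOCK_SIZE  # 0.5625

# Rough size if all ~4B parameters were Q4_0; the shipped 2.51 GB file
# is somewhat larger because some tensors stay at higher precision.
params = 4_000_000_000  # approximate
size_gb = params * bytes_per_weight / 1e9
print(f"{size_gb:.2f} GB")
```

Relative to ~16 GB of F32 weights, that is roughly a 7x reduction, which is what makes the in-browser use case feasible.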

## Links

- **Code**: [github.com/TrevorS/voxtral-mini-realtime-rs](https://github.com/TrevorS/voxtral-mini-realtime-rs)
- **Original model**: [mistralai/Voxtral-Mini-4B-Realtime-2602](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602)