Whisper Burn GGUF β€” Q4_0 Quantized Models

Q4_0 quantized GGUF versions of OpenAI's Whisper models, optimized for GPU inference with whisper-burn.

Files

File   Model              Size      Parameters
\      Whisper Large V3   ~800 MB   1550M (32 encoder + 32 decoder layers)
\      Whisper Medium     ~604 MB   769M (24 encoder + 24 decoder layers)
\      BPE tokenizer      ~2.1 MB   Shared by all models

Quantization Details

  • Format: GGUF v3 with Q4_0 quantization
  • What's quantized: 2D weight matrices with dimensions > 256 are quantized to 4-bit (Q4_0 blocks: an f16 scale plus 16 packed nibble bytes per 32 elements; see the sketch after this list)
  • What stays F32: Token embeddings, positional embeddings, biases, layer norms, and small matrices
  • Conversion script: \ from the whisper-burn repository
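
For concreteness, here is a minimal sketch of how a single Q4_0 block decodes, following the standard ggml/GGUF layout described above. It assumes the half crate for f16 handling; the function name is illustrative and is not whisper-burn's actual API.

```rust
use half::f16;

/// Elements per Q4_0 block: an f16 scale plus 16 packed bytes
/// (two 4-bit quants per byte) encode 32 weights in 18 bytes.
const QK4_0: usize = 32;

/// Dequantize one Q4_0 block into 32 f32 values. In the ggml layout,
/// byte j holds element j in its low nibble and element j + 16 in its
/// high nibble; each 4-bit quant is offset by 8 before scaling.
fn dequantize_q4_0_block(scale: f16, qs: &[u8; 16]) -> [f32; QK4_0] {
    let d = scale.to_f32();
    let mut out = [0.0f32; QK4_0];
    for (j, &byte) in qs.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8; // element j
        let hi = (byte >> 4) as i32 - 8;   // element j + 16
        out[j] = lo as f32 * d;
        out[j + 16] = hi as f32 * d;
    }
    out
}
```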

Model Comparison

Model      Mel bins   Hidden dim   Encoder layers   Decoder layers   Accuracy   Speed
Large V3   128        1280         32               32               Best       Slower
Medium     80         1024         24               24               Good       Faster

Usage with whisper-burn

These models are downloaded automatically by the whisper-burn desktop application. You can also download them manually; place all files in a \ directory next to the whisper-burn executable.
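
If you prefer to script the manual download, the sketch below uses Hugging Face's hf-hub crate. The repository id and file name are placeholders, not the real entries; substitute the actual ones from the Files table above.

```rust
use hf_hub::api::sync::Api;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api = Api::new()?;
    // Placeholder repo id and file name: substitute the actual
    // repository and the GGUF file names from the Files table.
    let repo = api.model("your-namespace/whisper-burn-gguf".to_string());
    let local_path = repo.get("whisper-large-v3-q4_0.gguf")?;
    println!("downloaded to {}", local_path.display());
    Ok(())
}
```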

About whisper-burn

whisper-burn is a native Rust implementation of OpenAI's Whisper using the Burn ML framework with GPU acceleration via wgpu (Vulkan/Metal/DirectX).

Key features:

  • Pure Rust β€” no Python, no ONNX, no external runtime
  • GPU-accelerated β€” custom WGSL compute shaders for fused Q4 dequantization + matrix multiplication
  • Push-to-Talk β€” global hotkey with support for any key combo including modifier-only (e.g. Ctrl+Win)
  • 99+ languages β€” all Whisper-supported languages + automatic detection
  • Auto-paste β€” transcribed text automatically pasted into the active application
  • Windows native β€” desktop app with dark theme UI

Inference Pipeline

audio (16 kHz mono) → log-mel spectrogram (80 or 128 bins) → encoder → autoregressive decoder → BPE detokenization → text

Source Models

  • openai/whisper-large-v3
  • openai/whisper-medium
License

The quantized weights inherit the license from the original OpenAI Whisper models (MIT License).
