# Kimi-K2.5 Vision Weights
Vision-only weights extracted from moonshotai/Kimi-K2.5 for use with MLX-based inference.
## Contents
- `kimi_k25_vision.safetensors`: 335 tensors, ~899 MB (BF16)
  - `vision_tower.*`: 329 tensors (MoonViT encoder, 27 layers)
  - `mm_projector.*`: 6 tensors (PatchMergerMLP projector)
- `config.json`: vision config + projector metadata
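A quick way to sanity-check the download is to list the tensor keys without loading any data (a sketch; it assumes the file is in the current directory and uses the torch backend of `safetensors`):

```python
from safetensors import safe_open

# Listing keys does not materialize tensor data, so this is cheap.
with safe_open("kimi_k25_vision.safetensors", framework="pt") as f:
    keys = list(f.keys())

vision = [k for k in keys if k.startswith("vision_tower.")]
projector = [k for k in keys if k.startswith("mm_projector.")]
print(len(keys), len(vision), len(projector))  # expected: 335 329 6
```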
## Architecture
| Component | Details |
|---|---|
| Vision Encoder | MoonViT: 27 layers, 1152 hidden, 16 heads, patch_size=14 |
| Patch Merger | 2×2 spatial merge + temporal pool (no learned params) |
| Projector | LayerNorm(1152) → Linear(4608→4608) → GELU → Linear(4608→7168) |
| Total params | ~450M |
The vision encoder is identical to the one in Kimi-VL-A3B. The only difference is the projector output dimension (7168 for K2.5 vs 2048 for A3B), which matches K2.5's text backbone hidden size.
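For intuition, the merge-plus-projector path from the table above can be sketched in a few lines of MLX (a sketch only; the class and attribute names and the exact 2×2 grouping order are illustrative assumptions, not the checkpoint's key layout):

```python
import mlx.core as mx
import mlx.nn as nn

class PatchMergerMLP(nn.Module):
    """Parameter-free 2x2 merge followed by the learned projector MLP."""

    def __init__(self, vit_dim: int = 1152, text_dim: int = 7168):
        super().__init__()
        merged = 4 * vit_dim                      # 2x2 merge: 4 * 1152 = 4608
        self.pre_norm = nn.LayerNorm(vit_dim)     # LayerNorm(1152)
        self.fc1 = nn.Linear(merged, merged)      # Linear(4608 -> 4608)
        self.fc2 = nn.Linear(merged, text_dim)    # Linear(4608 -> 7168)

    def __call__(self, patches: mx.array) -> mx.array:
        # patches: (num_patches, 1152), num_patches divisible by 4.
        # The real merge groups 2x2 spatial neighbors; consecutive
        # grouping here is a simplification.
        x = self.pre_norm(patches)
        x = x.reshape(-1, 4 * patches.shape[-1])
        return self.fc2(nn.gelu(self.fc1(x)))     # (num_patches / 4, 7168)
```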
## Usage
These weights are designed to be loaded alongside the text-only mlx-community/Kimi-K2.5 model to enable vision-language capabilities.
The vision tower and projector together map each image to (N, 7168) embedding vectors that replace the media placeholder tokens in the text embedding stream.
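Concretely, that splice can look like this (a sketch; `media_token_id` and the one-embedding-per-placeholder convention are assumptions about the tokenizer, not documented behavior):

```python
import mlx.core as mx

def splice_image_embeddings(
    text_embeds: mx.array,   # (seq_len, 7168) text-backbone input embeddings
    image_embeds: mx.array,  # (N, 7168) vision tower + projector output
    token_ids: mx.array,     # (seq_len,) tokenized prompt
    media_token_id: int,     # hypothetical placeholder token id
) -> mx.array:
    positions = [i for i, t in enumerate(token_ids.tolist()) if t == media_token_id]
    assert len(positions) == image_embeds.shape[0]
    for i, pos in enumerate(positions):
        text_embeds[pos] = image_embeds[i]  # overwrite placeholder row
    return text_embeds
```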
## Reproduction
Extracted from shards 63+64 of moonshotai/Kimi-K2.5 (public, not gated).
No modifications to the weights; original BF16 precision preserved.
To reproduce from scratch:
```bash
pip install safetensors huggingface_hub
python extract_vision_weights.py --output-dir ./
```
This downloads only the two relevant shards (~900 MB), filters to the `vision_tower.*` and `mm_projector.*` keys (335 total), and saves a single `kimi_k25_vision.safetensors`.
See `extract_vision_weights.py` for the full script.
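Its core logic amounts to roughly the following (a condensed sketch, not the script itself; the shard glob patterns assume the source repo's standard `*-000NN-of-*` naming, and torch is used only as the safetensors backend so BF16 survives the round trip):

```python
import glob
import os

from huggingface_hub import snapshot_download
from safetensors import safe_open
from safetensors.torch import save_file

# Fetch only the two shards that contain the vision tensors.
local_dir = snapshot_download(
    "moonshotai/Kimi-K2.5",
    allow_patterns=["*-00063-of-*.safetensors", "*-00064-of-*.safetensors"],
)

tensors = {}
for shard in sorted(glob.glob(os.path.join(local_dir, "*.safetensors"))):
    with safe_open(shard, framework="pt") as f:
        for key in f.keys():
            # Keep only the vision encoder and projector weights.
            if key.startswith(("vision_tower.", "mm_projector.")):
                tensors[key] = f.get_tensor(key)  # BF16 preserved, no cast

save_file(tensors, "kimi_k25_vision.safetensors")
print(f"saved {len(tensors)} tensors")  # expected: 335
```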
## License
Same license as the source model: Kimi-K2.5 License