Kimi-K2.5 Vision Weights

Vision-only weights extracted from moonshotai/Kimi-K2.5 for use with MLX-based inference.

Contents

  • kimi_k25_vision.safetensors: 335 tensors, ~899 MB (BF16; see the verification sketch below)
    • vision_tower.*: 329 tensors (MoonViT encoder, 27 layers)
    • mm_projector.*: 6 tensors (PatchMergerMLP projector)
  • config.json: vision config + projector metadata
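
To sanity-check the tensor counts above without materializing the BF16 tensors, the file can be opened with the safetensors library and the keys grouped by prefix. A minimal sketch, assuming the file sits in the current directory:

from safetensors import safe_open

with safe_open("kimi_k25_vision.safetensors", framework="numpy") as f:
    keys = list(f.keys())
    vision = [k for k in keys if k.startswith("vision_tower.")]
    projector = [k for k in keys if k.startswith("mm_projector.")]
    print(len(vision), len(projector))      # expected: 329 6
    # Shapes can be read lazily, without loading the tensor data:
    for k in projector:
        print(k, f.get_slice(k).get_shape())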

Architecture

  • Vision encoder: MoonViT (27 layers, hidden size 1152, 16 heads, patch_size=14)
  • Patch merger: 2×2 spatial merge + temporal pool (no learned parameters)
  • Projector: LayerNorm(1152) → Linear(4608→4608) → GELU → Linear(4608→7168); see the sketch below
  • Total parameters: ~450M
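
A minimal MLX sketch of the merge + projector path, assuming the patch features arrive ordered so that every four consecutive rows form one 2×2 spatial group (module and parameter names here are illustrative, not the exact checkpoint keys):

import mlx.core as mx
import mlx.nn as nn

class PatchMergerMLP(nn.Module):
    """Illustrative projector: LayerNorm(1152) -> 2x2 merge -> MLP to 7168."""

    def __init__(self, vit_dim: int = 1152, merge: int = 2, text_dim: int = 7168):
        super().__init__()
        merged = vit_dim * merge * merge          # 1152 * 4 = 4608
        self.norm = nn.LayerNorm(vit_dim)
        self.fc1 = nn.Linear(merged, merged)      # 4608 -> 4608
        self.fc2 = nn.Linear(merged, text_dim)    # 4608 -> 7168

    def __call__(self, patches: mx.array) -> mx.array:
        # patches: (num_patches, 1152) features from the MoonViT encoder
        x = self.norm(patches)
        x = x.reshape(-1, 4 * x.shape[-1])        # group 2x2 neighbours -> (num_patches // 4, 4608)
        x = nn.gelu(self.fc1(x))
        return self.fc2(x)                        # (num_patches // 4, 7168)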

The vision encoder is identical to the one in Kimi-VL-A3B. The only difference is the projector output dimension (7168 for K2.5 vs 2048 for A3B), which matches K2.5's text backbone hidden size.

Usage

These weights are designed to be loaded alongside the text-only mlx-community/Kimi-K2.5 model to enable vision-language capabilities.
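
One way to get the weights into MLX is to load them into a flat name-to-array dictionary and split them by prefix before populating the corresponding modules. A minimal sketch; how the populated modules are wired into the text model depends on the inference code you use:

import mlx.core as mx
from mlx.utils import tree_unflatten

# mx.load reads safetensors directly and keeps the BF16 dtype.
weights = mx.load("kimi_k25_vision.safetensors")

vision_tower_weights = [
    (k[len("vision_tower."):], v) for k, v in weights.items() if k.startswith("vision_tower.")
]
projector_weights = [
    (k[len("mm_projector."):], v) for k, v in weights.items() if k.startswith("mm_projector.")
]

# Modules built to match these names could then be populated with, e.g.:
#   vision_tower.update(tree_unflatten(vision_tower_weights))
#   projector.update(tree_unflatten(projector_weights))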

The vision encoder processes images into (N, 7168) embedding vectors that replace media placeholder tokens in the text embedding stream.
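
Conceptually, the splice walks the token sequence and substitutes the projected image embeddings at the placeholder positions. A minimal sketch; media_token_id is a hypothetical placeholder id that in practice comes from the K2.5 tokenizer/processor config:

import mlx.core as mx

def splice_image_embeddings(text_embeds, token_ids, image_embeds, media_token_id):
    # text_embeds:  (seq_len, 7168) embeddings from the text backbone
    # image_embeds: (N, 7168) outputs of the vision encoder + projector
    # media_token_id: hypothetical placeholder id (from the tokenizer config)
    rows, img_idx = [], 0
    for pos, tok in enumerate(token_ids):
        if tok == media_token_id:
            rows.append(image_embeds[img_idx])    # substitute the next image embedding
            img_idx += 1
        else:
            rows.append(text_embeds[pos])
    assert img_idx == image_embeds.shape[0], "placeholder count must match image embeddings"
    return mx.stack(rows)                          # (seq_len, 7168)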

Reproduction

Extracted from shards 63+64 of moonshotai/Kimi-K2.5 (public, not gated). The weights are unmodified; the original BF16 precision is preserved.

To reproduce from scratch:

pip install safetensors huggingface_hub
python extract_vision_weights.py --output-dir ./

This downloads only the two relevant shards (~900MB), filters to vision_tower.* and mm_projector.* keys (335 total), and saves a single kimi_k25_vision.safetensors.
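
The script's logic roughly reduces to the sketch below. Assumptions: the repo uses the standard model.safetensors.index.json shard index, and PyTorch is available so safetensors can round-trip BF16 tensors; the actual extract_vision_weights.py may do this differently.

import json
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file, save_file

REPO = "moonshotai/Kimi-K2.5"
PREFIXES = ("vision_tower.", "mm_projector.")

# The index maps every tensor name to its shard, so only the shards that
# actually contain vision weights need to be downloaded.
index_path = hf_hub_download(REPO, "model.safetensors.index.json")
weight_map = json.load(open(index_path))["weight_map"]
shards = sorted({shard for name, shard in weight_map.items() if name.startswith(PREFIXES)})

extracted = {}
for shard in shards:
    tensors = load_file(hf_hub_download(REPO, shard))
    extracted.update({k: v for k, v in tensors.items() if k.startswith(PREFIXES)})

save_file(extracted, "kimi_k25_vision.safetensors")
print(f"saved {len(extracted)} tensors from {len(shards)} shard(s)")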

See extract_vision_weights.py for the full script.

License

Same license as the source model: Kimi-K2.5 License
