πŸš€ ROCKET: Rapid Optimization via Calibration-guided Knapsack Enhanced Truncation for Efficient Model Compression

[ROCKET Architecture diagram]

In a quiet corner of the AI research lab, a cartoon rocket stood on the launchpad, bright red, cheerful, and boldly labeled β€œLLM.” At the control console sat a scientist, fingers hovering over a single, enormous red button marked β€œSolve MCKP.” With a deep breath and a flicker of hope, they pressed it. The rocket roared to life. Flames erupted, scattering clouds of sparse matrices like confetti made of zeros. As the LLM blasted into the stratosphere of efficient inference, it left behind on the pad a humble knapsack overflowing not with gold, but with perfectly balanced (rank, sparsity) pairs: the optimal solutions to the Multiple-Choice Knapsack Problem, handpicked for model compression. Up it soared, lighter, faster, smarter, carrying only what truly mattered.

Model Description

ROCKET (Rapid Optimization via Calibration-guided Knapsack Enhanced Truncation) is a novel training-free model compression method that achieves state-of-the-art performance by combining two key innovations:

  • Multi-Choice Knapsack Budget Allocation: Formulates layer-wise compression as an optimization problem, selecting the optimal (rank, sparsity) configuration per layer to minimize total reconstruction error under a global parameter budget
  • Single-Step Sparse Factorization: Uses calibration-guided structured sparsification with closed-form dictionary updates via least squares bypassing iterative optimization, sparse coding, or backpropagation entirely

The approach operates in whitened activation space, applies importance-weighted sparsification, and recovers compressed weights as a product of two factors compatible with standard dense linear algebra.
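
To make the factorized form concrete, here is a minimal sketch of a linear layer whose weight is stored as the product of a small dense dictionary factor and a structured-sparse coefficient factor. This is not ROCKET's implementation; the shapes, the rank, and the magnitude-based per-row selection below are illustrative assumptions only.

import torch

# Illustrative sketch: W β‰ˆ C @ D, where D is a small dense dictionary (rank r)
# and C holds structured-sparse coefficients, so each output row activates only
# a subset of the r basis vectors (a union-of-subspaces model).
class FactorizedLinear(torch.nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int, keep_per_row: int):
        super().__init__()
        self.D = torch.nn.Parameter(torch.randn(rank, d_in) / d_in ** 0.5)  # dictionary factor
        C = torch.randn(d_out, rank) / rank ** 0.5                          # coefficient factor
        # Keep only the keep_per_row largest-magnitude coefficients per output row
        # (a stand-in for ROCKET's importance-weighted sparsification).
        thresh = C.abs().topk(keep_per_row, dim=1).values[:, -1:]
        self.C = torch.nn.Parameter(C * (C.abs() >= thresh))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Two dense matmuls; the structured sparsity in C can additionally be
        # exploited by a sparse kernel (e.g. MACKO) at inference time.
        return (x @ self.D.t()) @ self.C.t()

layer = FactorizedLinear(d_in=4096, d_out=4096, rank=1024, keep_per_row=256)
print(layer(torch.randn(2, 4096)).shape)  # torch.Size([2, 4096])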

Key Features

  • πŸš€ Training-Free Compression: No fine-tuning required; compresses LLMs in minutes using only a small calibration set (~256 samples)
  • 🎯 Optimal Budget Allocation: Dynamic programming solves layer-wise compression allocation to preserve performance where it matters most
  • ⚑ Single-Step Factorization: Replaces expensive K-SVD/OMP with eigen decomposition + closed-form least squares 96Γ— faster than baselines
  • πŸ” Union-of-Subspaces Flexibility: Each output dimension activates a distinct subset of basis vectors, overcoming rigid low-rank constraints
  • πŸ”Œ Hardware-Compatible: Produces structured sparse factorizations that merge seamlessly during inference

πŸ”₯ Performance: Qwen3-14B β†’ 8B Compression vs. Native Qwen3-8B

| Method | Compression | State | PIQA 🧠 | HellaSwag πŸ”„ | LAMBADA πŸ¦™ | ARC-e πŸ”¬ | ARC-c 🧩 | SciQ πŸ“š | Race 🏁 | MMLU πŸŽ“ | Avg. Acc πŸ“Š | LAMBADA PPL ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-14B (dense) | – | baseline | 79.86 | 78.85 | 67.88 | 82.82 | 60.23 | 96.50 | 43.25 | 77.20 | 73.32 | 3.7 |
| Qwen3-8B (dense) | – | baseline | 77.70 | 74.90 | 64.10 | 80.70 | 56.70 | 95.70 | 40.90 | 73.00 | 70.46 | 4.6 |
| ROCKET-Qwen3-8B | 40% (14Bβ†’8B) | training-free | 72.68 | 62.63 | 70.26 | 67.76 | 44.19 | 91.20 | 39.80 | 59.99 | 63.56 | 3.8 |
| ROCKET-Qwen3-8B (healed) ✨ | 40% | + 30M tokens light fine-tune | 78.45 πŸ† | 73.54 | 64.86 πŸ† | 76.94 | 51.45 | 95.10 | 40.67 | 66.69 | 68.46 | 4.6 |

Key Takeaways:

  • βœ… Training-free ROCKET retains ~90% of the native 8B model's accuracy (63.56 vs 70.46) with zero fine-tuning
  • ✨ With minimal healing (30M tokens, fixed sparsity), ROCKET reaches 97.2% of native 8B performanceβ€”nearly matching a model trained from scratch
  • πŸ“‰ Perplexity after healing (4.6) is identical to the native 8B model
  • πŸ’‘ This demonstrates a practical alternative to multi-size training: train one large model, compress to any target size with ROCKET

🌍 Environmental Impact: ROCKET consumes 100Γ— less energy and produces 23Γ— lower COβ‚‚ emissions than iterative dictionary learning baselines.

ROCKET

Installation

We highly recommend using the following Docker image to ensure reproducibility.

pytorch/pytorch:2.7.1-cuda12.6-cudnn9-devel 

Then run

pip install -e .

Running

We provide multiple console entrypoints. To run the full pipeline, you can simply run:

rocket-run-pipeline --config "./rocket/config/default.yaml"

You can use the sample config file and modify it according to your requirements. The other entrypoints are:

rocket-profile-layers --config CONFIG        # profiling only
rocket-compress --config CONFIG              # compression only
rocket-evaluate --config CONFIG              # evaluation only
rocket-gather-activations --config CONFIG    # prepare calibration data
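
If you prefer to drive the stages from Python instead of the shell, the documented entrypoints can be chained with subprocess. A minimal sketch follows; the stage order (gather calibration activations, profile, compress, evaluate) is our assumption about the intended sequence, and rocket-run-pipeline already covers all of this in a single call.

import subprocess

CONFIG = "./rocket/config/default.yaml"  # path to your (possibly modified) config

for stage in (
    "rocket-gather-activations",  # prepare calibration data
    "rocket-profile-layers",      # per-layer profiling
    "rocket-compress",            # run compression
    "rocket-evaluate",            # evaluate the compressed model
):
    subprocess.run([stage, "--config", CONFIG], check=True)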

Optimized Inference

Note that we provide, in the extra folder, a modeling file for running the optimized version, which includes the MACKO implementation and fused layers. To use the optimized version after compression finishes, load the model from the modeling file and call optimize():

import torch
from transformers import AutoTokenizer
from modeling_llama_svdllm_opt import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained("MODEL_PATH", device_map="cuda", torch_dtype="float16", compression_path="./cr_llama.json")
tokenizer = AutoTokenizer.from_pretrained("MODEL_PATH")
model.optimize()  # fuse the compressed layers and enable the optimized (MACKO) path
model = torch.compile(model, mode="reduce-overhead", fullgraph=True)
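
As a quick sanity check after optimization, you can run a short generation with the standard transformers API (the prompt and generation settings below are just an example):

prompt = "Explain model compression in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))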

Citation

If you use ROCKET in your research, please cite our paper:

@article{ali2026rocket0,
  title   = {ROCKET: Rapid Optimization via Calibration-guided Knapsack Enhanced Truncation for Efficient Model Compression},
  author  = {Ammar Ali and Baher Mohammad and Denis Makhov and Dmitriy Shopkhoev and Magauiya Zhussip and Stamatios Lefkimmiatis},
  year    = {2026},
  journal = {arXiv preprint arXiv:2602.11008}
}

Credit for the inference optimization goes to:

@article{macko2025macko0,
  title   = {MACKO: Sparse Matrix-Vector Multiplication for Low Sparsity},
  author  = {VladimΓ­r Macko and VladimΓ­r BoΕΎa},
  year    = {2025},
  journal = {arXiv preprint arXiv:2511.13061}
}