NextCoder-32B-FP8
This is an FP8 quantized version of microsoft/NextCoder-32B for efficient inference on NVIDIA Ada Lovelace and newer GPUs.
Model Description
FP8 (8-bit floating point) quantization of NextCoder-32B, optimized for fast code generation with minimal quality loss.
Quantization Details
| Property | Value |
|---|---|
| Original Model | microsoft/NextCoder-32B |
| Quantization Method | FP8 (E4M3) via llm-compressor |
| Model Size | ~64GB (sharded safetensors files) |
| Target Hardware | NVIDIA Ada Lovelace (RTX 40xx, RTX 5000 Ada, etc.) |
| Quantization Date | 2025-11-23 |
| Quantization Time | 213.8 minutes |
| Hardware Used | NVIDIA RTX 5000 Ada Generation (31.5 GB) |
Quantization Infrastructure
Quantized on professional hardware to ensure quality and reliability:
- CPUs: Dual Intel Xeon Max 9480 (224 threads, 128GB HBM2e)
- GPU: NVIDIA RTX 5000 Ada Generation (32GB VRAM) with native FP8 support
- Memory: 256GB DDR5 + 128GB HBM2e = 384GB total
- Software: Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13 | llm-compressor
Usage
Loading the Model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model; the FP8 quantization settings saved with the checkpoint
# are applied automatically (this may require the compressed-tensors package)
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/NextCoder-32B-FP8",
    torch_dtype="auto",
    device_map="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-32B-FP8")
# Generate code
messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
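Serving with vLLM
FP8 checkpoints produced with llm-compressor can typically also be loaded by vLLM, which runs the FP8 kernels natively on supported GPUs. The following is a minimal sketch, not tested against this exact checkpoint; the sampling settings and raw prompt are illustrative (apply the chat template for best results):
from vllm import LLM, SamplingParams

# Load the quantized checkpoint; vLLM reads the quantization config from the repo
llm = LLM(model="TevunahAi/NextCoder-32B-FP8")

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(
    ["Write a Python function to calculate fibonacci numbers"],
    params,
)
print(outputs[0].outputs[0].text)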
Requirements
pip install "torch>=2.1.0"  # FP8 support requires PyTorch 2.1+ (quotes keep the shell from treating >= as a redirection)
pip install "transformers>=4.40.0"
pip install accelerate
System Requirements:
- PyTorch 2.1 or newer with CUDA support
- NVIDIA GPU with FP8 support (Ada Lovelace or newer: RTX 40xx series, RTX 5000 Ada, H100, etc.)
- CUDA 11.8 or newer
- ~64GB VRAM for inference (or use multi-GPU setup with device_map="auto")
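If a single card cannot hold the model, device_map="auto" will shard it across the available GPUs, with optional CPU overflow. A minimal sketch; the per-device memory budgets below are illustrative placeholders, not measured values:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Shard the checkpoint across two GPUs, capping how much each device may use;
# "cpu" acts as an overflow pool for anything that does not fit on the GPUs
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/NextCoder-32B-FP8",
    torch_dtype="auto",
    device_map="auto",
    max_memory={0: "30GiB", 1: "30GiB", "cpu": "64GiB"},
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-32B-FP8")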
Benefits of FP8
- ~50% memory reduction compared to FP16/BF16 (rough arithmetic below)
- Faster inference on Ada Lovelace and Hopper GPUs with native FP8 Tensor Cores
- Minimal quality loss compared to INT8 or INT4 quantization
- Native hardware acceleration on modern NVIDIA GPUs
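As a back-of-envelope check on the memory claim, weight footprint scales with bytes per parameter. The estimate below covers weights only (no activations or KV cache) and assumes roughly 32.5B parameters:
# Approximate weight memory for a ~32B-parameter model
params = 32.5e9                    # approximate parameter count
bf16_gib = params * 2 / 1024**3    # 2 bytes per weight
fp8_gib = params * 1 / 1024**3     # 1 byte per weight
print(f"BF16 weights: ~{bf16_gib:.0f} GiB")  # ~61 GiB
print(f"FP8 weights:  ~{fp8_gib:.0f} GiB")   # ~30 GiB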
Model Files
This model is sharded into multiple safetensors files. All files are required for inference.
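To make sure every shard is present locally before loading, the whole repository can be pre-downloaded; a minimal sketch with huggingface_hub (the local path is just an example):
from huggingface_hub import snapshot_download

# Fetch all shards and config files into a local directory
path = snapshot_download(
    "TevunahAi/NextCoder-32B-FP8",
    local_dir="./NextCoder-32B-FP8",  # example location
)
print(path)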
Original Model
This quantization is based on microsoft/NextCoder-32B by Microsoft. Please refer to the original model card for:
- Training details
- Intended use cases
- Capabilities and limitations
- Evaluation results
- Ethical considerations
Quantization Recipe
This model was quantized using llm-compressor with the FP8 E4M3 format. The quantization recipe is included in recipe.yaml.
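The authoritative settings are in recipe.yaml; purely as an illustration, an FP8 E4M3 recipe with llm-compressor usually looks like the sketch below (the scheme, ignore list, and output path are assumptions, not a copy of the shipped recipe):
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Quantize all Linear layers to FP8 (E4M3) while keeping the output head in
# higher precision; a dynamic scheme needs no calibration data, a static one would
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(
    model="microsoft/NextCoder-32B",
    recipe=recipe,
    output_dir="NextCoder-32B-FP8",
)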
License
This model inherits the MIT license from the original NextCoder-32B model.
Citation
If you use this model, please cite the original NextCoder work:
@misc{nextcoder2024,
  title={NextCoder: Next-Generation Code LLM},
  author={Microsoft},
  year={2024},
  url={https://huggingface.co/microsoft/NextCoder-32B}
}
Acknowledgments
- Original model by Microsoft
- Quantization performed using Neural Magic's llm-compressor
- Quantized by TevunahAi