NextCoder-32B-FP8

This is an FP8 quantized version of microsoft/NextCoder-32B for efficient inference on NVIDIA Ada Lovelace and newer GPUs.

Model Description

FP8 (8-bit floating point) quantization of NextCoder-32B, optimized for fast code generation with minimal quality loss.

Quantization Details

  • Original Model: microsoft/NextCoder-32B
  • Quantization Method: FP8 (E4M3) via llm-compressor
  • Model Size: ~64GB (sharded safetensors files)
  • Target Hardware: NVIDIA Ada Lovelace (RTX 40xx, RTX 5000 Ada, etc.)
  • Quantization Date: 2025-11-23
  • Quantization Time: 213.8 minutes
  • Hardware Used: NVIDIA RTX 5000 Ada Generation (31.5 GB)

Quantization Infrastructure

Quantized on professional hardware to ensure quality and reliability:

  • CPUs: Dual Intel Xeon Max 9480 (224 threads, 128GB HBM2e)
  • GPU: NVIDIA RTX 5000 Ada Generation (32GB VRAM) with native FP8 support
  • Memory: 256GB DDR5 + 128GB HBM2e = 384GB total
  • Software: Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13 | llm-compressor

Usage

Loading the Model

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model; the FP8 quantization config stored with the checkpoint
# is detected automatically, so the dtype does not need to be forced to FP8.
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/NextCoder-32B-FP8",
    torch_dtype="auto",
    device_map="auto",
    low_cpu_mem_usage=True,
)

tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-32B-FP8")

# Generate code
messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs, 
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Requirements

pip install "torch>=2.1.0"          # FP8 support requires PyTorch 2.1+
pip install "transformers>=4.40.0"
pip install accelerate
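
Depending on your transformers version, loading a checkpoint produced by llm-compressor may also require the compressed-tensors package; if loading fails with a missing-dependency error, install it as well:

pip install compressed-tensors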

System Requirements:

  • PyTorch 2.1 or newer with CUDA support
  • NVIDIA GPU with FP8 support (Ada Lovelace or newer: RTX 40xx series, RTX 5000 Ada, H100, etc.); a quick capability check is sketched below this list
  • CUDA 11.8 or newer
  • ~64GB VRAM for inference (or use multi-GPU setup with device_map="auto")
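
As a quick sanity check on the GPU requirement above, here is a minimal sketch (not part of the original card) that reports whether each visible GPU has native FP8 Tensor Cores, assuming compute capability 8.9 (Ada Lovelace) or newer implies FP8 support:

import torch

# Rough check: native FP8 Tensor Cores are available on Ada Lovelace
# (compute capability 8.9) and newer architectures such as Hopper (9.0).
if not torch.cuda.is_available():
    print("No CUDA GPU detected.")
else:
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        name = torch.cuda.get_device_name(i)
        has_fp8 = (major, minor) >= (8, 9)
        print(f"GPU {i}: {name} (sm_{major}{minor}) - FP8 support: {has_fp8}")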

Benefits of FP8

  • ~50% memory reduction compared to FP16/BF16
  • Faster inference on Ada Lovelace and Hopper GPUs with native FP8 Tensor Cores
  • Minimal quality loss compared to INT8 or INT4 quantization
  • Native hardware acceleration on modern NVIDIA GPUs

Model Files

This model is sharded into multiple safetensors files. All files are required for inference.
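
If you want to fetch every shard up front (for example, before moving to an offline machine), a small sketch using huggingface_hub, assuming the default cache location is acceptable:

from huggingface_hub import snapshot_download

# Download all sharded safetensors files (plus config and tokenizer files)
# into the local Hugging Face cache and print the resulting path.
local_dir = snapshot_download("TevunahAi/NextCoder-32B-FP8")
print(local_dir)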

Original Model

This quantization is based on microsoft/NextCoder-32B by Microsoft. Please refer to the original model card for:

  • Training details
  • Intended use cases
  • Capabilities and limitations
  • Evaluation results
  • Ethical considerations

Quantization Recipe

This model was quantized using llm-compressor with the FP8 E4M3 format. The quantization recipe is included in recipe.yaml.
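
For reference, a minimal sketch of what an FP8 (E4M3) quantization run with llm-compressor typically looks like. The authoritative settings are the ones shipped in recipe.yaml; the modifier arguments below (targets, ignore list, scheme name) are illustrative assumptions, not the exact recipe used for this model:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Illustrative FP8 recipe: quantize all Linear layers to FP8 (E4M3),
# leaving the output head in higher precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(
    model="microsoft/NextCoder-32B",
    recipe=recipe,
    output_dir="NextCoder-32B-FP8",
)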

License

This model inherits the MIT license from the original NextCoder-32B model.

Citation

If you use this model, please cite the original NextCoder work:

@misc{nextcoder2024,
  title={NextCoder: Next-Generation Code LLM},
  author={Microsoft},
  year={2024},
  url={https://huggingface.co/microsoft/NextCoder-32B}
}

Acknowledgments

  • Original model by Microsoft
  • Quantization performed using Neural Magic's llm-compressor
  • Quantized by TevunahAi