NextCoder-32B-FP8

This is an FP8 quantized version of microsoft/NextCoder-32B for efficient inference on NVIDIA Ada Lovelace and newer GPUs.

Model Description

FP8 (8-bit floating point) quantization of NextCoder-32B, optimized for fast code generation with minimal quality loss.

Quantization Details

  • Original Model: microsoft/NextCoder-32B
  • Quantization Method: FP8 (E4M3) via llm-compressor
  • Model Size: ~64GB (sharded safetensors files)
  • Target Hardware: NVIDIA Ada Lovelace (RTX 40xx, RTX 5000 Ada, etc.)
  • Quantization Date: 2025-11-23
  • Quantization Time: 213.8 minutes
  • Hardware Used: NVIDIA RTX 5000 Ada Generation (31.5 GB)

Quantization Infrastructure

Quantized on professional hardware to ensure quality and reliability:

  • CPUs: Dual Intel Xeon Max 9480 (224 threads, 128GB HBM2e)
  • GPU: NVIDIA RTX 5000 Ada Generation (32GB VRAM) with native FP8 support
  • Memory: 256GB DDR5 + 128GB HBM2e = 384GB total
  • Software: Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13 | llm-compressor

Usage

Loading the Model

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model; the FP8 quantization config stored with the checkpoint
# is detected automatically, so the dtype does not need to be forced to FP8.
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/NextCoder-32B-FP8",
    torch_dtype="auto",
    device_map="auto",
    low_cpu_mem_usage=True,
)

tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-32B-FP8")

# Generate code
messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs, 
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Requirements

pip install "torch>=2.1.0"          # FP8 support requires PyTorch 2.1+
pip install "transformers>=4.40.0"
pip install accelerate
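
Depending on your transformers version, loading a checkpoint produced by llm-compressor may also require the compressed-tensors package; if loading fails with a missing-dependency error, install it as well:

pip install compressed-tensors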

System Requirements:

  • PyTorch 2.1 or newer with CUDA support
  • NVIDIA GPU with FP8 support (Ada Lovelace or newer: RTX 40xx series, RTX 5000 Ada, H100, etc.); a quick capability check is sketched below this list
  • CUDA 11.8 or newer
  • ~64GB VRAM for inference (or use multi-GPU setup with device_map="auto")
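
As a quick sanity check on the GPU requirement above, here is a minimal sketch (not part of the original card) that reports whether each visible GPU has native FP8 Tensor Cores, assuming compute capability 8.9 (Ada Lovelace) or newer implies FP8 support:

import torch

# Rough check: native FP8 Tensor Cores are available on Ada Lovelace
# (compute capability 8.9) and newer architectures such as Hopper (9.0).
if not torch.cuda.is_available():
    print("No CUDA GPU detected.")
else:
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        name = torch.cuda.get_device_name(i)
        has_fp8 = (major, minor) >= (8, 9)
        print(f"GPU {i}: {name} (sm_{major}{minor}) - FP8 support: {has_fp8}")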

Benefits of FP8

  • ~50% memory reduction compared to FP16/BF16
  • Faster inference on Ada Lovelace and Hopper GPUs with native FP8 Tensor Cores
  • Minimal quality loss compared to INT8 or INT4 quantization
  • Native hardware acceleration on modern NVIDIA GPUs

Model Files

This model is sharded into multiple safetensors files. All files are required for inference.
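
If you want to fetch every shard up front (for example, before moving to an offline machine), a small sketch using huggingface_hub, assuming the default cache location is acceptable:

from huggingface_hub import snapshot_download

# Download all sharded safetensors files (plus config and tokenizer files)
# into the local Hugging Face cache and print the resulting path.
local_dir = snapshot_download("TevunahAi/NextCoder-32B-FP8")
print(local_dir)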

Original Model

This quantization is based on microsoft/NextCoder-32B by Microsoft. Please refer to the original model card for:

  • Training details
  • Intended use cases
  • Capabilities and limitations
  • Evaluation results
  • Ethical considerations

Quantization Recipe

This model was quantized using llm-compressor with the FP8 E4M3 format. The quantization recipe is included in recipe.yaml.
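
For reference, a minimal sketch of what an FP8 (E4M3) quantization run with llm-compressor typically looks like. The authoritative settings are the ones shipped in recipe.yaml; the modifier arguments below (targets, ignore list, scheme name) are illustrative assumptions, not the exact recipe used for this model:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Illustrative FP8 recipe: quantize all Linear layers to FP8 (E4M3),
# leaving the output head in higher precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(
    model="microsoft/NextCoder-32B",
    recipe=recipe,
    output_dir="NextCoder-32B-FP8",
)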

License

This model inherits the MIT license from the original NextCoder-32B model.

Citation

If you use this model, please cite the original NextCoder work:

@misc{nextcoder2024,
  title={NextCoder: Next-Generation Code LLM},
  author={Microsoft},
  year={2024},
  url={https://huggingface.co/microsoft/NextCoder-32B}
}

Acknowledgments

  • Original model by Microsoft
  • Quantization performed using Neural Magic's llm-compressor
  • Quantized by TevunahAi