NVIDIA Nemotron-3-Nano-30B-A3B – IQ4_XS GGUF (RTX 2080 Ti Optimized)

Production-validated IQ4_XS quantization of NVIDIA Nemotron-3-Nano-30B-A3B, benchmarked and tuned for GPUs with roughly 11 GB of VRAM (RTX 2080 Ti, RTX 3060, and similar).

GGUF sourced from bartowski/nvidia_Nemotron-3-Nano-30B-A3B-GGUF. This repo adds real-world benchmarks, production configs, and deployment guides for constrained-VRAM setups.

Why This Repo?

bartowski provides the quants; we provide the deployment playbook for running this model on a single consumer GPU with real benchmark data, not projections.

  • ✅ Validated tool calling (function calls parse correctly)
  • ✅ Validated reasoning (chain-of-thought with <think> tags)
  • ✅ Production Ollama Modelfile included
  • ✅ llama-server launch configs with measured VRAM headroom
  • ✅ Tested across 168 messages in a multi-agent war room

Model Details

| Property | Value |
|---|---|
| Base Model | NVIDIA Nemotron-3-Nano-30B-A3B |
| Architecture | MoE Hybrid: 23 Mamba-2 + 6 Attention + 128 experts + 1 shared |
| Total Parameters | 31.6B |
| Active Parameters | ~3.2–3.6B (Mixture of Experts) |
| Quantization | IQ4_XS (imatrix) |
| File Size | 16.8 GB |
| Format | GGUF |
| Prompt Format | ChatML (<\|im_start\|>) |
| Context Window | Up to 5120 tested stable on 11 GB (hardware dependent) |
| License | NVIDIA Open Model License |
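
As a quick sanity check, the file size lines up with the quant's nominal precision. A back-of-envelope calculation (assuming the 16.8 GB figure is decimal gigabytes) gives roughly 4.25 bits per weight, which is what IQ4_XS targets:

# Back-of-envelope bits-per-weight check. Assumes "16.8 GB" is decimal gigabytes;
# headers, embeddings, and mixed-precision tensors shift the real figure slightly.
file_size_bytes = 16.8e9
total_params = 31.6e9          # total parameters (not the ~3.2-3.6B active)

bits_per_weight = file_size_bytes * 8 / total_params
print(f"~{bits_per_weight:.2f} bits per weight")   # ~4.25, consistent with IQ4_XS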

Benchmarks (RTX 2080 Ti, 11GB VRAM)

All benchmarks on: Ryzen 7 3800X (8c/16t) · 32GB DDR4-3600 · RTX 2080 Ti 11GB · Ubuntu 24.04

Throughput

| Backend | GPU Layers | KV Cache | Context | Flash Attn | tok/s (single) | tok/s (sustained) |
|---|---|---|---|---|---|---|
| Ollama (Q4_K_M) | 28/52 (est.) | default | 4096 | ❌ | 11.04 | – |
| Ollama (IQ4_XS) | 28/52 | default | 4096 | ❌ | 14.49 | – |
| llama-server (IQ4_XS) | 28 | q8_0 | 4096 | ❌ | 22.85 | 19.45–19.69 |
| llama-server (IQ4_XS) | 28 | q8_0 | 4096 | ✅ | 26.25 | 26.21–26.70 |

VRAM Usage

| Config | VRAM Used | Free | Headroom |
|---|---|---|---|
| Ollama Q4_K_M | 10,449 MiB | 815 MiB | ⚠️ Tight |
| Ollama IQ4_XS | 10,079 MiB | 1,185 MiB | ✅ Good |
| llama-server ctx 2048 (×1 slot) | 10,631 MiB | 633 MiB | ✅ Safe |
| llama-server ctx 3072 (×1 slot) | 10,627 MiB | 637 MiB | ✅ Safe |
| llama-server ctx 4096 (×1 slot) | 10,639 MiB | 625 MiB | ✅ Production |
| llama-server ctx 5120 (×1 slot) | 10,641 MiB | 623 MiB | ✅ Safe |
| llama-server ctx 6144 (×1 slot) | 10,947 MiB | 317 MiB | ⚠️ Tight |
| llama-server ctx 8192 (×1 slot) | 11,085 MiB | 179 MiB | ❌ OOM risk |
| llama-server ctx 4096 (×4 slots) | ~10,953 MiB | 311 MiB | ⚠️ Dangerous |

Why does context barely affect VRAM? Only 6 of Nemotron's 52 layers are attention layers that need a KV cache; the remaining Mamba-2 and MoE layers keep fixed-size recurrent state (or none at all), so VRAM stays nearly flat with context until ~6K, where the compute buffers jump.
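
To make that scaling argument concrete, here is a rough KV-cache estimator. The head counts below are illustrative placeholders, not values from the Nemotron config; the point is only that the cache grows linearly with context for the 6 attention layers while the Mamba-2 state stays fixed:

# Rough KV-cache size sketch for a hybrid model. n_kv_heads and head_dim are
# HYPOTHETICAL example values, not Nemotron's actual dimensions.
def kv_cache_mib(ctx, n_attn_layers=6, n_kv_heads=8, head_dim=128, bytes_per_elem=1):
    # K and V per attention layer per token; q8_0 cache is ~1 byte/element plus small block overhead
    return 2 * n_attn_layers * ctx * n_kv_heads * head_dim * bytes_per_elem / 1024**2

for ctx in (2048, 4096, 8192):
    print(f"ctx {ctx}: ~{kv_cache_mib(ctx):.0f} MiB of KV cache")
# Even at 8K context this is only tens of MiB, so the VRAM jump near 6K in the table
# above comes from compute-buffer resizing, not the KV cache itself.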

IQ4_XS vs Q4_K_M (Same Model, Ollama)

| Metric | Q4_K_M | IQ4_XS | Delta |
|---|---|---|---|
| Throughput | 11.04 tok/s | 14.49 tok/s | +31% |
| VRAM | 10,449 MiB | 10,079 MiB | −370 MiB |
| Cold Start | 42.7s | 23.7s | −44% |
| Disk Size | 24 GB | 16.8 GB | −7.2 GB |
| Tool Calling | ✅ | ✅ | – |
| Reasoning | ✅ | ✅ | – |

Quality Validation

  • Math reasoning: "What's a 15% tip on $28.50?" → Correct ($4.275, properly rounded to $4.28; see the check below)
  • Tool calling: OpenAI-format function calls parse correctly
  • Multi-turn: Sustained coherence over 168-message war room session
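
For reference, the tip arithmetic the model is checked against, done with Decimal to keep the rounding exact:

# 15% tip on $28.50: the exact value is $4.275, which rounds up to $4.28.
from decimal import Decimal, ROUND_HALF_UP

bill = Decimal("28.50")
tip = bill * Decimal("0.15")
print(tip)                                                    # 4.2750
print(tip.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))  # 4.28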

Recommended Configurations

Ollama (Easy, Recommended for Most Users)

Create a Modelfile:

FROM ./nvidia_Nemotron-3-Nano-30B-A3B-IQ4_XS.gguf

PARAMETER temperature 0.3
PARAMETER top_k 40
PARAMETER top_p 0.85
PARAMETER num_ctx 4096
PARAMETER num_predict 512
PARAMETER repeat_penalty 1.1

Then build and run it:

ollama create nemotron-prod -f Modelfile
ollama run nemotron-prod

llama-server (Maximum Performance, +84% over Ollama)

# Production config (recommended for 11GB VRAM)
llama-server \
  --model nvidia_Nemotron-3-Nano-30B-A3B-IQ4_XS.gguf \
  --n-gpu-layers 28 \
  --cache-type-k q8_0 \
  --ctx-size 4096 \
  --parallel 1 \
  --flash-attn on \
  --port 8081

Flash Attention: Build llama.cpp with GGML_CUDA_FA=ON and use --flash-attn on (not auto; auto may fail to enable on split GPU/CPU models). FA gives +35% sustained throughput on this model.

28 GPU layers is the stable ceiling for 11GB VRAM. 29 loads but OOMs during generation. 30+ fails at load.
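
To reproduce the throughput numbers on your own hardware, a minimal sketch is shown below. It assumes the llama-server launch above (port 8081) and times a single non-streamed request, so prompt processing is folded in and the result will read slightly lower than the server's generation-only tok/s:

import time
from openai import OpenAI

# Point the OpenAI client at the local llama-server started above.
client = OpenAI(base_url="http://localhost:8081/v1", api_key="none")

start = time.perf_counter()
response = client.chat.completions.create(
    model="nemotron",
    messages=[{"role": "user", "content": "Explain KV cache quantization in about 200 words."}],
    max_tokens=400,
)
elapsed = time.perf_counter() - start

generated = response.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} tok/s (incl. prompt processing)")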

Context window scaling (measured on RTX 2080 Ti, single slot):

  • --ctx-size 2048 → 633 MiB free ✅
  • --ctx-size 3072 → 637 MiB free ✅
  • --ctx-size 4096 → 625 MiB free ✅ ← Production sweet spot
  • --ctx-size 5120 → 623 MiB free ✅ (max safe)
  • --ctx-size 6144 → 317 MiB free ⚠️ (too tight)
  • --ctx-size 8192 → 179 MiB free ❌ (OOM risk)

Note: The Mamba-2 hybrid architecture means context barely affects VRAM up to ~5K. The cliff happens around 6K where compute buffers resize.
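
To check the headroom on your own card after the model loads, you can query NVML directly. A small sketch using the nvidia-ml-py package, assuming GPU index 0 (the 500 MiB warning threshold is an arbitrary cushion, not a measured limit):

import pynvml  # pip install nvidia-ml-py

# Report used/free VRAM so you can compare against the headroom table above.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # GPU 0 assumed
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"used: {mem.used / 1024**2:.0f} MiB, free: {mem.free / 1024**2:.0f} MiB")
if mem.free / 1024**2 < 500:
    print("warning: under 500 MiB free -- consider a smaller --ctx-size")
pynvml.nvmlShutdown()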

systemd Service (Auto-start on Boot)

[Unit]
Description=Nemotron IQ4_XS llama-server (CUDA)
After=network.target

[Service]
Type=simple
User=your-user
ExecStart=/opt/llama-server/llama-server \
  --model /path/to/nvidia_Nemotron-3-Nano-30B-A3B-IQ4_XS.gguf \
  --n-gpu-layers 28 \
  --cache-type-k q8_0 \
  --ctx-size 4096 \
  --parallel 1 \
  --flash-attn on \
  --port 8081 \
  --host 127.0.0.1
Restart=on-failure
RestartSec=5
Environment=CUDA_VISIBLE_DEVICES=0

[Install]
WantedBy=multi-user.target
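
After installing and enabling the unit, a simple readiness probe confirms the server is up before you point clients at it. This sketch assumes a recent llama-server build, which exposes a /health endpoint on the same port:

import time
import urllib.request

# Poll llama-server's /health endpoint (port 8081, as configured above)
# until the model finishes loading or we give up.
URL = "http://127.0.0.1:8081/health"

for _ in range(60):
    try:
        with urllib.request.urlopen(URL, timeout=2) as resp:
            if resp.status == 200:
                print("server ready")
                break
    except OSError:
        pass  # not listening yet, or still loading the model
    time.sleep(2)
else:
    print("server did not become ready in time")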

OpenAI-Compatible API

Both Ollama and llama-server expose /v1/chat/completions. Drop-in compatible with any OpenAI SDK client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8081/v1", api_key="none")
response = client.chat.completions.create(
    model="nemotron",
    messages=[{"role": "user", "content": "Hello!"}]
)
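
Tool calling uses the standard OpenAI tools format. A minimal sketch follows; the get_weather tool is a made-up example purely to exercise function-call parsing, and depending on your llama-server build you may need to add --jinja to the launch command so the chat template's tool support is active:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8081/v1", api_key="none")

# Made-up tool definition used only to verify that function calls parse as structured JSON.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="nemotron",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)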

GPU Layer Guide

| VRAM | GPU Layers | Est. tok/s | Notes |
|---|---|---|---|
| 8 GB | 20–22 | ~12 | Tight, reduce context |
| 11 GB | 28 | 22–27 | Sweet spot (this config) |
| 16 GB | 35–40 | ~28–32 | Comfortable headroom |
| 22 GB | 52 (all) | ~35–40 | Full GPU offload 🚀 |
| 24 GB | 52 (all) | ~35–40 | Full offload + large context |

22GB upgrade note: With 22GB VRAM, all 52 layers fit on the GPU with zero CPU offload and maximum throughput, putting expected performance in the ~35–40 tok/s range shown above.
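
The layer counts above roughly follow from the file size: 16.8 GB over 52 layers is about 0.32 GB per offloaded layer, plus a fixed overhead for the CUDA context, KV cache, and compute buffers (around 1.7 GB in the RTX 2080 Ti measurements above). A rough estimator, with that overhead as an assumption:

# Rough --n-gpu-layers estimator. OVERHEAD_GIB is inferred from the 11GB measurements
# above (CUDA context + KV cache + compute buffers), not a universal constant.
FILE_SIZE_GIB = 16.8
TOTAL_LAYERS = 52
OVERHEAD_GIB = 1.7

def estimate_gpu_layers(vram_gib):
    per_layer = FILE_SIZE_GIB / TOTAL_LAYERS              # ~0.32 GiB per layer
    layers = int((vram_gib - OVERHEAD_GIB) / per_layer)
    return max(0, min(layers, TOTAL_LAYERS))

for vram in (8, 11, 16, 22, 24):
    print(f"{vram} GB -> ~{estimate_gpu_layers(vram)} layers")

This reproduces the 28-layer figure for 11 GB; the table stays more conservative on larger cards to leave room for bigger contexts.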

File Listing

| File | Size | Description |
|---|---|---|
| nvidia_Nemotron-3-Nano-30B-A3B-IQ4_XS.gguf | 16.8 GB | IQ4_XS quantized model (imatrix) |
| Modelfile | <1 KB | Production Ollama Modelfile |
| README.md | – | This file |

Quantization Source

GGUF quantized by bartowski using llama.cpp release b7423 with imatrix calibration data. See bartowski/nvidia_Nemotron-3-Nano-30B-A3B-GGUF for all available quant sizes.

Credits

  • Model: NVIDIA – Nemotron-3-Nano-30B-A3B
  • Quantization: bartowski – IQ4_XS imatrix GGUF
  • Benchmarking & Deployment: Tinker-Stack – Production validation on RTX 2080 Ti with Disclaw multi-agent war room