# Nanbeige4.1-3B-GGUF

GGUF quantizations of Nanbeige/Nanbeige4.1-3B for use with llama.cpp, Ollama, and other GGUF-compatible tools.

## Available Quantizations

| File | Quant | Size | Description |
|---|---|---|---|
| nanbeige4.1-3b-f16.gguf | F16 | 7.4 GB | Full precision (no quantization) |
| nanbeige4.1-3b-Q8_0.gguf | Q8_0 | 3.9 GB | Best quality, largest quantized size |
| nanbeige4.1-3b-Q6_K.gguf | Q6_K | 3.1 GB | Very high quality |
| nanbeige4.1-3b-Q5_K_M.gguf | Q5_K_M | 2.7 GB | High quality |
| nanbeige4.1-3b-Q4_K_M.gguf | Q4_K_M | 2.3 GB | Good quality; recommended for most users |
| nanbeige4.1-3b-Q3_K_M.gguf | Q3_K_M | 1.9 GB | Medium quality |
| nanbeige4.1-3b-Q2_K.gguf | Q2_K | 1.6 GB | Smallest size, lowest quality (a user report notes it can get stuck in repetition loops) |

## Usage

### Ollama

```bash
# Download a specific quantization (e.g. Q4_K_M)
ollama run hf.co/tantk/Nanbeige4.1-3B-GGUF:Q4_K_M

# Or create from a downloaded file (see the example Modelfile below)
ollama create nanbeige4.1-3b -f Modelfile
```
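
For the second route, a minimal Modelfile sketch follows. It assumes the Q4_K_M file sits next to the Modelfile; the template and parameters simply restate the ChatML format and Recommended Settings listed below:

```
# Hypothetical local path; adjust to the quantization you downloaded
FROM ./nanbeige4.1-3b-Q4_K_M.gguf

# ChatML prompt format used by the base model
TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

# Sampling defaults from Recommended Settings
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.0
PARAMETER stop "<|im_end|>"
```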

### llama.cpp

```bash
llama-cli -m nanbeige4.1-3b-Q4_K_M.gguf -p "Your prompt here" --temp 0.6 --top-p 0.95
```
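
To serve the model over an OpenAI-compatible HTTP API instead of one-off prompts, llama.cpp's bundled llama-server works as well; a minimal sketch (the port and context size here are arbitrary choices, not repo recommendations):

```bash
# Serve on localhost:8080 with a 4K context window;
# -ngl 99 offloads all layers to the GPU (omit for CPU-only)
llama-server -m nanbeige4.1-3b-Q4_K_M.gguf -c 4096 --port 8080 -ngl 99
```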

## Model Details

- Base Model: Nanbeige/Nanbeige4.1-3B
- Architecture: LlamaForCausalLM
- Parameters: 3B (3.93 B total, as reported by llama-bench)
- Context Length: 131,072 tokens
- Chat Template: ChatML (`<|im_start|>` / `<|im_end|>`); raw format shown below
- License: Apache 2.0
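
For tools that take a raw prompt string rather than a chat messages array, the ChatML layout looks like this (the system and user text are placeholders; generation continues after the final `<|im_start|>assistant` line):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```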

## Recommended Settings

- Temperature: 0.6
- Top-p: 0.95
- Repeat penalty: 1.0
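
If you run the llama-server sketch above, these settings can also be supplied per request through the OpenAI-compatible endpoint:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.6,
        "top_p": 0.95
      }'
```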

## Benchmark Results

### Test Hardware

| Component | Spec |
|---|---|
| CPU | AMD Ryzen 5 5600G (6 cores / 12 threads, 3.9 GHz) |
| RAM | 32 GB DDR4-3200 (4x 8 GB Kingston) |
| GPU | NVIDIA GeForce RTX 4070 Ti (12 GB VRAM) |
| OS | Windows 11 Pro |

### CPU Benchmark (llama-bench)

- Backend: CPU
- Threads: 6
- Prompt tokens: 512 (pp512)
- Generation tokens: 128 (tg128)
- Repetitions: 3
- Tool: llama-bench (llama.cpp build 0c1f39a)

| Quant | Size | Params | Prompt (t/s) | Generation (t/s) |
|---|---|---|---|---|
| Q2_K | 1.51 GiB | 3.93 B | 47.14 ± 0.71 | 20.99 ± 1.04 |
| Q3_K_M | 1.87 GiB | 3.93 B | 40.23 ± 1.01 | 17.65 ± 0.25 |
| Q4_K_M | 2.27 GiB | 3.93 B | 67.80 ± 1.14 | 14.35 ± 0.52 |
| Q5_K_M | 2.63 GiB | 3.93 B | 29.68 ± 0.24 | 13.75 ± 0.17 |
| Q6_K | 3.01 GiB | 3.93 B | 33.76 ± 2.41 | 12.28 ± 0.06 |
| Q8_0 | 3.89 GiB | 3.93 B | 45.07 ± 0.41 | 9.07 ± 0.47 |
| F16 | 7.33 GiB | 3.93 B | 31.08 ± 0.75 | 5.22 ± 0.05 |
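
The CPU numbers correspond to an invocation along these lines (a reconstruction from the parameters above, shown for Q4_K_M; repeat per file):

```bash
# 6 threads, 512-token prompt, 128 generated tokens, 3 repetitions;
# -ngl 0 keeps every layer on the CPU even in a CUDA build
llama-bench -m nanbeige4.1-3b-Q4_K_M.gguf -t 6 -p 512 -n 128 -r 3 -ngl 0
```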

### GPU Benchmark (llama-bench)

- Backend: CUDA (RTX 4070 Ti, 100% GPU offload, ngl=99)
- Prompt tokens: 512 (pp512)
- Generation tokens: 128 (tg128)
- Repetitions: 3
- Tool: llama-bench (llama.cpp build 0c1f39a)

| Quant | Size | Params | Prompt (t/s) | Generation (t/s) |
|---|---|---|---|---|
| Q2_K | 1.51 GiB | 3.93 B | 7,904.89 ± 44.44 | 194.47 ± 1.68 |
| Q3_K_M | 1.87 GiB | 3.93 B | 9,233.97 ± 132.75 | 162.72 ± 1.04 |
| Q4_K_M | 2.27 GiB | 3.93 B | 9,977.17 ± 123.83 | 155.27 ± 0.21 |
| Q5_K_M | 2.63 GiB | 3.93 B | 8,060.71 ± 1484.42 | 139.18 ± 0.44 |
| Q6_K | 3.01 GiB | 3.93 B | 7,794.85 ± 1023.17 | 126.49 ± 0.83 |
| Q8_0 | 3.89 GiB | 3.93 B | 6,349.76 ± 698.63 | 102.88 ± 0.32 |
| F16 | 7.33 GiB | 3.93 B | 8,946.09 ± 230.61 | 60.75 ± 0.20 |
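
The GPU run differs only in offloading all layers to the RTX 4070 Ti (again a reconstruction, shown for Q4_K_M):

```bash
# -ngl 99 offloads every layer to the GPU
llama-bench -m nanbeige4.1-3b-Q4_K_M.gguf -p 512 -n 128 -r 3 -ngl 99
```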

## Credits

Original model by Nanbeige. Quantized with llama.cpp.
