# Nanbeige4.1-3B-GGUF

GGUF quantizations of Nanbeige/Nanbeige4.1-3B for use with llama.cpp, Ollama, and other GGUF-compatible tools.

## Available Quantizations
| File | Quant | Size | Description |
|---|---|---|---|
| nanbeige4.1-3b-f16.gguf | F16 | 7.4 GB | Full precision (no quantization) |
| nanbeige4.1-3b-Q8_0.gguf | Q8_0 | 3.9 GB | Best quality, largest quantized size |
| nanbeige4.1-3b-Q6_K.gguf | Q6_K | 3.1 GB | Very high quality |
| nanbeige4.1-3b-Q5_K_M.gguf | Q5_K_M | 2.7 GB | High quality |
| nanbeige4.1-3b-Q4_K_M.gguf | Q4_K_M | 2.3 GB | Good quality; recommended for most users |
| nanbeige4.1-3b-Q3_K_M.gguf | Q3_K_M | 1.9 GB | Medium quality |
| nanbeige4.1-3b-Q2_K.gguf | Q2_K | 1.6 GB | Smallest size, lower quality (one user reported outputs getting stuck in repetition loops) |
## Usage

### Ollama

```bash
# Download and run a specific quantization (e.g. Q4_K_M) directly from Hugging Face
ollama run hf.co/tantk/Nanbeige4.1-3B-GGUF:Q4_K_M

# Or create a local model from a downloaded GGUF file and a Modelfile
ollama create nanbeige4.1-3b -f Modelfile
```
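For the `ollama create` route, a minimal Modelfile might look like the sketch below; the GGUF path is whichever quantization you downloaded, and the parameter values simply mirror the recommended settings further down. Depending on your Ollama version you may also need a `TEMPLATE` directive if the embedded ChatML template is not picked up automatically.

```
FROM ./nanbeige4.1-3b-Q4_K_M.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.0
```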
### llama.cpp

```bash
llama-cli -m nanbeige4.1-3b-Q4_K_M.gguf -p "Your prompt here" --temp 0.6 --top-p 0.95
```
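For serving rather than one-off prompts, the same files work with llama.cpp's bundled server. A minimal sketch, where the port and context size are illustrative choices rather than requirements:

```bash
# Start an OpenAI-compatible HTTP server on the Q4_K_M file
llama-server -m nanbeige4.1-3b-Q4_K_M.gguf -c 8192 --port 8080

# Query it with the recommended sampling settings
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "temperature": 0.6, "top_p": 0.95}'
```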
## Model Details
- Base Model: Nanbeige/Nanbeige4.1-3B
- Architecture: LlamaForCausalLM
- Parameters: 3B class (3.93 B total weights)
- Context Length: 131,072 tokens
- Chat Template: ChatML (`<|im_start|>` / `<|im_end|>`); see the prompt-format sketch after this list
- License: Apache 2.0
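Most runtimes apply the embedded chat template automatically, so the raw format only matters when you drive a plain completion endpoint yourself. For reference, a prompt in the expected ChatML layout looks like this (the system message is illustrative):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Your prompt here<|im_end|>
<|im_start|>assistant
```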
## Recommended Settings
- Temperature: 0.6
- Top-p: 0.95
- Repeat penalty: 1.0
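If you run the model through Ollama's HTTP API, these values can be passed per request via the `options` field; a sketch assuming the `nanbeige4.1-3b` model created above:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "nanbeige4.1-3b",
  "prompt": "Your prompt here",
  "stream": false,
  "options": {"temperature": 0.6, "top_p": 0.95, "repeat_penalty": 1.0}
}'
```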
## Benchmark Results

### Test Hardware
| Component | Spec |
|---|---|
| CPU | AMD Ryzen 5 5600G (6 cores / 12 threads, 3.9 GHz) |
| RAM | 32 GB DDR4-3200 (4x 8 GB Kingston) |
| GPU | NVIDIA GeForce RTX 4070 Ti (12 GB VRAM) |
| OS | Windows 11 Pro |
### CPU Benchmark (llama-bench)
- Backend: CPU
- Threads: 6
- Prompt tokens: 512 (pp512)
- Generation tokens: 128 (tg128)
- Repetitions: 3
- Tool: llama-bench (llama.cpp build 0c1f39a); an equivalent invocation is sketched below
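A command along these lines reproduces the CPU configuration (standard llama-bench flags; substitute the quantization file you want to measure):

```bash
# CPU-only run: 6 threads, pp512 / tg128, 3 repetitions
llama-bench -m nanbeige4.1-3b-Q4_K_M.gguf -t 6 -p 512 -n 128 -r 3
```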
| Quant | Size | Params | Prompt (t/s) | Generation (t/s) |
|---|---|---|---|---|
| Q2_K | 1.51 GiB | 3.93 B | 47.14 ± 0.71 | 20.99 ± 1.04 |
| Q3_K_M | 1.87 GiB | 3.93 B | 40.23 ± 1.01 | 17.65 ± 0.25 |
| Q4_K_M | 2.27 GiB | 3.93 B | 67.80 ± 1.14 | 14.35 ± 0.52 |
| Q5_K_M | 2.63 GiB | 3.93 B | 29.68 ± 0.24 | 13.75 ± 0.17 |
| Q6_K | 3.01 GiB | 3.93 B | 33.76 ± 2.41 | 12.28 ± 0.06 |
| Q8_0 | 3.89 GiB | 3.93 B | 45.07 ± 0.41 | 9.07 ± 0.47 |
| F16 | 7.33 GiB | 3.93 B | 31.08 ± 0.75 | 5.22 ± 0.05 |
### GPU Benchmark (llama-bench)
- Backend: CUDA (RTX 4070 Ti, 100% GPU offload, ngl=99)
- Prompt tokens: 512 (pp512)
- Generation tokens: 128 (tg128)
- Repetitions: 3
- Tool: llama-bench (llama.cpp build 0c1f39a); an equivalent invocation is sketched below
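The GPU configuration is the same apart from full offload to the RTX 4070 Ti (`-ngl 99` requests offloading all layers to CUDA):

```bash
# CUDA run: offload all layers, pp512 / tg128, 3 repetitions
llama-bench -m nanbeige4.1-3b-Q4_K_M.gguf -ngl 99 -p 512 -n 128 -r 3
```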
| Quant | Size | Params | Prompt (t/s) | Generation (t/s) |
|---|---|---|---|---|
| Q2_K | 1.51 GiB | 3.93 B | 7,904.89 ± 44.44 | 194.47 ± 1.68 |
| Q3_K_M | 1.87 GiB | 3.93 B | 9,233.97 ± 132.75 | 162.72 ± 1.04 |
| Q4_K_M | 2.27 GiB | 3.93 B | 9,977.17 ± 123.83 | 155.27 ± 0.21 |
| Q5_K_M | 2.63 GiB | 3.93 B | 8,060.71 ± 1484.42 | 139.18 ± 0.44 |
| Q6_K | 3.01 GiB | 3.93 B | 7,794.85 ± 1023.17 | 126.49 ± 0.83 |
| Q8_0 | 3.89 GiB | 3.93 B | 6,349.76 ± 698.63 | 102.88 ± 0.32 |
| F16 | 7.33 GiB | 3.93 B | 8,946.09 ± 230.61 | 60.75 ± 0.20 |
## Credits

Original model: Nanbeige/Nanbeige4.1-3B.