update README and add vram usage image
- README.md +15 -6
- benchmarks-01.png → images/benchmarks-01.png +0 -0
- images/vram-usage.png +3 -0
README.md CHANGED
@@ -40,7 +40,7 @@ Special mix `IQ3_K_R4`/`IQ2_K_R4` routed experts with all other layers full `q8_
 Great for CPU+GPU "troll rig" high-end gamer systems, e.g. 9950X 96 GB RAM + 3090TI 24 GB VRAM + Gen 5 NVMe SSD.
 
 #### Custom Mixes
-If you have multiple GPUs
+If you have more than 48GB VRAM across multiple GPUs, consider rolling your own custom quants to optimize size and performance for whatever hardware you have, using a custom `-ot` expression. If you have less VRAM, you could make a custom quant that is leaner in the non-routed expert layers, or get 64k+ context in 24GB VRAM. You can also use the offline repack tool if you want to run CPU-only with `mmap()` still enabled.
 
 ## Quick Start
 #### `ik_llama.cpp` API server for GPU+CPU
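For illustration, here is a minimal sketch of what such a multi-GPU `-ot` override might look like when launching the `ik_llama.cpp` server. The model path, block ranges, device names, and thread count are placeholders rather than a tested recipe; adapt them to your own hardware.

```bash
# Sketch only: "-ngl 99" asks to offload every layer, then each "-ot" rule
# overrides where matching tensors live: the FFN/expert tensors of a few
# blocks are pinned to each GPU, and the remaining routed experts ("exps")
# stay in system RAM.
./build/bin/llama-server \
    --model /models/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
    --ctx-size 32768 \
    -ngl 99 \
    -ot "blk\.(3|4|5|6)\.ffn_.*=CUDA0" \
    -ot "blk\.(7|8|9|10)\.ffn_.*=CUDA1" \
    -ot exps=CPU \
    --threads 16 \
    --host 127.0.0.1 --port 8080
```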
@@ -97,7 +97,9 @@ numactl -N 0 -m 0 \
 
 These are probably the **best quants available in this size class** for `V3-0324`!
 
-![](benchmarks-01.png)
+![](images/benchmarks-01.png)
+
+![](images/vram-usage.png)
 
 ubergarm made no sacrifices for token embedding, attention, dense
 layers, or shared experts. This is possible because `ik_llama.cpp` MLA
@@ -105,11 +107,16 @@ implementation saves so much GPU VRAM enabling 32k context in under 24GB
 VRAM. Also these quants use a new high quality imatrix including various
 coding samples and multiple written languages. Routed expert layers
 make use of SotA CPU `IQx_K_R4` non-linear quants as well for likely
-best perplexity per GiB.
+best perplexity per GiB. Both the `IQ2_K_R4` and `IQ4_K_R4` are designed
+for ~17.33GiB of weights offloaded to GPU VRAM, with the remaining VRAM
+available for context.
 
 bartowski uses full token embedding quality but lower attention, dense
 layers, and shared expert quants. He does use a good quality imatrix, with
 perplexity performance within the measurement error relative to this one.
+*UPDATE*: Also check out bartowski's new customized ["V2" flavors](https://huggingface.co/bartowski/deepseek-ai_DeepSeek-V3-0324-GGUF#v2-uploads),
+recipes with improved perplexity for the size! The table below shows his
+original flavor quants.
 
 unsloth sacrifices token embedding quality, with middle quality attention
 and dense layers, and no importance matrix.
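As a rough back-of-envelope check of that budget, assuming a single 24GB card like the 3090TI mentioned above: 24 GiB minus ~17.33 GiB of weights leaves roughly 6 to 7 GiB for the MLA KV cache and compute buffers, which is the headroom that lets 32k context fit.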
@@ -126,7 +133,7 @@ provide details on [their recipe as well here](https://huggingface.co/mradermach
 
 | | [ubergarm/DeepSeek-V3-0324-IQ2_K_R4](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF?show_file_info=DeepSeek-V3-0324-IQ2_K_R4%2FDeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf) | [bartowski/DeepSeek-V3-0324-Q2_K_L](https://huggingface.co/bartowski/deepseek-ai_DeepSeek-V3-0324-GGUF?show_file_info=deepseek-ai_DeepSeek-V3-0324-Q2_K_L%2Fdeepseek-ai_DeepSeek-V3-0324-Q2_K_L-00001-of-00007.gguf) | [unsloth/DeepSeek-V3-0324-UD-Q2_K_XL](https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF?show_file_info=UD-Q2_K_XL%2FDeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf) | [mradermacher/DeepSeek-V3-0324-i1-GGUF-Q2_K](https://huggingface.co/mradermacher/DeepSeek-V3-0324-i1-GGUF) |
 | --- | --- | --- | --- | --- |
-| **Overview** | |
+| **Overview** | | "V1" | | |
 | `split.tensors.count` | 1147 | 1025 | 1025 | |
 | `token_embd.weight` | `Q8_0` | `Q8_0` | `Q4_K` | `IQ3_S` |
 | `output.weight` | | | | `Q5_K` |
@@ -167,8 +174,10 @@ provide details on [their recipe as well here](https://huggingface.co/mradermach
 | `blk.[3-60].ffn_up_shexp.weight` | `Q8_0` | `Q2_K` | `Q4_K` | `IQ2_XS`|
 | `blk.[3-60].attn_output.weight` | `Q8_0` | `Q3_K` | `Q4_K` | `IQ3_S` |
 | **Importance Matrix & Perplexity** | | | | |
-| `imatrix.dataset` | `calibration_data_v5_rc.txt`| `calibration_datav3.txt` | `imatrix-training-full-3` |
-| Final PPL (wiki.test.raw) | 3.5614 +/- 0.02001 |
+| `imatrix.dataset` | `calibration_data_v5_rc.txt`| `calibration_datav3.txt` | none | `imatrix-training-full-3` |
+| Final PPL (wiki.test.raw) | 3.5614 +/- 0.02001 | 3.9012 (V1) | ? | ? |
+
+For reference, the `Q8_0` achieves `PPL = 3.3482 +/- 0.01847` on the same `wiki.test.raw` file.
 
 </details>
 
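For anyone wanting to reproduce perplexity numbers like those in the last rows, a run along the following lines should work. This is a sketch: the binary name and flags follow `llama.cpp`-style builds such as `ik_llama.cpp`, and the model path and thread count are placeholders.

```bash
# Sketch: compute perplexity over wiki.test.raw for one quant
./build/bin/llama-perplexity \
    --model /models/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
    -f wiki.test.raw \
    --threads 24
```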
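To pull the same kind of per-tensor breakdown for any other GGUF, one option (assumed here, not something this commit relies on) is the `gguf` Python package from the llama.cpp tree, whose dump script prints each tensor's name, shape, and quantization type.

```bash
# Sketch: list metadata and per-tensor quant types for a GGUF shard
pip install gguf
gguf-dump /models/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf
```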
benchmarks-01.png → images/benchmarks-01.png RENAMED
File without changes

images/vram-usage.png ADDED (Git LFS)