update README and add vram usage image
- README.md +15 -6
- benchmarks-01.png → images/benchmarks-01.png +0 -0
- images/vram-usage.png +3 -0
README.md CHANGED
@@ -40,7 +40,7 @@ Special mix `IQ3_K_R4`/`IQ2_K_R4` routed experts with all other layers full `q8_
 Great for CPU+GPU "troll rig" high-end gamer systems, e.g. 9950X 96 GB RAM + 3090TI 24 GB VRAM + Gen 5 NVMe SSD.
 
 #### Custom Mixes
-If you have multiple GPUs
+If you have more than 48GB VRAM across multiple GPUs, consider rolling your own custom quants to optimize size and performance for whatever hardware you have, using a custom `-ot` expression. If you have less VRAM, you could make a custom quant that is leaner in the non-routed expert layers, or get 64k+ context in 24GB VRAM. You can also use the offline repack tool if you want to run CPU-only with `mmap()` still enabled.
 
 ## Quick Start
 #### `ik_llama.cpp` API server for GPU+CPU
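For illustration, here is a minimal sketch of what such a multi-GPU `-ot` override might look like when launching the `ik_llama.cpp` server. The model path, block ranges, device names, and thread count are placeholders rather than a tested recipe; adapt them to your own hardware.

```bash
# Sketch only: "-ngl 99" asks to offload every layer, then each "-ot" rule
# overrides where matching tensors live: the FFN/expert tensors of a few
# blocks are pinned to each GPU, and the remaining routed experts ("exps")
# stay in system RAM.
./build/bin/llama-server \
    --model /models/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
    --ctx-size 32768 \
    -ngl 99 \
    -ot "blk\.(3|4|5|6)\.ffn_.*=CUDA0" \
    -ot "blk\.(7|8|9|10)\.ffn_.*=CUDA1" \
    -ot exps=CPU \
    --threads 16 \
    --host 127.0.0.1 --port 8080
```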
@@ -97,7 +97,9 @@ numactl -N 0 -m 0 \
 
 These are probably the **best quants available in this size class** for `V3-0324`!
 
-![](benchmarks-01.png)
+![](images/benchmarks-01.png)
+
+![](images/vram-usage.png)
 
 ubergarm made no sacrifices for token embedding, attention, dense
 layers, or shared experts. This is possible because `ik_llama.cpp` MLA
@@ -105,11 +107,16 @@ implementation saves so much GPU VRAM enabling 32k context in under 24GB
 VRAM. Also these quants use a new high quality imatrix including various
 coding samples and multiple written languages. Routed expert layers
 make use of SotA CPU `IQx_K_R4` non-linear quants as well for likely
-best perplexity per GiB.
+best perplexity per GiB. Both the `IQ2_K_R4` and `IQ4_K_R4` are designed
+for ~17.33GiB of weights offloaded to GPU VRAM, with the remaining VRAM
+available for context.
 
 bartowski uses full token embedding quality but lower attention, dense
 layers, and shared expert quants. He does use a good quality imatrix, with
 perplexity performance within the measurement error relative to this one.
+*UPDATE*: Also check out bartowski's new customized ["V2" flavors](https://huggingface.co/bartowski/deepseek-ai_DeepSeek-V3-0324-GGUF#v2-uploads),
+recipes with improved perplexity for the size! The table below shows his
+original flavor quants.
 
 unsloth sacrifices token embedding quality, with middle quality attention
 and dense layers, and no importance matrix.
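As a rough back-of-envelope check of that budget, assuming a single 24GB card like the 3090TI mentioned above: 24 GiB minus ~17.33 GiB of weights leaves roughly 6 to 7 GiB for the MLA KV cache and compute buffers, which is the headroom that lets 32k context fit.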
@@ -126,7 +133,7 @@ provide details on [their recipe as well here](https://huggingface.co/mradermach
 
 | | [ubergarm/DeepSeek-V3-0324-IQ2_K_R4](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF?show_file_info=DeepSeek-V3-0324-IQ2_K_R4%2FDeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf) | [bartowski/DeepSeek-V3-0324-Q2_K_L](https://huggingface.co/bartowski/deepseek-ai_DeepSeek-V3-0324-GGUF?show_file_info=deepseek-ai_DeepSeek-V3-0324-Q2_K_L%2Fdeepseek-ai_DeepSeek-V3-0324-Q2_K_L-00001-of-00007.gguf) | [unsloth/DeepSeek-V3-0324-UD-Q2_K_XL](https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF?show_file_info=UD-Q2_K_XL%2FDeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf) | [mradermacher/DeepSeek-V3-0324-i1-GGUF-Q2_K](https://huggingface.co/mradermacher/DeepSeek-V3-0324-i1-GGUF) |
 | --- | --- | --- | --- | --- |
-| **Overview** | |
+| **Overview** | | "V1" | | |
 | `split.tensors.count` | 1147 | 1025 | 1025 | |
 | `token_embd.weight` | `Q8_0` | `Q8_0` | `Q4_K` | `IQ3_S` |
 | `output.weight` | | | | `Q5_K` |
@@ -167,8 +174,10 @@ provide details on [their recipe as well here](https://huggingface.co/mradermach
 | `blk.[3-60].ffn_up_shexp.weight` | `Q8_0` | `Q2_K` | `Q4_K` | `IQ2_XS`|
 | `blk.[3-60].attn_output.weight` | `Q8_0` | `Q3_K` | `Q4_K` | `IQ3_S` |
 | **Importance Matrix & Perplexity** | | | | |
-| `imatrix.dataset` | `calibration_data_v5_rc.txt`| `calibration_datav3.txt` | `imatrix-training-full-3` |
-| Final PPL (wiki.test.raw) | 3.5614 +/- 0.02001 |
+| `imatrix.dataset` | `calibration_data_v5_rc.txt`| `calibration_datav3.txt` | none | `imatrix-training-full-3` |
+| Final PPL (wiki.test.raw) | 3.5614 +/- 0.02001 | 3.9012 (V1) | ? | ? |
+
+For reference, the `Q8_0` achieves `PPL = 3.3482 +/- 0.01847` on the same `wiki.test.raw` file.
 
 </details>
 
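For anyone wanting to reproduce perplexity numbers like those in the last rows, a run along the following lines should work. This is a sketch: the binary name and flags follow `llama.cpp`-style builds such as `ik_llama.cpp`, and the model path and thread count are placeholders.

```bash
# Sketch: compute perplexity over wiki.test.raw for one quant
./build/bin/llama-perplexity \
    --model /models/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
    -f wiki.test.raw \
    --threads 24
```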
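To pull the same kind of per-tensor breakdown for any other GGUF, one option (assumed here, not something this commit relies on) is the `gguf` Python package from the llama.cpp tree, whose dump script prints each tensor's name, shape, and quantization type.

```bash
# Sketch: list metadata and per-tensor quant types for a GGUF shard
pip install gguf
gguf-dump /models/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf
```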
benchmarks-01.png → images/benchmarks-01.png RENAMED
File without changes

images/vram-usage.png ADDED (Git LFS)