ubergarm committed
Commit cb5a12f · 1 Parent(s): b1a65d7

update README and add vram usage image

README.md CHANGED
@@ -40,7 +40,7 @@ Special mix `IQ3_K_R4`/`IQ2_K_R4` routed experts with all other layers full `q8_
40
  Great for CPU+GPU "troll rig" high end gamer systems e.g. 9950X 96 GB RAM + 3090TI 24 GB VRAM + Gen 5 NVMe SSD.
41
 
42
  #### Custom Mixes
43
- If you have multiple GPUs and more VRAM, you can make custom quants to optimize size and quants whatever hardware you have. If you have less VRAM, you could make a custom quant leaner in the non routed expert layers or get 64k+ context in 24GB VRAM. Also you can use the offline repack tool if you want to do CPU only with `mmap()` still enabled.
44
 
45
  ## Quick Start
46
  #### `ik_llama.cpp` API server for GPU+CPU
@@ -97,7 +97,9 @@ numactl -N 0 -m 0 \
97
 
98
  These are probably the **best quants available in this size class** for `V3-0324`!
99
 
100
- ![Benchmarks showing these quants are smaller in size yet similar in performance to the `Q8_0`](benchmarks-01.png "Benchmarks showing these quants are smaller in size yet similar in performance to the `Q8_0`")
 
 
101
 
102
  ubergarm made no sacrifices for token embedding, attention, dense
103
  layers, or shared experts. This is possible because `ik_llama.cpp` MLA
@@ -105,11 +107,16 @@ implementation saves so much GPU VRAM enabling 32k context in under 24GB
105
  VRAM. Also these quants use a new high quality imatrix including various
106
  coding samples and multiple written languages. Routed expert layers
107
  make use of SotA CPU `IQx_K_R4` non-linear quants as well for likely
108
- best perplexity per GiB.
 
 
109
 
110
  bartowski uses full token embedding quality but lower attention, dense
111
  layers, and shared expert quants. He does use a good quality imatrix with
112
  perplexity performance within the measurement error relative to this one.
 
 
 
113
 
114
  unsloth sacrifices token embedding with middle quality attention and
115
  dense layers, but no importance matrix.
@@ -126,7 +133,7 @@ provide details on [their recipe as well here](https://huggingface.co/mradermach
126
 
127
  | | [ubergarm/DeepSeek-V3-0324-IQ2_K_R4](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF?show_file_info=DeepSeek-V3-0324-IQ2_K_R4%2FDeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf) | [bartowski/DeepSeek-V3-0324-Q2_K_L](https://huggingface.co/bartowski/deepseek-ai_DeepSeek-V3-0324-GGUF?show_file_info=deepseek-ai_DeepSeek-V3-0324-Q2_K_L%2Fdeepseek-ai_DeepSeek-V3-0324-Q2_K_L-00001-of-00007.gguf) | [unsloth/DeepSeek-V3-0324-UD-Q2_K_XL](https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF?show_file_info=UD-Q2_K_XL%2FDeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf) | [mradermacher/DeepSeek-V3-0324-i1-GGUF-Q2_K](https://huggingface.co/mradermacher/DeepSeek-V3-0324-i1-GGUF) |
128
  | --- | --- | --- | --- | --- |
129
- | **Overview** | | | | |
130
  | `split.tensors.count` | 1147 | 1025 | 1025 | |
131
  | `token_embd.weight` | `Q8_0` | `Q8_0` | `Q4_K` | `IQ3_S` |
132
  | `output.weight` | | | | `Q5_K` |
@@ -167,8 +174,10 @@ provide details on [their recipe as well here](https://huggingface.co/mradermach
167
  | `blk.[3-60].ffn_up_shexp.weight` | `Q8_0` | `Q2_K` | `Q4_K` | `IQ2_XS`|
168
  | `blk.[3-60].attn_output.weight` | `Q8_0` | `Q3_K` | `Q4_K` | `IQ3_S` |
169
 | **Importance Matrix & Perplexity** | | | | |
170
- | `imatrix.dataset` | `calibration_data_v5_rc.txt`| `calibration_datav3.txt` | `imatrix-training-full-3` | ? |
171
- | Final PPL (wiki.test.raw) | 3.5614 +/- 0.02001 | ? | ? | ? |
 
 
172
 
173
  </details>
174
 
 
40
  Great for CPU+GPU "troll rig" high end gamer systems e.g. 9950X 96 GB RAM + 3090TI 24 GB VRAM + Gen 5 NVMe SSD.
41
 
42
  #### Custom Mixes
43
+ If you have more than 48GB VRAM across multiple GPUs, consider rolling your own custom quants to optimize size and performance for whatever hardware you have, using a custom `-ot` expression. If you have less VRAM, you could make a custom quant that is leaner in the non-routed expert layers, or one that fits 64k+ context in 24GB VRAM. You can also use the offline repack tool if you want to run CPU-only with `mmap()` still enabled.
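For illustration, a multi-GPU launch with custom `-ot`/`--override-tensor` expressions might look roughly like the sketch below. This is a hypothetical example rather than a tested recipe: the layer ranges, two-GPU split, model path, and thread count are placeholders, and the remaining flags simply follow the `ik_llama.cpp` Quick Start pattern used elsewhere in this card.

```bash
# Hypothetical 2x24GB-GPU layout -- layer ranges and paths are placeholders.
# The explicit CUDA pins are listed before the catch-all `exps=CPU` so a few
# routed-expert layers land on each GPU while the remaining experts stay on CPU.
./build/bin/llama-server \
    --model DeepSeek-V3-0324-IQ2_K_R4/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 2 -fa \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    -ot "blk\.(3|4|5|6)\.ffn_.*_exps=CUDA0" \
    -ot "blk\.(7|8|9|10)\.ffn_.*_exps=CUDA1" \
    -ot exps=CPU \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080
```

The `-ot` patterns are regular expressions matched against tensor names, so you can widen or narrow the pinned layer ranges until the per-GPU weight totals fit your cards.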
44
 
45
  ## Quick Start
46
  #### `ik_llama.cpp` API server for GPU+CPU
 
97
 
98
  These are probably the **best quants available in this size class** for `V3-0324`!
99
 
100
+ ![Benchmarks showing these quants are smaller in size yet similar in performance to the `Q8_0`](images/benchmarks-01.png "Benchmarks showing these quants are smaller in size yet similar in performance to the `Q8_0`")
101
+
102
+ ![VRAM Usage Chart](images/vram-usage.png "Chart showing linear VRAM usage vs context length.")
103
 
104
  ubergarm made no sacrifices for token embedding, attention, dense
105
  layers, or shared experts. This is possible because `ik_llama.cpp` MLA
 
107
  VRAM. Also these quants use a new high quality imatrix including various
108
  coding samples and multiple written languages. Routed expert layers
109
  make use of SotA CPU `IQx_K_R4` non-linear quants as well for likely
110
+ best perplexity per GiB. Both the `IQ2_K_R4` and `IQ4_K_R4` are designed
111
+ so that ~17.33GiB of weights offload to GPU VRAM, leaving the remaining
112
+ VRAM (roughly 6-7GiB on a 24GB card) available for context.
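If you want to sanity-check that split on an NVIDIA card, watching GPU memory while the server runs is enough. A minimal check, assuming `nvidia-smi` is available; the expected numbers below are rough estimates matching the VRAM chart above:

```bash
# Poll GPU memory once per second while llama-server is loading/serving.
# Expect usage a bit above the ~17.33GiB of offloaded weights once the model
# loads (compute buffers add overhead), then roughly linear growth as the
# KV cache fills with context.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```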
113
 
114
  bartowski uses full token embedding quality but lower attention, dense
115
  layers, and shared expert quants. He does use a good quality imatrix with
116
  perplexity performance within the measurement error relative to this one.
117
+ *UPDATE*: Also check out bartowski's new customized ["V2" flavors](https://huggingface.co/bartowski/deepseek-ai_DeepSeek-V3-0324-GGUF#v2-uploads),
118
+ recipes with improved perplexity for the size! The table below shows his
119
+ original "V1" flavor quants.
120
 
121
  unsloth sacrifices token embedding with middle quality attention and
122
  dense layers, but no importance matrix.
 
133
 
134
  | | [ubergarm/DeepSeek-V3-0324-IQ2_K_R4](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF?show_file_info=DeepSeek-V3-0324-IQ2_K_R4%2FDeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf) | [bartowski/DeepSeek-V3-0324-Q2_K_L](https://huggingface.co/bartowski/deepseek-ai_DeepSeek-V3-0324-GGUF?show_file_info=deepseek-ai_DeepSeek-V3-0324-Q2_K_L%2Fdeepseek-ai_DeepSeek-V3-0324-Q2_K_L-00001-of-00007.gguf) | [unsloth/DeepSeek-V3-0324-UD-Q2_K_XL](https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF?show_file_info=UD-Q2_K_XL%2FDeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf) | [mradermacher/DeepSeek-V3-0324-i1-GGUF-Q2_K](https://huggingface.co/mradermacher/DeepSeek-V3-0324-i1-GGUF) |
135
  | --- | --- | --- | --- | --- |
136
+ | **Overview** | | "V1" | | |
137
  | `split.tensors.count` | 1147 | 1025 | 1025 | |
138
  | `token_embd.weight` | `Q8_0` | `Q8_0` | `Q4_K` | `IQ3_S` |
139
  | `output.weight` | | | | `Q5_K` |
 
174
  | `blk.[3-60].ffn_up_shexp.weight` | `Q8_0` | `Q2_K` | `Q4_K` | `IQ2_XS`|
175
  | `blk.[3-60].attn_output.weight` | `Q8_0` | `Q3_K` | `Q4_K` | `IQ3_S` |
176
 | **Importance Matrix & Perplexity** | | | | |
177
+ | `imatrix.dataset` | `calibration_data_v5_rc.txt` | `calibration_datav3.txt` | none | `imatrix-training-full-3` |
178
+ | Final PPL (wiki.test.raw) | 3.5614 +/- 0.02001 | 3.9012 (V1) | ? | ? |
179
+
180
+ For reference, the `Q8_0` achieves `PPL = 3.3482 +/- 0.01847` on the same `wiki.test.raw` file.
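The perplexity figures above can in principle be reproduced with the `llama-perplexity` tool over the same `wiki.test.raw` file. A rough sketch follows; the offload flags mirror the GPU+CPU pattern used elsewhere in this card, and the context size, paths, and thread count are assumptions rather than the exact settings behind the reported numbers.

```bash
# Sketch of a perplexity run over wiki.test.raw (illustrative flags and paths).
./build/bin/llama-perplexity \
    --model DeepSeek-V3-0324-IQ2_K_R4/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
    -f wiki.test.raw \
    --ctx-size 512 \
    -mla 2 -fa -amb 512 -fmoe \
    --n-gpu-layers 63 \
    -ot exps=CPU \
    --threads 16
```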
181
 
182
  </details>
183
 
benchmarks-01.png → images/benchmarks-01.png RENAMED
File without changes
images/vram-usage.png ADDED

Git LFS Details

  • SHA256: 5e23bef37f0710df6283563910516d36b653bdfe6f48ff2c264be2f4a7db0398
  • Pointer size: 131 Bytes
  • Size of remote file: 183 kB