Doradus committed
Commit 87d0180 · verified · Parent: 6309def

Upload README.md with huggingface_hub

Files changed (1): README.md (+372, −0)
---
library_name: transformers
pipeline_tag: text-generation
license: mit
language:
- en
base_model:
- miromind-ai/MiroThinker-v1.0-30B
tags:
- agent
- open-source
- miromind
- deep-research
- fp8
- quantized
- vllm
- sglang
---

# MiroThinker-v1.0-30B-FP8

<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/68525b342230a897a65cc1c0/87mYQ_a-4jpnMkVR4hrgm.png" width="55%" alt="MiroThinker" />
</div>

## Model Description

This is an **FP8 quantized** version of [miromind-ai/MiroThinker-v1.0-30B](https://huggingface.co/miromind-ai/MiroThinker-v1.0-30B), created using [llmcompressor](https://github.com/vllm-project/llm-compressor) (Neural Magic).

**Key Benefits:**
- ~50% smaller model size (30GB vs 60GB)
- ~2x faster inference on FP8-capable GPUs (Ada Lovelace, Hopper)
- Native vLLM and SGLang support
- Minimal quality loss with FP8 dynamic quantization

## Quantization Details

| Property | Value |
|----------|-------|
| Quantization Method | FP8 Dynamic (W8A8) |
| Weights Precision | FP8 E4M3 (8-bit) |
| Activations Precision | FP8 E4M3 (8-bit, dynamic) |
| Ignored Layers | `lm_head` (kept in BF16) |
| Quantization Tool | llmcompressor 0.12.2 |
| Original Model Size | ~60GB |
| Quantized Model Size | ~30GB |

### Quantization Recipe

```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: FP8_DYNAMIC
```
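
For a quick look at how this recipe is recorded in the published checkpoint, the sketch below downloads only `config.json` and prints its `quantization_config` block (the exact key layout depends on the llmcompressor / compressed-tensors version used):

```python
import json

from huggingface_hub import hf_hub_download

# Fetch just the config file and print the embedded quantization metadata.
path = hf_hub_download("Doradus/MiroThinker-v1.0-30B-FP8", "config.json")
with open(path) as f:
    cfg = json.load(f)
print(json.dumps(cfg.get("quantization_config", {}), indent=2))
```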

## Quick Start with Docker

The easiest way to run this model: no setup required beyond Docker with the NVIDIA container runtime.

### Docker Compose (Recommended)

```bash
# Download docker-compose.yml
wget https://huggingface.co/Doradus/MiroThinker-v1.0-30B-FP8/raw/main/docker/docker-compose.yml

# Run with 2 GPUs (recommended)
docker compose up

# Or single GPU (not recommended - poor performance)
SINGLE_GPU=1 docker compose up
```

### Docker Run

```bash
# TP=2 with 2 GPUs (recommended)
docker run --gpus all -p 8000:8000 \
  -v hf_cache:/root/.cache/huggingface \
  --shm-size=16g \
  vllm/vllm-openai:v0.11.2 \
  --model Doradus/MiroThinker-v1.0-30B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --trust-remote-code

# Single GPU fallback (expect ~1-2 tok/s)
docker run --gpus '"device=0"' -p 8000:8000 \
  -v hf_cache:/root/.cache/huggingface \
  --shm-size=16g \
  vllm/vllm-openai:v0.11.2 \
  --model Doradus/MiroThinker-v1.0-30B-FP8 \
  --tensor-parallel-size 1 \
  --max-model-len 2048 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --trust-remote-code
```

### Test the API

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Doradus/MiroThinker-v1.0-30B-FP8",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```
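
The same request can be issued from Python through any OpenAI-compatible client. A minimal sketch using the `openai` package, assuming the server above is running on localhost:8000 (the API key is a placeholder; vLLM ignores it unless you set `--api-key`):

```python
from openai import OpenAI

# Point the client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Doradus/MiroThinker-v1.0-30B-FP8",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```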

## Usage

### vLLM (Recommended)

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Doradus/MiroThinker-v1.0-30B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --trust-remote-code
```

### SGLang

```bash
python -m sglang.launch_server \
  --model-path Doradus/MiroThinker-v1.0-30B-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --tp 2
```

### Transformers (for inspection only)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Doradus/MiroThinker-v1.0-30B-FP8",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Doradus/MiroThinker-v1.0-30B-FP8")
```

## Recommended Inference Parameters

For optimal performance in agentic tasks (from the original MiroThinker documentation):

```python
temperature = 1.0
top_p = 0.95
repetition_penalty = 1.05
max_context_length = 262144
max_tokens = 16384
```
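
As an illustration, here is a minimal offline-inference sketch that applies these settings via vLLM's `SamplingParams` (assumes two GPUs as in the recommended configurations below; the prompt is just a placeholder):

```python
from vllm import LLM, SamplingParams

# Recommended sampling settings from the original MiroThinker documentation.
params = SamplingParams(
    temperature=1.0,
    top_p=0.95,
    repetition_penalty=1.05,
    max_tokens=16384,
)

llm = LLM(
    model="Doradus/MiroThinker-v1.0-30B-FP8",
    tensor_parallel_size=2,
    trust_remote_code=True,
)

outputs = llm.generate(["Outline a research plan on FP8 quantization."], params)
print(outputs[0].outputs[0].text)
```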

## Architecture Details

This is a **Mixture of Experts (MoE)** model based on the Qwen3MoE architecture:

| Property | Value |
|----------|-------|
| Total Parameters | ~30B (all experts) |
| Active Parameters | ~3.3B per forward pass |
| Hidden Size | 2048 |
| Attention Heads | 32 |
| KV Heads (GQA) | 4 |
| Layers | 48 |
| Experts | 128 total, 8 active per token |
| MoE Intermediate Size | 768 per expert |
| Max Context | 262,144 tokens |
| Vocabulary | 151,936 tokens |
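
These numbers can be checked against the checkpoint without downloading the weights. A quick sketch; the attribute names assume the standard Qwen3MoE config schema:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Doradus/MiroThinker-v1.0-30B-FP8", trust_remote_code=True)
print("layers:", cfg.num_hidden_layers)                # expect 48
print("attention heads:", cfg.num_attention_heads)     # expect 32
print("kv heads:", cfg.num_key_value_heads)            # expect 4
print("experts:", cfg.num_experts, "/ active:", cfg.num_experts_per_tok)  # expect 128 / 8
print("max context:", cfg.max_position_embeddings)     # expect 262144
```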

## Hardware Requirements

### VRAM Analysis

Model weights: **30GB** (vs 57GB BF16 original)

| Context Length | KV Cache (FP16) | Total VRAM | Fits Single GPU? |
|----------------|-----------------|------------|------------------|
| 2K tokens | ~0.1 GB | ~31 GB | RTX 5090 (tight) |
| 4K tokens | ~0.2 GB | ~31 GB | RTX 5090 (tight) |
| 8K tokens | ~0.4 GB | ~31 GB | RTX 5090 |
| 16K tokens | ~0.8 GB | ~32 GB | A100-40GB |
| 32K tokens | ~1.6 GB | ~33 GB | A100-40GB |
| 64K tokens | ~3.2 GB | ~35 GB | A100-80GB |
| 131K tokens | ~6.4 GB | ~38 GB | A100-80GB / H100 |
| 262K tokens | ~12.8 GB | ~45 GB | H100 or TP=2 |

*KV cache calculated for GQA with 4 KV heads, 128 head_dim, 48 layers, FP16 KV*

### Recommended Configurations

| GPU Setup | Max Context | Performance | Notes |
|-----------|-------------|-------------|-------|
| 1x RTX 4090 (24GB) | OOM | N/A | Model too large |
| 1x RTX 5090 (32GB) | ~2K tokens | ~1-2 tok/s | Requires `--enforce-eager`, **not recommended** |
| **2x RTX 4090 (24GB) TP=2** | ~16K tokens | ~60 tok/s | **Recommended consumer config** |
| **2x RTX 5090 (32GB) TP=2** | ~32K tokens | ~80 tok/s | **Recommended consumer config** |
| 1x A100-40GB | ~8K tokens | ~40 tok/s | Single GPU possible |
| 2x A100-40GB TP=2 | ~64K tokens | ~80 tok/s | Good production config |
| 1x A100-80GB | ~131K tokens | ~60 tok/s | TP=1 possible |
| 1x H100-80GB | ~262K tokens | ~120 tok/s | Full context, TP=1 |

### Single 32GB GPU Limitations

The model weights alone require **29.2 GiB**, leaving minimal headroom for KV cache on a 32GB GPU. Single RTX 5090 operation is technically possible but **not recommended for production**:

- Requires `--enforce-eager` (disables CUDA graphs, significant performance penalty)
- Maximum context: ~2048 tokens
- Throughput: ~1-2 tokens/second (severely memory-bound)
- No headroom for batched requests

**If you only have a single 32GB GPU**, this configuration will work but with poor performance:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Doradus/MiroThinker-v1.0-30B-FP8 \
  --max-model-len 2048 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --trust-remote-code
```

**Strongly recommended**: Use TP=2 with two 24GB+ GPUs for usable performance.

**Note**: Optimal FP8 inference requires CUDA compute capability 8.9 (Ada Lovelace) or 9.0 (Hopper) and newer. On older GPUs the model will still run, but it may fall back to slower kernels.
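
A quick way to check which side of that line your GPU falls on (a small sketch using PyTorch, which the stacks above already require):

```python
import torch

# Ada Lovelace is SM 8.9, Hopper is SM 9.0; anything below lacks native FP8 support.
major, minor = torch.cuda.get_device_capability(0)
native_fp8 = (major, minor) >= (8, 9)
print(f"GPU 0 compute capability: {major}.{minor} -> "
      f"{'native FP8' if native_fp8 else 'fallback kernels'}")
```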

## Quality & Performance

### FP8 vs BF16 Comparison

| Metric | BF16 Original | FP8 Quantized | Delta |
|--------|---------------|---------------|-------|
| Model Size | 57 GB | 30 GB | **-47%** |
| Load Time | ~45s | ~25s | **-44%** |
| Memory Bandwidth Required | ~2x | 1x (baseline) | FP8 wins |

### Expected Quality Retention

FP8 dynamic quantization (W8A8) typically preserves >99% of model quality for reasoning tasks. The `lm_head` is kept in BF16 to maintain output distribution fidelity.

**Why FP8 Dynamic?**
- No calibration data needed (faster quantization)
- Dynamic activation quantization adapts per token
- E4M3 format balances range and precision well for LLMs (see the snippet below)
- Native hardware support on Ada/Hopper (no overhead)
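
For intuition on the E4M3 point above, PyTorch exposes the format's numeric limits directly (illustrative only; the values are properties of the `float8_e4m3fn` dtype):

```python
import torch

# E4M3: 4 exponent bits, 3 mantissa bits. Enough dynamic range for LLM weights
# and activations, at the cost of coarse precision.
info = torch.finfo(torch.float8_e4m3fn)
print("max:", info.max)        # 448.0
print("min normal:", info.tiny)  # 0.015625
print("eps:", info.eps)        # 0.125
```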

### Original Model Benchmarks (from [arXiv paper](https://arxiv.org/abs/2511.11793))

MiroThinker is an **agentic research model** designed for multi-turn tool use, not traditional LLM benchmarks. The original BF16 model was evaluated on agent-specific benchmarks requiring up to 600 tool calls per task:

| Benchmark | MiroThinker-30B (BF16) | Description |
|-----------|------------------------|-------------|
| **GAIA** | ~70% | General AI Assistant (tool use) |
| **BrowseComp** | ~40% | Web browsing comprehension |
| **BrowseComp-ZH** | ~50% | Chinese web browsing |
| **HLE-Text** | ~30% | Humanity's Last Exam |

*Scores from the paper's scaling analysis*

**Note on FP8 quality**: These agentic benchmarks require full agent infrastructure (browser, tools, multi-turn execution) and cannot be directly run on the quantized model in isolation. However, FP8 W8A8 dynamic quantization typically preserves **>99% of model quality** based on extensive research ([Neural Magic](https://neuralmagic.com/blog/fp8-quantization/), [vLLM benchmarks](https://docs.vllm.ai/en/latest/quantization/fp8.html)).

### Supplementary Benchmarks (lm-evaluation-harness)

For reference, we ran traditional LLM benchmarks, though these don't reflect the model's primary use case:

| Benchmark | Metric | Score | Notes |
|-----------|--------|-------|-------|
| IFEval | Instruction-level (loose) | 46.0% | Instruction following |
| IFEval | Instruction-level (strict) | 44.2% | Strict compliance |
| GSM8K (5-shot) | Exact Match (flexible) | 18.0% | Not optimized for this |

The GSM8K score reflects the model's `<think>`-block reasoning behavior (suited for agentic tasks) rather than direct answer generation.

*Supplementary benchmarks run 2025-12-03 using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)*
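
If you post-process outputs for benchmarks or downstream consumers, a small helper along these lines keeps only the final answer (a sketch, assuming the model wraps its reasoning in `<think>...</think>` tags as described above):

```python
import re

def strip_think(text: str) -> str:
    """Remove <think>...</think> reasoning blocks, leaving only the final answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print(strip_think("<think>2 + 2, carry nothing...</think>The answer is 4."))  # -> "The answer is 4."
```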

### Measured Throughput (RTX PRO 6000 Blackwell, 96GB)

Tested on vLLM with TP=2, 32K max context:

| Test Type | Tokens Generated | Time | Throughput |
|-----------|------------------|------|------------|
| Short reasoning | 100 | 0.83s | **119.9 tok/s** |
| Code generation | 256 | 2.1s | **121.7 tok/s** |
| Long explanation | 512 | 4.24s | **120.8 tok/s** |
| **Total (avg throughput)** | 868 | 7.17s | **121.1 tok/s** |

**VRAM Usage**: ~45GB per GPU (TP=2) at 32K context

*Tested 2025-12-03 on Doradus infrastructure with vLLM 0.11.x*

## Reproduction

To reproduce this quantization:

```python
#!/usr/bin/env python3
"""
Quantize MiroThinker-30B to FP8 using llmcompressor (Neural Magic).
Dynamic quantization: no calibration data needed, fast conversion.
Output is vLLM-compatible FP8.
"""

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_PATH = "miromind-ai/MiroThinker-v1.0-30B"
OUTPUT_PATH = "./MiroThinker-v1.0-30B-FP8"

# FP8 dynamic (W8A8) on all Linear layers, keeping lm_head in BF16.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(
    model=MODEL_PATH,
    output_dir=OUTPUT_PATH,
    recipe=recipe,
    num_calibration_samples=0,
    save_compressed=True,
)
```

**Requirements:**
```
pip install llmcompressor torch transformers accelerate
```
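
To sanity-check the result, you can confirm that the saved shards actually contain FP8 weights. A sketch; shard and tensor names depend on the export, so this just samples the first shard:

```python
import glob

from safetensors import safe_open

# Expect most weights as float8_e4m3fn, with lm_head (and norms) left in higher precision.
shard = sorted(glob.glob("./MiroThinker-v1.0-30B-FP8/*.safetensors"))[0]
with safe_open(shard, framework="pt") as f:
    for name in list(f.keys())[:8]:
        print(name, f.get_tensor(name).dtype)
```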

## Original Model

This quantization is based on [miromind-ai/MiroThinker-v1.0-30B](https://huggingface.co/miromind-ai/MiroThinker-v1.0-30B).

MiroThinker v1.0 is an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Key features:

- 256K context window
- Up to 600 tool calls per task
- Interactive scaling via RL training
- Strong performance on HLE-Text, BrowseComp, GAIA benchmarks

For full details, see the [MiroThinker paper](https://arxiv.org/abs/2511.11793) and [GitHub repository](https://github.com/MiroMindAI/MiroThinker).

## License

This model inherits the **MIT License** from the original MiroThinker model.

## Citation

If you use this model, please cite the original MiroThinker paper:

```bibtex
@article{miromind2025mirothinker,
  title={MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling},
  author={MiroMind Team and Bai, Song and Bing, Lidong and Chen, Carson and Chen, Guanzheng and Chen, Yuntao and Chen, Zhe and Chen, Ziyi and Dai, Jifeng and Dong, Xuan and others},
  journal={arXiv preprint arXiv:2511.11793},
  year={2025}
}
```

## Acknowledgements

- [MiroMind AI](https://miromind.ai/) for the original MiroThinker model
- [Neural Magic / vLLM](https://github.com/vllm-project/llm-compressor) for llmcompressor
- [DoradusAI](https://doradusonline.com) for the quantization