---
library_name: transformers
pipeline_tag: text-generation
license: mit
language:
  - en
base_model:
  - miromind-ai/MiroThinker-v1.0-30B
tags:
  - agent
  - open-source
  - miromind
  - deep-research
  - fp8
  - quantized
  - vllm
  - sglang
---

# MiroThinker-v1.0-30B-FP8


## Model Description

This is an FP8-quantized version of [miromind-ai/MiroThinker-v1.0-30B](https://huggingface.co/miromind-ai/MiroThinker-v1.0-30B), created with `llmcompressor` (Neural Magic).

Key Benefits:

- ~50% smaller model size (30GB vs 60GB)
- ~2x faster inference on FP8-capable GPUs (Ada Lovelace, Hopper)
- Native vLLM and SGLang support
- Minimal quality loss with FP8 dynamic quantization

## Quantization Details

| Property | Value |
|---|---|
| Quantization Method | FP8 Dynamic (W8A8) |
| Weights Precision | FP8 E4M3 (8-bit) |
| Activations Precision | FP8 E4M3 (8-bit, dynamic) |
| Ignored Layers | `lm_head` (kept in BF16) |
| Quantization Tool | llmcompressor 0.12.2 |
| Original Model Size | ~60 GB |
| Quantized Model Size | ~30 GB |

### Quantization Recipe

```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: FP8_DYNAMIC
```

## Quick Start with Docker

The easiest way to run this model. No setup required beyond Docker with the NVIDIA container runtime.

### Docker Compose (Recommended)

```bash
# Download docker-compose.yml
wget https://huggingface.co/Doradus/MiroThinker-v1.0-30B-FP8/raw/main/docker/docker-compose.yml

# Run with 2 GPUs (recommended)
docker compose up

# Or single GPU (not recommended - poor performance)
SINGLE_GPU=1 docker compose up
```

### Docker Run

```bash
# TP=2 with 2 GPUs (recommended)
docker run --gpus all -p 8000:8000 \
  -v hf_cache:/root/.cache/huggingface \
  --shm-size=16g \
  vllm/vllm-openai:v0.11.2 \
  --model Doradus/MiroThinker-v1.0-30B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --trust-remote-code

# Single GPU fallback (expect ~1-2 tok/s)
docker run --gpus '"device=0"' -p 8000:8000 \
  -v hf_cache:/root/.cache/huggingface \
  --shm-size=16g \
  vllm/vllm-openai:v0.11.2 \
  --model Doradus/MiroThinker-v1.0-30B-FP8 \
  --tensor-parallel-size 1 \
  --max-model-len 2048 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --trust-remote-code
```

### Test the API

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Doradus/MiroThinker-v1.0-30B-FP8",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```
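The same request from Python, using the `openai` client against the server started above (vLLM does not validate the API key by default, so any placeholder works):

```python
# Sketch: call the OpenAI-compatible vLLM endpoint from Python.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Doradus/MiroThinker-v1.0-30B-FP8",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```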

## Usage

### vLLM (Recommended)

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Doradus/MiroThinker-v1.0-30B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --trust-remote-code
```

### SGLang

```bash
python -m sglang.launch_server \
  --model-path Doradus/MiroThinker-v1.0-30B-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --tp 2
```

### Transformers (for inspection only)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Doradus/MiroThinker-v1.0-30B-FP8",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Doradus/MiroThinker-v1.0-30B-FP8")
```
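To check the quantization metadata without loading ~30 GB of weights, you can read the config alone; a small sketch (the exact field layout depends on the compressed-tensors version used at export time):

```python
# Sketch: inspect the embedded quantization config without loading weights.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "Doradus/MiroThinker-v1.0-30B-FP8",
    trust_remote_code=True,
)
# For compressed-tensors checkpoints this is a dict describing the FP8 scheme;
# exact keys vary with the llmcompressor / compressed-tensors version.
print(getattr(cfg, "quantization_config", None))
```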

## Recommended Inference Parameters

For optimal performance in agentic tasks (from the original MiroThinker documentation):

```python
temperature = 1.0
top_p = 0.95
repetition_penalty = 1.05
max_context_length = 262144
max_tokens = 16384
```
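If you use vLLM's offline API rather than the HTTP server, these settings map directly onto `SamplingParams`; a minimal sketch (TP=2 and a 32K context window are assumptions here, adjust to your hardware):

```python
# Sketch: apply the recommended sampling settings via vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Doradus/MiroThinker-v1.0-30B-FP8",
    tensor_parallel_size=2,
    max_model_len=32768,   # raise toward 262144 if you have the VRAM
    trust_remote_code=True,
)
params = SamplingParams(
    temperature=1.0,
    top_p=0.95,
    repetition_penalty=1.05,
    max_tokens=2048,       # the card recommends up to 16384 per response
)
# For chat-style use, apply the model's chat template or use the OpenAI server.
outputs = llm.generate(["Summarize the benefits of FP8 quantization."], params)
print(outputs[0].outputs[0].text)
```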

## Architecture Details

This is a Mixture of Experts (MoE) model based on Qwen3MoE architecture:

| Property | Value |
|---|---|
| Total Parameters | ~30B (all experts) |
| Active Parameters | ~3.3B per forward pass |
| Hidden Size | 2048 |
| Attention Heads | 32 |
| KV Heads (GQA) | 4 |
| Layers | 48 |
| Experts | 128 total, 8 active per token |
| MoE Intermediate Size | 768 per expert |
| Max Context | 262,144 tokens |
| Vocabulary | 151,936 tokens |

## Hardware Requirements

### VRAM Analysis

Model weights: ~30 GB (vs ~57 GB for the BF16 original)

| Context Length | KV Cache (FP16) | Total VRAM | Fits Single GPU? |
|---|---|---|---|
| 2K tokens | ~0.1 GB | ~31 GB | RTX 5090 (tight) |
| 4K tokens | ~0.2 GB | ~31 GB | RTX 5090 (tight) |
| 8K tokens | ~0.4 GB | ~31 GB | RTX 5090 |
| 16K tokens | ~0.8 GB | ~32 GB | A100-40GB |
| 32K tokens | ~1.6 GB | ~33 GB | A100-40GB |
| 64K tokens | ~3.2 GB | ~35 GB | A100-80GB |
| 131K tokens | ~6.4 GB | ~38 GB | A100-80GB / H100 |
| 262K tokens | ~12.8 GB | ~45 GB | H100 or TP=2 |

KV cache calculated for GQA with 4 KV heads, 128 head_dim, 48 layers, FP16 KV

### Recommended Configurations

| GPU Setup | Max Context | Performance | Notes |
|---|---|---|---|
| 1x RTX 4090 (24GB) | OOM | N/A | Model too large |
| 1x RTX 5090 (32GB) | ~2K tokens | ~1-2 tok/s | Requires `--enforce-eager`, not recommended |
| 2x RTX 4090 (24GB), TP=2 | ~16K tokens | ~60 tok/s | Recommended consumer config |
| 2x RTX 5090 (32GB), TP=2 | ~32K tokens | ~80 tok/s | Recommended consumer config |
| 1x A100-40GB | ~8K tokens | ~40 tok/s | Single GPU possible |
| 2x A100-40GB, TP=2 | ~64K tokens | ~80 tok/s | Good production config |
| 1x A100-80GB | ~131K tokens | ~60 tok/s | TP=1 possible |
| 1x H100-80GB | ~262K tokens | ~120 tok/s | Full context, TP=1 |

### Single 32GB GPU Limitations

The model weights alone require 29.2 GiB, leaving minimal headroom for KV cache on a 32GB GPU. Single RTX 5090 operation is technically possible but not recommended for production:

- Requires `--enforce-eager` (disables CUDA graphs, significant performance penalty)
- Maximum context: ~2048 tokens
- Throughput: ~1-2 tokens/second (severely memory-bound)
- No headroom for batched requests

If you only have a single 32GB GPU, this configuration will work but with poor performance:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Doradus/MiroThinker-v1.0-30B-FP8 \
  --max-model-len 2048 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --trust-remote-code
```

Strongly recommended: Use TP=2 with two 24GB+ GPUs for usable performance.

Note: FP8 inference requires CUDA compute capability 8.9 (Ada Lovelace) or 9.0 (Hopper) and newer for optimal performance. On older GPUs the model will still run, but it may fall back to slower non-FP8 kernels.
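To check what your GPU reports before launching, a quick PyTorch sketch:

```python
# Sketch: check the CUDA compute capability of GPU 0 before serving FP8.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 compute capability: {major}.{minor}")
if (major, minor) < (8, 9):
    print("No native FP8 tensor-core support; expect slower fallback kernels.")
```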

## Quality & Performance

### FP8 vs BF16 Comparison

| Metric | BF16 Original | FP8 Quantized | Delta |
|---|---|---|---|
| Model Size | 57 GB | 30 GB | -47% |
| Load Time | ~45s | ~25s | -44% |
| Memory BW | ~2x | 1x (baseline) | FP8 wins |

### Expected Quality Retention

FP8 dynamic quantization (W8A8) typically preserves >99% of model quality for reasoning tasks. The lm_head is kept in BF16 to maintain output distribution fidelity.

### Why FP8 Dynamic?

- No calibration data needed (faster quantization)
- Dynamic activation quantization adapts per-token
- E4M3 format balances range and precision well for LLMs
- Native hardware support on Ada/Hopper (no overhead)

### Original Model Benchmarks (from the arXiv paper)

MiroThinker is an agentic research model designed for multi-turn tool use, not traditional LLM benchmarks. The original BF16 model was evaluated on agent-specific benchmarks requiring up to 600 tool calls per task:

| Benchmark | MiroThinker-30B (BF16) | Description |
|---|---|---|
| GAIA | ~70% | General AI Assistant (tool use) |
| BrowseComp | ~40% | Web browsing comprehension |
| BrowseComp-ZH | ~50% | Chinese web browsing |
| HLE-Text | ~30% | Humanity's Last Exam |

*Approximate scores, taken from the paper's scaling analysis.*

Note on FP8 quality: These agentic benchmarks require full agent infrastructure (browser, tools, multi-turn execution) and cannot be directly run on the quantized model in isolation. However, FP8 W8A8 dynamic quantization typically preserves >99% of model quality based on extensive research (Neural Magic, vLLM benchmarks).

### Supplementary Benchmarks (lm-evaluation-harness)

For reference, we ran traditional LLM benchmarks, though these don't reflect the model's primary use case:

| Benchmark | Metric | Score | Notes |
|---|---|---|---|
| IFEval | Instruction-level (loose) | 46.0% | Instruction following |
| IFEval | Instruction-level (strict) | 44.2% | Strict compliance |
| GSM8K (5-shot) | Exact Match (flexible) | 18.0% | Not optimized for this |

The low GSM8K score reflects the model's `<think>`-block reasoning behavior (suited to agentic tasks) rather than direct answer generation.

Supplementary benchmarks run 2025-12-03 using lm-evaluation-harness
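If you want to re-run these supplementary numbers yourself, the harness exposes a Python API; a hedged sketch follows (task names and backend arguments can differ between lm-evaluation-harness versions, so treat this as a starting point rather than the exact invocation used):

```python
# Sketch: re-run IFEval and GSM8K with lm-evaluation-harness against this model.
# Task names / backend arguments may differ across lm_eval versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=Doradus/MiroThinker-v1.0-30B-FP8,"
        "tensor_parallel_size=2,trust_remote_code=True"
    ),
    tasks=["ifeval", "gsm8k"],  # gsm8k defaults to 5-shot in the harness
)
print(results["results"])
```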

### Measured Throughput (RTX PRO 6000 Blackwell, 96GB)

Tested on vLLM with TP=2, 32K max context:

| Test Type | Tokens Generated | Time | Throughput |
|---|---|---|---|
| Short reasoning | 100 | 0.83s | 119.9 tok/s |
| Code generation | 256 | 2.1s | 121.7 tok/s |
| Long explanation | 512 | 4.24s | 120.8 tok/s |
| Average | 868 | 7.17s | 121.1 tok/s |

VRAM Usage: ~45GB per GPU (TP=2) at 32K context

Tested 2025-12-03 on Doradus infrastructure with vLLM 0.11.x
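A minimal sketch of how numbers like these can be reproduced against a running server (wall-clock time over the server-reported completion tokens, so prefill time is included and the result slightly understates pure decode throughput):

```python
# Sketch: rough throughput measurement against a running vLLM server.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="Doradus/MiroThinker-v1.0-30B-FP8",
    messages=[{"role": "user", "content": "Explain FP8 quantization in detail."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} tok/s")
```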

## Reproduction

To reproduce this quantization:

```python
#!/usr/bin/env python3
"""
Quantize MiroThinker-30B to FP8 using llmcompressor (Neural Magic)
Dynamic quantization - no calibration data needed, fast conversion
Output is vLLM-compatible FP8
"""

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_PATH = "miromind-ai/MiroThinker-v1.0-30B"
OUTPUT_PATH = "./MiroThinker-v1.0-30B-FP8"

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(
    model=MODEL_PATH,
    output_dir=OUTPUT_PATH,
    recipe=recipe,
    num_calibration_samples=0,
    save_compressed=True,
)
```

Requirements:

```bash
pip install llmcompressor torch transformers accelerate
```
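As a quick sanity check after quantization, the safetensors shards in the output directory should total roughly the ~30 GB quoted above; a small sketch using the OUTPUT_PATH from the script:

```python
# Sketch: confirm the quantized checkpoint lands near the expected ~30 GB.
from pathlib import Path

out_dir = Path("./MiroThinker-v1.0-30B-FP8")  # OUTPUT_PATH from the script above
total_bytes = sum(f.stat().st_size for f in out_dir.glob("*.safetensors"))
print(f"Quantized weights on disk: {total_bytes / 1e9:.1f} GB")
```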

## Original Model

This quantization is based on miromind-ai/MiroThinker-v1.0-30B.

MiroThinker v1.0 is an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Key features:

- 256K context window
- Up to 600 tool calls per task
- Interactive scaling via RL training
- Strong performance on HLE-Text, BrowseComp, GAIA benchmarks

For full details, see the MiroThinker paper and GitHub repository.

## License

This model inherits the MIT License from the original MiroThinker model.

## Citation

If you use this model, please cite the original MiroThinker paper:

```bibtex
@article{miromind2025mirothinker,
  title={MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling},
  author={MiroMind Team and Bai, Song and Bing, Lidong and Chen, Carson and Chen, Guanzheng and Chen, Yuntao and Chen, Zhe and Chen, Ziyi and Dai, Jifeng and Dong, Xuan and others},
  journal={arXiv preprint arXiv:2511.11793},
  year={2025}
}
```

## Acknowledgements