EAGLE3 Draft Model for GLM-4.7-Flash
GLM-4.7-Flash-Eagle3 is an EAGLE3 draft model trained for speculative decoding with GLM-4.7-Flash. It enables faster inference by predicting multiple future tokens in parallel, which are then verified by the target model in a single forward pass.
- Version: 1.0
- Release Date: 2026-02-16
- Organization: ThoughtWorks
- License: apache-2.0
Model Overview
This EAGLE3 draft model accelerates inference for zai-org/GLM-4.7-Flash through speculative decoding. The draft model predicts multiple tokens ahead, achieving 1.39× TPOT speedup for single requests and 1.70× throughput improvement under concurrent load.
- Target Model: zai-org/GLM-4.7-Flash, a Mixture-of-Experts language model with 3B active parameters
- Draft Model Size: 277.4 MB
- Architecture: 1-layer transformer with a hidden size of 2048
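To build intuition for how the draft and target interact, here is a minimal greedy sketch of one draft-then-verify step. The `draft_next` and `target_next` functions are toy stand-ins (not GLM-4.7-Flash or this EAGLE3 head), and a real engine such as SGLang verifies a whole tree of candidates in a single batched target forward pass rather than position by position.

```python
# Minimal greedy sketch of one draft-then-verify step in speculative decoding.
# `draft_next` / `target_next` are toy stand-ins for the draft head and target model.
from typing import Callable, List

def speculative_step(
    context: List[int],
    draft_next: Callable[[List[int]], int],
    target_next: Callable[[List[int]], int],
    num_draft_tokens: int = 6,
) -> List[int]:
    # 1) The cheap draft model proposes a short continuation.
    proposal: List[int] = []
    for _ in range(num_draft_tokens):
        proposal.append(draft_next(context + proposal))

    # 2) The target checks each proposed position; in a real engine this
    #    verification happens in a single batched forward pass.
    accepted: List[int] = []
    for token in proposal:
        expected = target_next(context + accepted)
        if expected == token:
            accepted.append(token)      # draft token accepted
        else:
            accepted.append(expected)   # rejected: keep the target's own token and stop
            break
    else:
        # Every draft token was accepted; the target contributes one bonus token.
        accepted.append(target_next(context + accepted))

    return accepted  # between 1 and num_draft_tokens + 1 new tokens per target pass
```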
Key Features
- FlashInfer Compatible: head_dim=128 ✓
- Acceptance Rate: 40.0% (MT-Bench, B=1)
- Speedup: 1.39× TPOT (B=1), 1.70× throughput (B=32)
- Hardware: Optimized for single GPU (TP=1) deployment
Architecture Specifications
| Parameter | Value |
|---|---|
| Hidden Size | 2048 |
| Attention Heads | 16 |
| KV Heads (GQA) | 4 |
| Head Dimension | 128 |
| Intermediate Size | 8192 |
| Layers | 1 |
| Vocabulary Size | 154880 |
| Draft Vocab Size | 32000 |
Note: Hidden size matches target model (GLM-4.7-Flash) for embedding weight sharing.
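If you want to double-check the FlashInfer-compatible geometry locally, a quick config inspection along these lines should suffice. The attribute names below follow common Hugging Face config conventions and are assumptions; consult the model's actual config.json if they differ.

```python
# Assumed sanity check of the draft geometry; attribute names (hidden_size,
# num_attention_heads) follow common Hugging Face config conventions.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "thoughtworks/GLM-4.7-Flash-Eagle3", trust_remote_code=True
)
head_dim = cfg.hidden_size // cfg.num_attention_heads  # 2048 // 16 = 128
print(f"head_dim = {head_dim}")  # expect 128 for FlashInfer compatibility
```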
Training Details
Dataset
Mixed Diversity — 54K samples
Composition:
- 45% ShareGPT
- 35% UltraChat
- 20% PerfectBlend
Average tokens per sample: 1300
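For reference, the stated proportions imply roughly the following per-source sample counts; this is plain arithmetic on the numbers above, not the actual preprocessing pipeline.

```python
# Per-source sample counts implied by the stated mix (illustrative arithmetic only).
proportions = {"ShareGPT": 0.45, "UltraChat": 0.35, "PerfectBlend": 0.20}
total_samples = 54_000
counts = {name: round(share * total_samples) for name, share in proportions.items()}
print(counts)  # {'ShareGPT': 24300, 'UltraChat': 18900, 'PerfectBlend': 10800}
```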
Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch Size | 1 |
| Learning Rate | 1e-4 |
| Warmup Ratio | 0.03 |
| Max Length | 1024 |
Training Results
- Training Acceptance Rate: 79.2% at position k=0 (first draft token; inference average across all 6 positions is ~40%)
Benchmark Results
- Dataset: MT-Bench (154 prompts, max_tokens=512, temperature=0.7)
- Hardware: Single NVIDIA H100 (79 GB), TP=1
- Backend: FlashInfer
- Spec Config: num_steps=3, num_draft_tokens=6, eagle_topk=4
Metric Definitions
- Acceptance Rate: Percentage of draft tokens accepted by target model, averaged across all verification steps (NOT position-specific). Example: 40% = 2.4 out of 6 predicted tokens accepted on average.
- Acceptance Length: Average number of consecutive draft tokens accepted per verification step (directly determines speedup).
- TTFT: Time To First Token (prefill latency) in milliseconds
- TPOT: Time Per Output Token (decode latency) in milliseconds
- Throughput: Tokens generated per second
Batch Size 1 (Single Request - Latency Optimization)
Server-Side Metrics (Prometheus — Ground Truth)
| Metric | Baseline | EAGLE3 | Speedup |
|---|---|---|---|
| TTFT (ms) | 76.1 | 74.74 | 1.02× |
| TPOT (ms) | 8.18 | 5.89 | 1.39× |
| Throughput (tok/s) | 120.3 | 167.75 | 1.39× |
| Acceptance Rate (%) | — | 40.0% | — |
| Acceptance Length | — | 2.4 | — |
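The arithmetic tying these figures together, using only the values reported in the table and the spec config above:

```python
# How the reported B=1 figures relate (all inputs are values reported above).
num_draft_tokens = 6
acceptance_rate = 0.40
acceptance_length = acceptance_rate * num_draft_tokens   # 2.4 draft tokens accepted per verification step

baseline_tpot_ms, eagle_tpot_ms = 8.18, 5.89
tpot_speedup = baseline_tpot_ms / eagle_tpot_ms           # ~1.39x faster per output token

baseline_tput, eagle_tput = 120.3, 167.75
throughput_speedup = eagle_tput / baseline_tput           # ~1.39x more tokens per second

print(round(acceptance_length, 1), round(tpot_speedup, 2), round(throughput_speedup, 2))
```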
Batch Size 32 (Concurrent Load - Throughput Optimization)
Server-Side Metrics (Prometheus — Ground Truth)
| Metric | Baseline | EAGLE3 | Speedup |
|---|---|---|---|
| TTFT (ms) | 2988 | 3210 | 0.93× |
| TPOT (ms) | 22.57 | 17.33 | 1.30× |
| Throughput (tok/s) | 258.61 | 440.15 | 1.70× |
| Acceptance Rate (%) | — | 40.0%† | — |
| Acceptance Length | — | 2.4† | — |
† Same server session as B=1; the concurrent benchmark does not collect per-request acceptance stats.
Key Insight: Batch size 1 optimizes for interactive latency (TPOT matters most), while batch size 32 optimizes for serving capacity (throughput matters most).
Usage
Installation
```bash
pip install sglang transformers
```
Basic Usage
```bash
python -m sglang.launch_server \
    --model-path zai-org/GLM-4.7-Flash \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thoughtworks/GLM-4.7-Flash-Eagle3 \
    --speculative-num-steps 3 \
    --speculative-num-draft-tokens 6 \
    --speculative-eagle-topk 4 \
    --tp 1 \
    --trust-remote-code \
    --port 30000 \
    --enable-metrics
```
Python API
```python
import requests

# Query the OpenAI-compatible chat endpoint exposed by the SGLang server.
# Speculative decoding is transparent to the client: requests look the same
# as for non-speculative serving.
response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100,
        "temperature": 0.7,
    },
)
print(response.json())
```
Performance Tips
- Backend Selection: Use FlashInfer backend (default) for optimal performance
- Tuning: Adjust `num_draft_tokens` based on workload (3-6 recommended)
- Monitoring: Enable the `--enable-metrics` flag and monitor the `/metrics` endpoint for acceptance rates (see the sketch after this list)
- Validation: Verify the acceptance rate is > 0% after server startup to confirm the draft model loaded correctly
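One quick way to act on the last two tips is to scrape the Prometheus endpoint once the server is up and look for acceptance-related series. Metric names vary across SGLang versions, so the sketch below filters lines generically instead of assuming specific names.

```python
# Hedged sketch: inspect the /metrics endpoint (exposed via --enable-metrics)
# for acceptance-related series. Exact metric names differ across SGLang
# versions, so filter generically rather than hard-coding them.
import requests

metrics_text = requests.get("http://localhost:30000/metrics").text
for line in metrics_text.splitlines():
    if "accept" in line.lower() and not line.startswith("#"):
        print(line)
```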
Limitations
- Requires SGLang backend with EAGLE3 support
- Optimized for TP=1 inference (single GPU deployment)
- FlashInfer backend recommended for optimal performance
Citation
```bibtex
@misc{glm_4.7_flash_eagle3_2026,
  title        = {EAGLE3 Draft Model for GLM-4.7-Flash},
  author       = {ThoughtWorks},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/thoughtworks/GLM-4.7-Flash-Eagle3}},
}
```
EAGLE3 Paper
```bibtex
@article{li2025eagle3,
  title   = {EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author  = {Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal = {arXiv preprint arXiv:2503.01840},
  year    = {2025}
}
```
Additional Resources
- Target Model: zai-org/GLM-4.7-Flash
License
apache-2.0
Contact
For questions or issues, open a discussion on the model page.