EAGLE3 Draft Model for GLM-4.7-Flash

GLM-4.7-Flash-Eagle3 is an EAGLE3 draft model trained for speculative decoding with GLM-4.7-Flash. It enables faster inference by predicting multiple future tokens in parallel, which are then verified by the target model in a single forward pass.

Version: 1.0
Release Date: 2026-02-16
Organization: ThoughtWorks
License: apache-2.0


Model Overview

This EAGLE3 draft model accelerates inference for zai-org/GLM-4.7-Flash through speculative decoding. The draft model predicts multiple tokens ahead, achieving 1.39× TPOT speedup for single requests and 1.70× throughput improvement under concurrent load.

Target Model: zai-org/GLM-4.7-Flash - Mixture-of-Experts language model with 3B active parameters
Draft Model Size: 277.4 MB
Architecture: 1-layer transformer with 2048 hidden dimensions
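
For intuition, the sketch below shows the generic draft-then-verify loop that speculative decoding follows. It is a minimal illustration only: the two model callables and the acceptance test are hypothetical stand-ins, not SGLang or EAGLE3 internals.

import random

def draft_next_tokens(prefix, k):
    # Hypothetical draft model: cheaply propose k candidate tokens after `prefix`.
    return [random.randint(0, 99) for _ in range(k)]

def target_verify(prefix, proposed):
    # Hypothetical target model: in one forward pass, accept the longest
    # matching prefix of `proposed` and supply one token of its own.
    accepted = []
    for tok in proposed:
        if random.random() < 0.4:  # ~40% per-token acceptance, as measured on MT-Bench
            accepted.append(tok)
        else:
            break
    return accepted, random.randint(0, 99)

def speculative_decode(prompt_tokens, max_new=32, k=6):
    out = list(prompt_tokens)
    while len(out) - len(prompt_tokens) < max_new:
        proposed = draft_next_tokens(out, k)            # cheap draft pass
        accepted, extra = target_verify(out, proposed)  # single target verification
        out.extend(accepted + [extra])                  # at least one token per step
    return out

print(speculative_decode([1, 2, 3]))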

Key Features

  • FlashInfer Compatible: head_dim=128 ✓
  • Acceptance Rate: 40.0% (MT-Bench, B=1)
  • Speedup: 1.39× TPOT (B=1), 1.70× throughput (B=32)
  • Hardware: Optimized for single GPU (TP=1) deployment

Architecture Specifications

Parameter Value
Hidden Size 2048
Attention Heads 16
KV Heads (GQA) 4
Head Dimension 128
Intermediate Size 8192
Layers 1
Vocabulary Size 154880
Draft Vocab Size 32000

Note: Hidden size matches target model (GLM-4.7-Flash) for embedding weight sharing.
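
For orientation, the table above maps onto configuration fields along the following lines; the key names below follow common Hugging Face conventions and are illustrative only, so consult the shipped config.json for the authoritative values.

# Illustrative view of the architecture table using common HF config key names.
# These names are assumptions; the shipped config.json is authoritative.
draft_config = {
    "hidden_size": 2048,        # matches GLM-4.7-Flash, enabling embedding weight sharing
    "num_attention_heads": 16,
    "num_key_value_heads": 4,   # GQA
    "head_dim": 128,            # FlashInfer-compatible
    "intermediate_size": 8192,
    "num_hidden_layers": 1,
    "vocab_size": 154880,
    "draft_vocab_size": 32000,
}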


Training Details

Dataset

Mixed Diversity — 54K samples

Composition:

  • 45% ShareGPT
  • 35% UltraChat
  • 20% PerfectBlend

Average tokens per sample: 1300

Hyperparameters

Parameter Value
Epochs 3
Batch Size 1
Learning Rate 1e-4
Warmup Ratio 0.03
Max Length 1024

Training Results

  • Training Acceptance Rate: 79.2% at position k=0 (first draft token; inference average across all 6 positions is ~40%)

Benchmark Results

Dataset: MT-Bench (154 prompts, max_tokens=512, temperature=0.7)
Hardware: Single NVIDIA H100 (79GB), TP=1
Backend: FlashInfer
Spec Config: num_steps=3, num_draft_tokens=6, eagle_topk=4

Metric Definitions

  • Acceptance Rate: Percentage of draft tokens accepted by the target model, averaged across all verification steps (not position-specific). Example: 40% means 2.4 of the 6 predicted tokens are accepted on average; the short script after this list reproduces this arithmetic.
  • Acceptance Length: Average number of consecutive draft tokens accepted per verification step (directly determines speedup).
  • TTFT: Time To First Token (prefill latency) in milliseconds
  • TPOT: Time Per Output Token (decode latency) in milliseconds
  • Throughput: Tokens generated per second
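
As a quick sanity check on these definitions, the short script below reproduces the headline numbers from the batch-size-1 table that follows.

# Reproduce the headline figures from the metric definitions above,
# using the batch-size-1 measurements reported below.
num_draft_tokens = 6
acceptance_rate = 0.40
print(round(acceptance_rate * num_draft_tokens, 2))  # 2.4 accepted tokens on average

tpot_baseline_ms, tpot_eagle_ms = 8.18, 5.89
print(round(tpot_baseline_ms / tpot_eagle_ms, 2))    # 1.39x TPOT speedup

tput_baseline, tput_eagle = 120.3, 167.75
print(round(tput_eagle / tput_baseline, 2))          # 1.39x throughput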

Batch Size 1 (Single Request - Latency Optimization)

Server-Side Metrics (Prometheus — Ground Truth)

Metric                 Baseline   EAGLE3    Speedup
TTFT (ms)              76.1       74.74     1.02×
TPOT (ms)              8.18       5.89      1.39×
Throughput (tok/s)     120.3      167.75    1.39×
Acceptance Rate (%)    —          40.0      —
Acceptance Length      —          2.4       —

Batch Size 32 (Concurrent Load - Throughput Optimization)

Server-Side Metrics (Prometheus — Ground Truth)

Metric                 Baseline   EAGLE3    Speedup
TTFT (ms)              2988       3210      0.93×
TPOT (ms)              22.57      17.33     1.30×
Throughput (tok/s)     258.61     440.15    1.70×
Acceptance Rate (%)    —          40.0†     —
Acceptance Length      —          2.4†      —

†Same server session as B=1; the concurrent benchmark does not collect per-request acceptance statistics.

Key Insight: Batch size 1 optimizes for interactive latency (TPOT matters most), while batch size 32 optimizes for serving capacity (throughput matters most).


Usage

Installation

pip install sglang transformers

Basic Usage

python -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-Flash \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-Flash-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 1 \
  --trust-remote-code \
  --port 30000 \
  --enable-metrics

Python API

import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100,
        "temperature": 0.7,
    }
)
print(response.json())
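
The server speaks the OpenAI-compatible chat completions protocol at /v1/chat/completions, so the official openai Python client can be pointed at it as well; the api_key value below is just a placeholder for a local deployment.

from openai import OpenAI

# Point the OpenAI client at the local SGLang server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
    temperature=0.7,
)
print(resp.choices[0].message.content)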

Performance Tips

  1. Backend Selection: Use the FlashInfer backend (the default) for optimal performance
  2. Tuning: Adjust num_draft_tokens based on workload (3-6 recommended)
  3. Monitoring: Enable the --enable-metrics flag and monitor the /metrics endpoint for acceptance rates
  4. Validation: Verify that the acceptance rate is > 0% after server startup to confirm the draft model loaded correctly (see the sketch after this list)
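
Following tips 3 and 4, a minimal way to confirm the draft model is active is to scrape the Prometheus text output from /metrics and inspect the speculative-decoding counters. Exact metric names vary across SGLang versions, so the sketch below filters on keywords rather than assuming specific names.

import requests

# Fetch Prometheus text-format metrics exposed via --enable-metrics and print
# any lines related to speculative decoding or acceptance.
metrics = requests.get("http://localhost:30000/metrics", timeout=10).text
for line in metrics.splitlines():
    if not line.startswith("#") and ("spec" in line or "accept" in line):
        print(line)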

Limitations

  • Requires SGLang backend with EAGLE3 support
  • Optimized for TP=1 inference (single GPU deployment)
  • FlashInfer backend recommended for optimal performance

Citation

@misc{glm_4.7_flash_eagle3_2026,
  title={EAGLE3 Draft Model for GLM-4.7-Flash},
  author={ThoughtWorks},
  year={2026},
  howpublished={\url{https://huggingface.co/thoughtworks/GLM-4.7-Flash-Eagle3}},
}

EAGLE3 Paper

@article{li2025eagle3,
  title={EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2503.01840},
  year={2025}
}


License

apache-2.0


Contact

For questions or issues, open a discussion on the model page.
