EAGLE3 Draft Model for GLM-4.7-Flash

GLM-4.7-Flash-Eagle3 is an EAGLE3 draft model trained for speculative decoding with GLM-4.7-Flash. It enables faster inference by predicting multiple future tokens in parallel, which are then verified by the target model in a single forward pass.

Version: 1.0
Release Date: 2026-02-16
Organization: ThoughtWorks
License: apache-2.0


Model Overview

This EAGLE3 draft model accelerates inference for zai-org/GLM-4.7-Flash through speculative decoding. The draft model predicts multiple tokens ahead, achieving 1.39× TPOT speedup for single requests and 1.70× throughput improvement under concurrent load.

Target Model: zai-org/GLM-4.7-Flash - Mixture-of-Experts language model with 3B active parameters
Draft Model Size: 277.4 MB
Architecture: 1-layer transformer with 2048 hidden dimensions
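
For intuition, the sketch below shows the generic draft-then-verify loop that speculative decoding follows. It is a minimal illustration only: the two model callables and the acceptance test are hypothetical stand-ins, not SGLang or EAGLE3 internals.

import random

def draft_next_tokens(prefix, k):
    # Hypothetical draft model: cheaply propose k candidate tokens after `prefix`.
    return [random.randint(0, 99) for _ in range(k)]

def target_verify(prefix, proposed):
    # Hypothetical target model: in one forward pass, accept the longest
    # matching prefix of `proposed` and supply one token of its own.
    accepted = []
    for tok in proposed:
        if random.random() < 0.4:  # ~40% per-token acceptance, as measured on MT-Bench
            accepted.append(tok)
        else:
            break
    return accepted, random.randint(0, 99)

def speculative_decode(prompt_tokens, max_new=32, k=6):
    out = list(prompt_tokens)
    while len(out) - len(prompt_tokens) < max_new:
        proposed = draft_next_tokens(out, k)            # cheap draft pass
        accepted, extra = target_verify(out, proposed)  # single target verification
        out.extend(accepted + [extra])                  # at least one token per step
    return out

print(speculative_decode([1, 2, 3]))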

Key Features

  • FlashInfer Compatible: head_dim=128 ✓
  • Acceptance Rate: 40.0% (MT-Bench, B=1)
  • Speedup: 1.39× TPOT (B=1), 1.70× throughput (B=32)
  • Hardware: Optimized for single GPU (TP=1) deployment

Architecture Specifications

Parameter Value
Hidden Size 2048
Attention Heads 16
KV Heads (GQA) 4
Head Dimension 128
Intermediate Size 8192
Layers 1
Vocabulary Size 154880
Draft Vocab Size 32000

Note: Hidden size matches target model (GLM-4.7-Flash) for embedding weight sharing.
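
For orientation, the table above maps onto configuration fields along the following lines; the key names below follow common Hugging Face conventions and are illustrative only, so consult the shipped config.json for the authoritative values.

# Illustrative view of the architecture table using common HF config key names.
# These names are assumptions; the shipped config.json is authoritative.
draft_config = {
    "hidden_size": 2048,        # matches GLM-4.7-Flash, enabling embedding weight sharing
    "num_attention_heads": 16,
    "num_key_value_heads": 4,   # GQA
    "head_dim": 128,            # FlashInfer-compatible
    "intermediate_size": 8192,
    "num_hidden_layers": 1,
    "vocab_size": 154880,
    "draft_vocab_size": 32000,
}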


Training Details

Dataset

Mixed Diversity — 54K samples

Composition:

  • 45% ShareGPT
  • 35% UltraChat
  • 20% PerfectBlend

Average tokens per sample: 1300

Hyperparameters

Parameter Value
Epochs 3
Batch Size 1
Learning Rate 1e-4
Warmup Ratio 0.03
Max Length 1024

Training Results

  • Training Acceptance Rate: 79.2% at position k=0 (first draft token; inference average across all 6 positions is ~40%)

Benchmark Results

Dataset: MT-Bench (154 prompts, max_tokens=512, temperature=0.7)
Hardware: Single NVIDIA H100 (79GB), TP=1
Backend: FlashInfer
Spec Config: num_steps=3, num_draft_tokens=6, eagle_topk=4

Metric Definitions

  • Acceptance Rate: Percentage of draft tokens accepted by the target model, averaged across all verification steps (not position-specific). Example: 40% means 2.4 of the 6 predicted tokens are accepted on average; the short script after this list reproduces this arithmetic.
  • Acceptance Length: Average number of consecutive draft tokens accepted per verification step (directly determines speedup).
  • TTFT: Time To First Token (prefill latency) in milliseconds
  • TPOT: Time Per Output Token (decode latency) in milliseconds
  • Throughput: Tokens generated per second
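
As a quick sanity check on these definitions, the short script below reproduces the headline numbers from the batch-size-1 table that follows.

# Reproduce the headline figures from the metric definitions above,
# using the batch-size-1 measurements reported below.
num_draft_tokens = 6
acceptance_rate = 0.40
print(round(acceptance_rate * num_draft_tokens, 2))  # 2.4 accepted tokens on average

tpot_baseline_ms, tpot_eagle_ms = 8.18, 5.89
print(round(tpot_baseline_ms / tpot_eagle_ms, 2))    # 1.39x TPOT speedup

tput_baseline, tput_eagle = 120.3, 167.75
print(round(tput_eagle / tput_baseline, 2))          # 1.39x throughput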

Batch Size 1 (Single Request - Latency Optimization)

Server-Side Metrics (Prometheus — Ground Truth)

Metric                 Baseline   EAGLE3    Speedup
TTFT (ms)              76.1       74.74     1.02×
TPOT (ms)              8.18       5.89      1.39×
Throughput (tok/s)     120.3      167.75    1.39×
Acceptance Rate (%)    —          40.0      —
Acceptance Length      —          2.4       —

Batch Size 32 (Concurrent Load - Throughput Optimization)

Server-Side Metrics (Prometheus — Ground Truth)

Metric                 Baseline   EAGLE3    Speedup
TTFT (ms)              2988       3210      0.93×
TPOT (ms)              22.57      17.33     1.30×
Throughput (tok/s)     258.61     440.15    1.70×
Acceptance Rate (%)    —          40.0†     —
Acceptance Length      —          2.4†      —

†Same server session as B=1; the concurrent benchmark does not collect per-request acceptance statistics.

Key Insight: Batch size 1 optimizes for interactive latency (TPOT matters most), while batch size 32 optimizes for serving capacity (throughput matters most).


Usage

Installation

pip install sglang transformers

Basic Usage

python -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-Flash \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-Flash-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 1 \
  --trust-remote-code \
  --port 30000 \
  --enable-metrics

Python API

import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100,
        "temperature": 0.7,
    }
)
print(response.json())
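
The server speaks the OpenAI-compatible chat completions protocol at /v1/chat/completions, so the official openai Python client can be pointed at it as well; the api_key value below is just a placeholder for a local deployment.

from openai import OpenAI

# Point the OpenAI client at the local SGLang server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
    temperature=0.7,
)
print(resp.choices[0].message.content)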

Performance Tips

  1. Backend Selection: Use the FlashInfer backend (the default) for optimal performance
  2. Tuning: Adjust num_draft_tokens based on workload (3-6 recommended)
  3. Monitoring: Enable the --enable-metrics flag and monitor the /metrics endpoint for acceptance rates
  4. Validation: Verify that the acceptance rate is > 0% after server startup to confirm the draft model loaded correctly (see the sketch after this list)
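
Following tips 3 and 4, a minimal way to confirm the draft model is active is to scrape the Prometheus text output from /metrics and inspect the speculative-decoding counters. Exact metric names vary across SGLang versions, so the sketch below filters on keywords rather than assuming specific names.

import requests

# Fetch Prometheus text-format metrics exposed via --enable-metrics and print
# any lines related to speculative decoding or acceptance.
metrics = requests.get("http://localhost:30000/metrics", timeout=10).text
for line in metrics.splitlines():
    if not line.startswith("#") and ("spec" in line or "accept" in line):
        print(line)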

Limitations

  • Requires SGLang backend with EAGLE3 support
  • Optimized for TP=1 inference (single GPU deployment)
  • FlashInfer backend recommended for optimal performance

Citation

@misc{glm_4.7_flash_eagle3_2026,
  title={EAGLE3 Draft Model for GLM-4.7-Flash},
  author={ThoughtWorks},
  year={2026},
  howpublished={\url{https://huggingface.co/thoughtworks/GLM-4.7-Flash-Eagle3}},
}

EAGLE3 Paper

@article{li2025eagle3,
  title={EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2503.01840},
  year={2025}
}


License

apache-2.0


Contact

For questions or issues, open a discussion on the model page.
