GLM-4.7-Flash is fast, good, and cheap. 3,074 tokens/sec peak with a 200k-token context window on my desktop PC. Works with Claude Code and opencode for hours, no errors, as a drop-in replacement for Anthropic's cloud models. MIT licensed, open weights, free for commercial use and modification. Supports speculative decoding via MTP, which is highly effective at cutting latency. Great for on-device AI coding: the AWQ 4-bit quant is just 18.5 GB, running hybrid inference on a single consumer GPU plus CPU RAM.
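If you want to try something similar, here's a minimal sketch of hybrid GPU + CPU serving with vLLM. The model path, offload size, and sampling settings are placeholders, not an exact recipe, and the MTP speculative-decoding option is commented out because the exact config keys depend on your vLLM version:

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint name - point this at whatever AWQ 4-bit GLM-4.7-Flash build you use.
llm = LLM(
    model="path/to/GLM-4.7-Flash-AWQ",
    quantization="awq",              # 4-bit AWQ weights, ~18.5 GB on disk
    max_model_len=200_000,           # long-context window
    gpu_memory_utilization=0.90,     # leave a little VRAM headroom
    cpu_offload_gb=16,               # spill part of the weights into CPU RAM (hybrid GPU + CPU)
    # speculative_config={...},      # MTP speculative decoding; exact keys vary by vLLM version
)

params = SamplingParams(max_tokens=256, temperature=0.2)
out = llm.generate(["Write a Python function that merges two sorted lists."], params)
print(out[0].outputs[0].text)
```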
I just stress-tested the Beast: MiniMax-M2.1 on a Z8 Fury G5. 2,101 tokens/sec aggregate across FORTY concurrent clients: 609 t/s out, 1,492 t/s in. The model outputs fire faster than I can type, but feeds on data like a black hole on cheat day. But wait, there's more! I threw it into Claude Code torture testing with 60+ tools and 8 agents (7 of them sub-agents, because apparently one wasn't enough chaos). It didn't even flinch. Extremely fast, scary good at coding. The kind of performance that makes you wonder if the model's been secretly reading Stack Overflow in its spare time, lol. Three months ago, these numbers lived in my "maybe in 2030" dreams. Today it's running on my desk AND heating my home office through the winter!
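The stress test itself is nothing exotic: N async clients hammering the server and summing token usage over wall-clock time. A rough sketch of such a harness, assuming the server exposes an OpenAI-compatible endpoint (the URL, model name, and prompt below are placeholders):

```python
import asyncio
import time
from openai import AsyncOpenAI

# Placeholder endpoint/model - point these at your local OpenAI-compatible server.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL, CLIENTS = "MiniMax-M2.1", 40

async def one_client(prompt: str):
    # One request per client; return (prompt tokens, completion tokens) from the usage stats.
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return resp.usage.prompt_tokens, resp.usage.completion_tokens

async def main():
    t0 = time.perf_counter()
    results = await asyncio.gather(*[
        one_client("Refactor a 200-line Python module into smaller functions.")
        for _ in range(CLIENTS)
    ])
    dt = time.perf_counter() - t0
    tokens_in = sum(r[0] for r in results)
    tokens_out = sum(r[1] for r in results)
    print(f"{CLIENTS} clients | in: {tokens_in/dt:.0f} t/s | out: {tokens_out/dt:.0f} t/s")

asyncio.run(main())
```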