INT4 vs FP4: The Future of 4-Bit Quantization
How Kimi Stole the Show
4-bit quantization has been simmering in research labs for years. The promise was always clear: cut model size by 4× relative to BF16/FP16, preserve most of the quality, and democratize deployment of large models.
When Nvidia unveiled Blackwell in early 2024, NVFP4 was a centerpiece feature: a proper 4-bit floating-point format designed specifically for neural networks. Unlike INT4's uniform integer grid, FP4 brings the dynamic range advantages of floating-point to 4-bit precision. More precision near zero where weights cluster, wider range for outliers, better numerical properties overall.
FP4 is objectively superior to INT4. The format gives you:
- Dynamic range: Sign bit + 2-bit exponent + 1-bit mantissa (E2M1)
- Better representation near zero: Where most weights live
- Proper handling of outliers: Without clipping or overflow
- Hardware acceleration: Designed for Tensor Cores on Blackwell
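To make the format concrete, here is a minimal sketch that decodes all 16 E2M1 codes into their real values, assuming the standard E2M1 encoding (exponent bias of 1, subnormals when the exponent field is 0). It illustrates the element format only, not Nvidia's NVFP4 kernels or block-scaling machinery.

```python
# Minimal sketch: decode a 4-bit E2M1 code (the FP4 element format) to a float.
# Assumes the standard layout: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.

def decode_e2m1(code: int) -> float:
    """Map a 4-bit code (0..15) to its E2M1 value."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11          # 2-bit exponent field
    man = code & 0b1                  # 1-bit mantissa
    if exp == 0:                      # subnormal: no implicit leading 1
        mag = 0.5 * man               # gives 0 and 0.5
    else:                             # normal: implicit leading 1, bias of 1
        mag = (1.0 + 0.5 * man) * 2.0 ** (exp - 1)
    return sign * mag

print(sorted({decode_e2m1(c) for c in range(16)}))
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Half of the representable magnitudes sit at or below 1.5, which is exactly the "more precision near zero" property.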
Then Kimi dropped the K2 Thinking model, in INT4.
Why INT4
The reasons are simple.
Kimi doesn't have Blackwell GPUs. The same is true for other Chinese model makers. Due to export controls, their training clusters likely consist mainly of Ampere (A800) and Hopper (H800) parts. Ampere has basic INT4 support through its third-generation Tensor Cores. On Hopper, INT4 is supported only through software paths built around quantization methods like AWQ (Activation-aware Weight Quantization), not native hardware acceleration.
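For a sense of what that software path looks like, here is a minimal sketch of group-wise symmetric INT4 weight quantization in PyTorch. The group size and per-group scaling are illustrative assumptions; this is not Kimi's recipe or AWQ's actual implementation (AWQ additionally rescales weights based on activation statistics before quantizing).

```python
# Minimal sketch: group-wise symmetric INT4 weight quantization (illustrative only).
import torch

def quantize_int4(w: torch.Tensor, group_size: int = 128):
    """Quantize a [rows, cols] weight matrix to INT4 codes with per-group scales."""
    rows, cols = w.shape
    g = w.reshape(rows, cols // group_size, group_size)
    scale = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0  # map max |w| to level 7
    q = torch.clamp(torch.round(g / scale), -8, 7).to(torch.int8)     # INT4 levels in [-8, 7]
    return q.reshape(rows, cols), scale.squeeze(-1)

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor, group_size: int = 128):
    rows, cols = q.shape
    g = q.reshape(rows, cols // group_size, group_size).float()
    return (g * scale.unsqueeze(-1)).reshape(rows, cols)

w = torch.randn(16, 256)
q, s = quantize_int4(w)
print((w - dequantize_int4(q, s)).abs().max())  # worst-case quantization error
```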
Their customers don't have Blackwell GPUs either.
The Ampere series is still mainstream in China and remains popular in the US too. As such, it makes total sense to optimize your model around INT4, which is where the ecosystem is centered.
Blackwell cannot natively run inference on Kimi's INT4 model. This is the ironic part, because Nvidia went all-in on NVFP4. The architecture simply lacks the Tensor Core instructions for native INT4. You can run INT4 models through software emulation, but you lose the performance gains that made Blackwell compelling in the first place.
The Nuance of Conversion
Here is another idea for Blackwell: take the Kimi model and convert it from INT4 to FP4. This is effectively PTQ (post-training quantization), which is lossy.
What Gets Lost
QAT Training Path (INT4):
───────────────────────────────────────────────────
BF16 activations → [Quantize] → INT4 → Forward pass
                                  │
                                  ├─> Model learns INT4's specific grid spacing
                                  ├─> Weights cluster at INT4's quantization levels
                                  └─> Error compensation tuned for INT4's properties

All intermediate training trajectories: shaped by INT4

PTQ Conversion (INT4 → FP4):
───────────────────────────────────────────────────
Trained INT4 weights → [Direct format conversion] → FP4
                                  │
Lost: the BF16→INT4 training trajectories
Lost: the quantization-aware adaptations
Lost: the error compensation learned during training

Result: Suboptimal FP4 model
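To make the QAT path concrete, here is a minimal sketch of the standard fake-quantize trick with a straight-through estimator: the forward pass snaps weights to the INT4 grid, while gradients flow as if no rounding happened, so the model adapts to the grid during training. The per-tensor scaling is an illustrative assumption, not Kimi's actual training recipe.

```python
# Minimal sketch: fake INT4 quantization with a straight-through estimator (STE).
import torch

def fake_quantize_int4(w: torch.Tensor) -> torch.Tensor:
    scale = w.abs().amax() / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7) * scale  # snap to the INT4 grid
    return w + (q - w).detach()   # forward uses q; backward sees the identity

w = torch.randn(8, 8, requires_grad=True)
loss = fake_quantize_int4(w).sum()
loss.backward()
print(torch.allclose(w.grad, torch.ones_like(w)))  # True: gradients pass straight through
```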
The Distribution Mismatch
INT4 and FP4 have fundamentally different value distributions:
INT4: Uniform spacing
─────────────────────────────────────────
 -8 -7 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6  7
  •  •  •  •  •  •  •  •  •  •  •  •  •  •  •  •
 └──────────────────────┬───────────────────────┘
              Equal spacing everywhere
FP4 (E2M1): Exponential spacing (more precision near zero)
─────────────────────────────────────────
                     Near zero: dense spacing
                                       ↓
   -6   -4   -3   -2 -1.5   -1 -0.5    0  0.5    1  1.5    2    3    4    6
    •    •    •    •    •    •    •    •    •    •    •    •    •    •    •
  Wider steps at the edges (2), medium in the middle (1), fine near zero (0.5)
When you PTQ from INT4 to FP4, you are trying to map a uniform grid onto an exponential one. Values that sat exactly on INT4's quantization levels will land between FP4's levels. The same holds in reverse when converting FP4 to INT4.
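A small sketch makes the mismatch visible: snap each dequantized INT4 level onto the nearest FP4 (E2M1) value and look at the residual error. The scale choice below is an illustrative assumption; a real converter would also re-fit per-block scale factors.

```python
# Minimal sketch: the INT4 grid does not line up with the FP4 (E2M1) grid.
FP4_MAGS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID = sorted({s * v for v in FP4_MAGS for s in (-1.0, 1.0)})

def to_nearest_fp4(x: float) -> float:
    return min(FP4_GRID, key=lambda v: abs(v - x))

scale = 6.0 / 7.0  # illustrative: map INT4's max level (7) onto FP4's max magnitude (6)
for level in range(-8, 8):
    w = level * scale               # dequantized INT4 weight
    fp4 = to_nearest_fp4(w)
    print(f"INT4 level {level:+d} -> {w:+.3f} -> FP4 {fp4:+.2f}  (error {abs(fp4 - w):.3f})")
```

Most levels pick up a rounding error, and the largest errors appear toward the edges of the range, where FP4's spacing is coarsest.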
What If Kimi Were Trained on Blackwell?
Here is the sad part: the resulting model would likely have been even stronger than what we have today. As explained above, FP4 is the newer and better format, designed for exactly this purpose.
How the Future Will Unfold
1. FP4 Will Eventually Dominate
This is inevitable. FP4 is technically superior: better numerical properties, wider dynamic range, designed by people who understand neural network weight distributions. Once Blackwell reaches critical mass globally, FP4-native models will dominate too.
The timeline? Hard to say. We only know this much for sure: a great many capable training teams cannot access the hardware today.
2. The Ampere Renaissance
Here's the interesting twist: Ampere and Hopper stay relevant longer than Nvidia's roadmap implied.
Why? Chinese models trained for INT4 create a large ecosystem of inference workloads optimized for pre-Blackwell hardware. If you are a neocloud or hyperscaler sitting on a fleet of H100s, Kimi just made them valuable for at least another two years.
Big tech's GPU depreciation narrative improves. CFOs are happy. The urgency to upgrade dwindles. INT4-native Chinese models provide a compelling reason to keep older hardware in production rather than writing it off.
3. Challenges for Blackwell
On the other hand, if the hottest open-source models can't be served on Blackwell without a performance penalty, the drive to upgrade hardware weakens. Pressure is also mounting on the other side of the fence: for the training teams that do have access to Blackwell, are you up to the challenge of training a SOTA model in FP4?
4. The Next Chapter
FP4 will eventually replace INT4. Innovation hasn't slowed down; it always finds a new path. The tension between Chinese models and American hardware fragments an ecosystem that could have been simpler, and it creates the interesting dynamics I touched on in this post.
This pattern will continue. A few months or quarters down the road, another story will emerge.
References:
- Nvidia Hopper Architecture In-Depth
- Nvidia Tensor Core Evolution: From Volta to Blackwell
- Tom's Hardware: Nvidia Shares Blackwell Ultra's Secrets
- Kimi K2 Technical Report
- NVFP4 Quantization Guide - DGX Spark
- US-China GPU Export Controls Timeline
- SemiAnalysis: 2025 AI Diffusion Export Controls
- QAT: The Art of Growing a Bonsai Model