---
title: Transformer Edge Optimization
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
tags:
- quantization
- optimization
- edge-ai
- mobile
- transformers
- onnx
- sentiment-analysis
duplicated_from: null
---
# Transformer Edge Optimization Demo

Interactive demo comparing Original vs Quantized transformer models

Try Demo • GitHub Repo • Notebooks
## What Does This Demo Do?

This interactive demo showcases model quantization - a technique to make AI models smaller and faster for mobile/edge devices.

Try It:
- Quick Prediction - Test sentiment analysis with the quantized model
- Model Comparison - Compare Original (FP32) vs Quantized (INT8) side by side
- Documentation - Learn about the techniques
## Key Results

| Metric | Original | Quantized | Improvement |
|---|---|---|---|
| Size | 255 MB | 68 MB | 3.75x smaller |
| Speed | 12.3 ms | 5.8 ms | 2.1x faster |
| Accuracy | 91.8% | 90.2% | -1.6% |
Conclusion: Nearly 4x smaller, more than 2x faster at inference, and only a 1.6 percentage-point drop in accuracy.
## What is Quantization?

Quantization reduces model size by converting weights from 32-bit floating point (FP32) to 8-bit integers (INT8).

How It Works:
```python
import torch
from transformers import AutoModelForSequenceClassification

# Load the FP32 model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

# Quantize: FP32 -> INT8 (dynamic quantization of the Linear layers)
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# The quantized model is roughly 4x smaller
```
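To check the size reduction yourself, a minimal sketch (reusing the `model` and `quantized` objects from the snippet above; the temporary file name is arbitrary) is to serialize each state dict and compare file sizes:

```python
import os
import torch

def model_size_mb(m, path="tmp_weights.pt"):
    # Serialize the weights, read the file size, then clean up
    torch.save(m.state_dict(), path)
    size_mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return size_mb

print(f"FP32 size: {model_size_mb(model):.1f} MB")
print(f"INT8 size: {model_size_mb(quantized):.1f} MB")
```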
Why Quantization?
- Smaller models - fit on mobile devices
- Faster inference - better user experience
- Lower power - longer battery life
- Easy to implement - post-training, no retraining required
## Optimization Techniques

This project demonstrates 3 major techniques:

1. Quantization (This Demo)
- Compression: 4x
- Speed: 2-3x faster
- Difficulty: Easy
2. ONNX Runtime
- Compression: 3.8x
- Speed: 2.2x faster
- Difficulty: Medium
- Benefit: Cross-platform deployment (see the export sketch below)
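The notebooks cover the full conversion; as a rough illustration only, a plain `torch.onnx.export` of the same DistilBERT checkpoint might look like this (the opset version and dynamic axes are illustrative choices, not settings taken from the project):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Trace the model with a dummy input and write the graph to model.onnx
dummy = tokenizer("ONNX export example", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
    },
    opset_version=14,
)
```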
3. Knowledge Distillation
- Compression: 6-10x
- Speed: 3x faster
- Difficulty: Advanced
- Benefit: A small student model learns from a larger teacher (see the loss sketch below)
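Distillation is covered in the third notebook; as a minimal sketch of the idea, the standard distillation objective blends a softened KL term against the teacher's logits with ordinary cross-entropy on the labels (the temperature `T` and weight `alpha` below are illustrative defaults, not values from this project):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's softened distribution (scaled by T^2)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy on the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```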
## Try The Full Toolkit

Interactive Notebooks (Google Colab):
1. Quantization Basics (15 minutes)
Learn:
- Dynamic quantization
- Static quantization
- Model size comparison
- Performance benchmarking
2. ONNX Runtime Optimization (20 minutes)
Learn:
- PyTorch → ONNX conversion
- Hugging Face Optimum (see the sketch below)
- Cross-platform deployment
- Hardware acceleration
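The notebook walks through Optimum in detail; as a rough sketch of the API, Optimum can export and run the same checkpoint with ONNX Runtime in a few lines (the `export=True` flag assumes a recent `optimum` release):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# export=True converts the PyTorch checkpoint to ONNX on the fly
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

classifier = pipeline("text-classification", model=ort_model, tokenizer=tokenizer)
print(classifier("This optimization toolkit is fantastic!"))
```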
3. Knowledge Distillation (30 minutes)
Learn:
- Teacher-student training
- Distillation loss
- Creating tiny models
- BERT → TinyBERT
## Use Cases

### Mobile Apps

```kotlin
// Android with TFLite
val analyzer = SentimentAnalyzer(context)
val result = analyzer.predict("Great app!")
```
### Web Apps

```javascript
// Browser with Transformers.js
import { pipeline } from '@xenova/transformers';
const classifier = await pipeline('sentiment-analysis');
```
### Edge Devices

```python
# Raspberry Pi with ONNX Runtime
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
```
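A slightly fuller sketch of the same idea, assuming the model was exported with `input_ids`/`attention_mask` input names (as in the export example above) and that label 1 means positive, which holds for the SST-2 checkpoint used here:

```python
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
session = ort.InferenceSession("model.onnx")

# Tokenize to NumPy arrays and feed them to the ONNX graph
enc = tokenizer("Runs fine on a Raspberry Pi", return_tensors="np")
logits = session.run(
    None,
    {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]},
)[0]
print("Positive" if logits.argmax(-1)[0] == 1 else "Negative")
```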
## Full Documentation

GitHub Repository: mtkaya/transformer-edge-optimization
Contains:
- 3 Jupyter notebooks
- Example code (Python, Kotlin, JavaScript)
- Comprehensive documentation
- CI/CD pipeline
- Docker support
## Technical Details
Model Used:
DistilBERT fine-tuned on SST-2 (Stanford Sentiment Treebank)
- Base Model: distilbert-base-uncased-finetuned-sst-2-english
- Parameters: 67M
- Task: Binary sentiment classification (Positive/Negative)
Quantization Approach:
Dynamic Quantization with PyTorch
- Weights: INT8 (8-bit integers)
- Activations: FP32, quantized dynamically at inference time
- Method: torch.quantization.quantize_dynamic()
Benchmark Hardware:
- CPU: Intel Xeon (Colab)
- Input: 128 tokens average
- Iterations: 100 runs per test (see the timing sketch below)
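The exact harness isn't reproduced here, but a minimal timing sketch consistent with that setup (128-token padded input, 100 timed CPU runs, reusing the `model` and `quantized` objects from the quantization snippet) could look like:

```python
import time
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
inputs = tokenizer("Benchmark sentence", return_tensors="pt",
                   padding="max_length", max_length=128)

def mean_latency_ms(m, runs=100, warmup=10):
    # Warm up, then average wall-clock latency over repeated forward passes
    with torch.no_grad():
        for _ in range(warmup):
            m(**inputs)
        times = []
        for _ in range(runs):
            start = time.perf_counter()
            m(**inputs)
            times.append((time.perf_counter() - start) * 1000)
    return sum(times) / len(times)

print(f"FP32: {mean_latency_ms(model):.2f} ms")
print(f"INT8: {mean_latency_ms(quantized):.2f} ms")
```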
## Detailed Benchmark
Model Size:
Original (FP32): 255.43 MB
Quantized (INT8): 68.12 MB
Compression Ratio: 3.75x
Space Saved: 187.31 MB (73.3%)
Inference Speed (CPU):
Original: 12.34 ± 0.45 ms
Quantized: 5.78 ± 0.23 ms
Speedup: 2.13x
Time Saved: 6.56 ms per inference (53.2%)
Accuracy (SST-2 Test Set):
Original: 91.8% accuracy
Quantized: 90.2% accuracy
Difference: -1.6 percentage points
Memory Usage:
Original: ~280 MB
Quantized: ~95 MB
Reduction: 2.95x
## Features of This Demo

### Quick Prediction
- Enter any text
- Toggle between Original/Quantized
- See prediction + confidence + model info
### Model Comparison
- Side-by-side comparison
- Same input, both models
- Performance metrics
### Documentation
- Learn about quantization
- See benchmark results
- Access notebooks
- Quick start code
## Contributing
We welcome contributions! Check out:
- GitHub Issues: Report bugs
- Discussions: Ask questions
- Pull Requests: Contribute code
## License
This project is licensed under the MIT License.
See LICENSE for details.
## Acknowledgments

Built with PyTorch, Hugging Face Transformers, ONNX Runtime, and Gradio.

Inspired by:
- DistilBERT paper (Sanh et al., 2019)
- Q8BERT (Zafrir et al., 2019)
## Contact
- GitHub: @mtkaya
- Issues: Report here
Star the repo if you find this useful!

GitHub Repository • Documentation • Notebooks

Made with ❤️ for the AI community