|
|
--- |
|
|
title: Transformer Edge Optimization |
|
|
emoji: 🚀
|
|
colorFrom: blue |
|
|
colorTo: purple |
|
|
sdk: gradio |
|
|
sdk_version: 5.49.1 |
|
|
app_file: app.py |
|
|
pinned: false |
|
|
license: mit |
|
|
tags: |
|
|
- quantization |
|
|
- optimization |
|
|
- edge-ai |
|
|
- mobile |
|
|
- transformers |
|
|
- onnx |
|
|
- sentiment-analysis |
|
|
duplicated_from: null |
|
|
--- |
|
|
|
|
|
# 🚀 Transformer Edge Optimization Demo
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
[![GitHub](https://img.shields.io/badge/GitHub-Repository-181717?logo=github)](https://github.com/mtkaya/transformer-edge-optimization)
|
|
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/mtkaya/transformer-edge-optimization/blob/main/LICENSE)
|
|
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mtkaya/transformer-edge-optimization/blob/main/notebooks/01_quantization_basics.ipynb)
|
|
|
|
|
**Interactive demo comparing Original vs Quantized transformer models** |
|
|
|
|
|
[Try Demo](#) • [GitHub Repo](https://github.com/mtkaya/transformer-edge-optimization) • [Notebooks](https://github.com/mtkaya/transformer-edge-optimization/tree/main/notebooks)
|
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
## 🎯 What Does This Demo Do?
|
|
|
|
|
This interactive demo showcases **model quantization** - a technique to make AI models smaller and faster for mobile/edge devices. |
|
|
|
|
|
### Try It: |
|
|
1. **Quick Prediction** - Test sentiment analysis with the quantized model
|
|
2. **Model Comparison** - Compare Original (FP32) vs Quantized (INT8) side by side |
|
|
3. **Documentation** - Learn about the techniques |
|
|
|
|
|
--- |
|
|
|
|
|
## ✨ Key Results
|
|
|
|
|
| Metric | Original | Quantized | Improvement | |
|
|
|--------|----------|-----------|-------------| |
|
|
| **Size** | 255 MB | 68 MB | **3.75x smaller** ⬇️ |
|
|
| **Speed** | 12.3 ms | 5.8 ms | **2.1x faster** ⚡ |
|
|
| **Accuracy** | 91.8% | 90.2% | **-1.6%** 📉 |
|
|
|
|
|
**Conclusion:** Nearly **4x smaller** model with **2x faster** inference and only a **1.6-point accuracy drop**!
|
|
|
|
|
--- |
|
|
|
|
|
## 🧪 What is Quantization?
|
|
|
|
|
**Quantization** reduces model size by converting weights from 32-bit floating point (FP32) to 8-bit integers (INT8). |
|
|
|
|
|
### How It Works: |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import AutoModelForSequenceClassification |
|
|
|
|
|
# Load model |
|
|
model = AutoModelForSequenceClassification.from_pretrained( |
|
|
"distilbert-base-uncased-finetuned-sst-2-english" |
|
|
) |
|
|
|
|
|
# Quantize: FP32 → INT8
|
|
quantized = torch.quantization.quantize_dynamic( |
|
|
    model, {torch.nn.Linear}, dtype=torch.qint8  # only nn.Linear weights are converted
|
|
) |
|
|
|
|
|
# Linear-layer weights are now INT8, about 4x smaller! 🎉
|
|
``` |
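
The quantized model is a drop-in replacement for the original. A quick check, continuing the snippet above (the example sentence is just illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
inputs = tokenizer("This movie was great!", return_tensors="pt")

# Same forward pass as the original model
with torch.no_grad():
    logits = quantized(**inputs).logits

print(quantized.config.id2label[logits.argmax(-1).item()])  # POSITIVE
```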
|
|
|
|
|
### Why Quantization? |
|
|
|
|
|
- ✅ **Smaller models** - Fit on mobile devices
**Smaller models** - Fit on mobile devices |
|
|
- ✅ **Faster inference** - Better user experience
**Faster inference** - Better user experience |
|
|
- ✅ **Lower power** - Longer battery life
**Lower power** - Longer battery life |
|
|
- ✅ **Easy to implement** - Post-training, no retraining
**Easy to implement** - Post-training, no retraining |
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Optimization Techniques
|
|
|
|
|
This project demonstrates **3 major techniques**: |
|
|
|
|
|
### 1. **Quantization** (This Demo) |
|
|
- **Compression:** 4x |
|
|
- **Speed:** 2-3x faster |
|
|
- **Difficulty:** ⭐ Easy
|
|
|
|
|
### 2. **ONNX Runtime** |
|
|
- **Compression:** 3.8x |
|
|
- **Speed:** 2.2x faster |
|
|
- **Difficulty:** ⭐⭐ Medium
|
|
- **Benefit:** Cross-platform deployment (see the sketch below)
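
For a taste of technique 2, a minimal sketch using Hugging Face Optimum (assumes `pip install optimum[onnxruntime]`; notebook 2 below covers this in depth):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# export=True converts the PyTorch checkpoint to ONNX on the fly
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The ONNX model slots into the usual pipeline API
classifier = pipeline("text-classification", model=ort_model, tokenizer=tokenizer)
print(classifier("Runs anywhere ONNX Runtime does!"))
```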
|
|
|
|
|
### 3. **Knowledge Distillation** |
|
|
- **Compression:** 6-10x |
|
|
- **Speed:** 3x faster |
|
|
- **Difficulty:** ⭐⭐⭐ Advanced
|
|
- **Benefit:** Student model learns from a larger teacher (see the loss sketch below)
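
The heart of technique 3 is the loss function: the student is trained against the teacher's softened outputs as well as the true labels. A minimal sketch (the temperature and weighting below are illustrative defaults, not the exact recipe from the notebook):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: match the teacher's softened distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    # Hard targets: standard cross-entropy on the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```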
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Try The Full Toolkit
|
|
|
|
|
### Interactive Notebooks (Google Colab): |
|
|
|
|
|
#### 1. Quantization Basics (15 minutes) |
|
|
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mtkaya/transformer-edge-optimization/blob/main/notebooks/01_quantization_basics.ipynb)
|
|
|
|
|
**Learn:** |
|
|
- Dynamic quantization |
|
|
- Static quantization |
|
|
- Model size comparison |
|
|
- Performance benchmarking |
|
|
|
|
|
--- |
|
|
|
|
|
#### 2. ONNX Runtime Optimization (20 minutes) |
|
|
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mtkaya/transformer-edge-optimization/blob/main/notebooks/02_huggingface_optimum.ipynb)
|
|
|
|
|
**Learn:** |
|
|
- PyTorch → ONNX conversion
|
|
- Hugging Face Optimum |
|
|
- Cross-platform deployment |
|
|
- Hardware acceleration |
|
|
|
|
|
--- |
|
|
|
|
|
#### 3. Knowledge Distillation (30 minutes) |
|
|
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mtkaya/transformer-edge-optimization/blob/main/notebooks/05_distilbert_training.ipynb)
|
|
|
|
|
**Learn:** |
|
|
- Teacher-student training |
|
|
- Distillation loss |
|
|
- Creating tiny models |
|
|
- BERT → TinyBERT
|
|
|
|
|
--- |
|
|
|
|
|
## 💻 Use Cases
|
|
|
|
|
### 📱 Mobile Apps
|
|
```kotlin |
|
|
// Android with TFLite |
|
|
// `SentimentAnalyzer` is a hypothetical app-side wrapper around the converted model
val analyzer = SentimentAnalyzer(context)
|
|
val result = analyzer.predict("Great app!") |
|
|
``` |
|
|
|
|
|
### 🌐 Web Apps
|
|
```javascript |
|
|
// Browser with Transformers.js |
|
|
import { pipeline } from '@xenova/transformers'; |
|
|
const classifier = await pipeline('sentiment-analysis');
const result = await classifier('Great app!');  // [{ label: 'POSITIVE', score: ... }]
|
|
``` |
|
|
|
|
|
### 🤖 Edge Devices
|
|
```python |
|
|
# Raspberry Pi with ONNX Runtime |
|
|
import onnxruntime as ort |
|
|
session = ort.InferenceSession("model.onnx")
# run with tokenized inputs, e.g. session.run(None, {"input_ids": ..., "attention_mask": ...})
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 📚 Full Documentation
|
|
|
|
|
### GitHub Repository |
|
|
**[mtkaya/transformer-edge-optimization](https://github.com/mtkaya/transformer-edge-optimization)** |
|
|
|
|
|
Contains: |
|
|
- ✅ 3 Jupyter notebooks
3 Jupyter notebooks |
|
|
- ✅ Example code (Python, Kotlin, JavaScript)
Example code (Python, Kotlin, JavaScript) |
|
|
- ✅ Comprehensive documentation
Comprehensive documentation |
|
|
- ✅ CI/CD pipeline
CI/CD pipeline |
|
|
- ✅ Docker support
Docker support |
|
|
|
|
|
### Quick Links: |
|
|
- [Installation Guide](https://github.com/mtkaya/transformer-edge-optimization#-installation) |
|
|
- [Usage Examples](https://github.com/mtkaya/transformer-edge-optimization#-examples) |
|
|
- [API Reference](https://github.com/mtkaya/transformer-edge-optimization#-api-reference) |
|
|
- [Contributing](https://github.com/mtkaya/transformer-edge-optimization/blob/main/CONTRIBUTING.md) |
|
|
|
|
|
--- |
|
|
|
|
|
## 📋 Technical Details
|
|
|
|
|
### Model Used: |
|
|
**DistilBERT** fine-tuned on SST-2 (Stanford Sentiment Treebank) |
|
|
|
|
|
- Base Model: `distilbert-base-uncased-finetuned-sst-2-english` |
|
|
- Parameters: 67M |
|
|
- Task: Binary sentiment classification (Positive/Negative) |
|
|
|
|
|
### Quantization Approach: |
|
|
**Dynamic Quantization** with PyTorch |
|
|
|
|
|
- Weights: INT8 (8-bit integers) |
|
|
- Activations: FP32, quantized dynamically at runtime
|
|
- Method: `torch.quantization.quantize_dynamic()` (verified in the check below)
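
To see this on the model itself, inspect one of the converted layers. A quick check, assuming the `quantized` model from the snippet earlier (the module path is specific to DistilBERT):

```python
# nn.Linear layers are swapped for dynamically quantized equivalents
layer = quantized.distilbert.transformer.layer[0].attention.q_lin
print(layer)                 # DynamicQuantizedLinear(in_features=768, out_features=768, ...)
print(layer.weight().dtype)  # torch.qint8
```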
|
|
|
|
|
### Benchmark Hardware: |
|
|
- **CPU:** Intel Xeon (Colab) |
|
|
- **Input:** 128 tokens average |
|
|
- **Iterations:** 100 runs per test (see the reproduction sketch below)
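
A minimal sketch of how numbers like these can be reproduced, assuming the `quantized` model and tokenized `inputs` from the snippets earlier (exact figures will vary with hardware and library versions):

```python
import os
import time
import torch

def model_size_mb(model):
    # Serialize the weights and measure the file on disk
    torch.save(model.state_dict(), "tmp_weights.pt")
    size = os.path.getsize("tmp_weights.pt") / 1e6
    os.remove("tmp_weights.pt")
    return size

def mean_latency_ms(model, inputs, runs=100, warmup=10):
    # Average wall-clock time per forward pass after a short warmup
    with torch.no_grad():
        for _ in range(warmup):
            model(**inputs)
        start = time.perf_counter()
        for _ in range(runs):
            model(**inputs)
    return (time.perf_counter() - start) / runs * 1000

print(f"{model_size_mb(quantized):.2f} MB, {mean_latency_ms(quantized, inputs):.2f} ms")
```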
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 Detailed Benchmark
|
|
|
|
|
### Model Size: |
|
|
``` |
|
|
Original (FP32): 255.43 MB |
|
|
Quantized (INT8): 68.12 MB |
|
|
Compression Ratio: 3.75x |
|
|
Space Saved: 187.31 MB (73.3%) |
|
|
``` |
|
|
|
|
|
### Inference Speed (CPU): |
|
|
``` |
|
|
Original: 12.34 ± 0.45 ms
|
|
Quantized: 5.78 ± 0.23 ms
|
|
Speedup: 2.13x |
|
|
Time Saved: 6.56 ms per inference (53.2%) |
|
|
``` |
|
|
|
|
|
### Accuracy (SST-2 Test Set): |
|
|
``` |
|
|
Original: 91.8% accuracy |
|
|
Quantized: 90.2% accuracy |
|
|
Difference: -1.6% |
|
|
``` |
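
A sketch of how such an accuracy number can be computed with the `datasets` library (using the public SST-2 validation split, since the test labels are not released; assumes the model and tokenizer from the snippets earlier):

```python
import torch
from datasets import load_dataset

def sst2_accuracy(model, tokenizer):
    # 872 labeled sentences; labels: 0 = negative, 1 = positive
    ds = load_dataset("glue", "sst2", split="validation")
    correct = 0
    for example in ds:
        inputs = tokenizer(example["sentence"], return_tensors="pt", truncation=True)
        with torch.no_grad():
            pred = model(**inputs).logits.argmax(-1).item()
        correct += int(pred == example["label"])
    return correct / len(ds)
```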
|
|
|
|
|
### Memory Usage: |
|
|
``` |
|
|
Original: ~280 MB |
|
|
Quantized: ~95 MB |
|
|
Reduction: 2.95x |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 🌟 Features of This Demo
|
|
|
|
|
### 🎯 Quick Prediction
|
|
- Enter any text |
|
|
- Toggle between Original/Quantized |
|
|
- See prediction + confidence + model info |
|
|
|
|
|
### ⚖️ Model Comparison
|
|
- Side-by-side comparison |
|
|
- Same input, both models |
|
|
- Performance metrics |
|
|
|
|
|
### 📖 Documentation
|
|
- Learn about quantization |
|
|
- See benchmark results |
|
|
- Access notebooks |
|
|
- Quick start code |
|
|
|
|
|
--- |
|
|
|
|
|
## 🤝 Contributing
|
|
|
|
|
We welcome contributions! Check out: |
|
|
|
|
|
- **GitHub Issues:** [Report bugs](https://github.com/mtkaya/transformer-edge-optimization/issues) |
|
|
- **Discussions:** [Ask questions](https://github.com/mtkaya/transformer-edge-optimization/discussions) |
|
|
- **Pull Requests:** [Contribute code](https://github.com/mtkaya/transformer-edge-optimization/pulls) |
|
|
|
|
|
--- |
|
|
|
|
|
## 📄 License
|
|
|
|
|
This project is licensed under the **MIT License**. |
|
|
|
|
|
See [LICENSE](https://github.com/mtkaya/transformer-edge-optimization/blob/main/LICENSE) for details. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🙏 Acknowledgments
|
|
|
|
|
Built with: |
|
|
- [Hugging Face Transformers](https://github.com/huggingface/transformers) |
|
|
- [PyTorch](https://pytorch.org/) |
|
|
- [Gradio](https://gradio.app/) |
|
|
|
|
|
Inspired by: |
|
|
- [DistilBERT paper](https://arxiv.org/abs/1910.01108) (Sanh et al., 2019) |
|
|
- [Q8BERT](https://arxiv.org/abs/1910.06188) (Zafrir et al., 2019)
|
|
|
|
|
--- |
|
|
|
|
|
## 📧 Contact
|
|
|
|
|
- **GitHub:** [@mtkaya](https://github.com/mtkaya) |
|
|
- **Issues:** [Report here](https://github.com/mtkaya/transformer-edge-optimization/issues) |
|
|
|
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**⭐ Star the repo if you find this useful! ⭐**
|
|
|
|
|
[GitHub Repository](https://github.com/mtkaya/transformer-edge-optimization) •
|
|
[Documentation](https://github.com/mtkaya/transformer-edge-optimization#readme) •
|
|
[Notebooks](https://github.com/mtkaya/transformer-edge-optimization/tree/main/notebooks) |
|
|
|
|
|
**Made with ❤️ for the AI community**
|
|
|
|
|
</div> |