Spaces:

tugrulkaya
/

transformer-edge-optimization

Sleeping

App Files Files Community

transformer-edge-optimization / README.md

tugrulkaya

Update README.md

20e779b verified about 1 month ago

preview code

raw

history blame contribute delete

8.43 kB

	---
	title: Transformer Edge Optimization
	emoji: 🚀
	colorFrom: blue
	colorTo: purple
	sdk: gradio
	sdk_version: 5.49.1
	app_file: app.py
	pinned: false
	license: mit
	tags:
	- quantization
	- optimization
	- edge-ai
	- mobile
	- transformers
	- onnx
	- sentiment-analysis
	duplicated_from: null
	---

	# 🚀 Transformer Edge Optimization Demo

	<div align="center">

	[![GitHub](https://img.shields.io/badge/GitHub-Repository-blue?logo=github)](https://github.com/mtkaya/transformer-edge-optimization)
	[![License](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/mtkaya/transformer-edge-optimization/blob/main/LICENSE)
	[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mtkaya/transformer-edge-optimization/blob/main/notebooks/01_quantization_basics.ipynb)

	Interactive demo comparing Original vs Quantized transformer models

	[Try Demo](#) • [GitHub Repo](https://github.com/mtkaya/transformer-edge-optimization) • [Notebooks](https://github.com/mtkaya/transformer-edge-optimization/tree/main/notebooks)

	</div>

	---

	## 🎯 What Does This Demo Do?

	This interactive demo showcases model quantization - a technique to make AI models smaller and faster for mobile/edge devices.

	### Try It:
	1. Quick Prediction - Test sentiment analysis with quantized model
	2. Model Comparison - Compare Original (FP32) vs Quantized (INT8) side by side
	3. Documentation - Learn about the techniques

	---

	## ✨ Key Results

	\| Metric \| Original \| Quantized \| Improvement \|
	\|--------\|----------\|-----------\|-------------\|
	\| Size \| 255 MB \| 68 MB \| 3.75x smaller ⬇️ \|
	\| Speed \| 12.3 ms \| 5.8 ms \| 2.1x faster ⚡ \|
	\| Accuracy \| 91.8% \| 90.2% \| -1.6% 📊 \|

	Conclusion: Nearly 4x smaller model with 2x faster inference and only 1.6% accuracy loss!

	---

	## 🧪 What is Quantization?

	Quantization reduces model size by converting weights from 32-bit floating point (FP32) to 8-bit integers (INT8).

	### How It Works:

	```python
	import torch
	from transformers import AutoModelForSequenceClassification

	# Load model
	model = AutoModelForSequenceClassification.from_pretrained(
	"distilbert-base-uncased-finetuned-sst-2-english"
	)

	# Quantize: FP32 → INT8
	quantized = torch.quantization.quantize_dynamic(
	model, {torch.nn.Linear}, dtype=torch.qint8
	)

	# Now 4x smaller! 🎉
	```

	### Why Quantization?

	- ✅ Smaller models - Fit on mobile devices
	- ✅ Faster inference - Better user experience
	- ✅ Lower power - Longer battery life
	- ✅ Easy to implement - Post-training, no retraining

	---

	## 📊 Optimization Techniques

	This project demonstrates 3 major techniques:

	### 1. Quantization (This Demo)
	- Compression: 4x
	- Speed: 2-3x faster
	- Difficulty: ⭐ Easy

	### 2. ONNX Runtime
	- Compression: 3.8x
	- Speed: 2.2x faster
	- Difficulty: ⭐⭐ Medium
	- Benefit: Cross-platform deployment

	### 3. Knowledge Distillation
	- Compression: 6-10x
	- Speed: 3x faster
	- Difficulty: ⭐⭐⭐ Advanced
	- Benefit: Student model learns from teacher

	---

	## 🚀 Try The Full Toolkit

	### Interactive Notebooks (Google Colab):

	#### 1. Quantization Basics (15 minutes)
	[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mtkaya/transformer-edge-optimization/blob/main/notebooks/01_quantization_basics.ipynb)

	Learn:
	- Dynamic quantization
	- Static quantization
	- Model size comparison
	- Performance benchmarking

	---

	#### 2. ONNX Runtime Optimization (20 minutes)
	[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mtkaya/transformer-edge-optimization/blob/main/notebooks/02_huggingface_optimum.ipynb)

	Learn:
	- PyTorch → ONNX conversion
	- Hugging Face Optimum
	- Cross-platform deployment
	- Hardware acceleration

	---

	#### 3. Knowledge Distillation (30 minutes)
	[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mtkaya/transformer-edge-optimization/blob/main/notebooks/05_distilbert_training.ipynb)

	Learn:
	- Teacher-student training
	- Distillation loss
	- Creating tiny models
	- BERT → TinyBERT

	---

	## 💻 Use Cases

	### 📱 Mobile Apps
	```kotlin
	// Android with TFLite
	val analyzer = SentimentAnalyzer(context)
	val result = analyzer.predict("Great app!")
	```

	### 🌐 Web Apps
	```javascript
	// Browser with Transformers.js
	import { pipeline } from '@xenova/transformers';
	const classifier = await pipeline('sentiment-analysis');
	```

	### 🤖 Edge Devices
	```python
	# Raspberry Pi with ONNX Runtime
	import onnxruntime as ort
	session = ort.InferenceSession("model.onnx")
	```

	---

	## 📚 Full Documentation

	### GitHub Repository
	[mtkaya/transformer-edge-optimization](https://github.com/mtkaya/transformer-edge-optimization)

	Contains:
	- ✅ 3 Jupyter notebooks
	- ✅ Example code (Python, Kotlin, JavaScript)
	- ✅ Comprehensive documentation
	- ✅ CI/CD pipeline
	- ✅ Docker support

	### Quick Links:
	- [Installation Guide](https://github.com/mtkaya/transformer-edge-optimization#-installation)
	- [Usage Examples](https://github.com/mtkaya/transformer-edge-optimization#-examples)
	- [API Reference](https://github.com/mtkaya/transformer-edge-optimization#-api-reference)
	- [Contributing](https://github.com/mtkaya/transformer-edge-optimization/blob/main/CONTRIBUTING.md)

	---

	## 🎓 Technical Details

	### Model Used:
	DistilBERT fine-tuned on SST-2 (Stanford Sentiment Treebank)

	- Base Model: `distilbert-base-uncased-finetuned-sst-2-english`
	- Parameters: 67M
	- Task: Binary sentiment classification (Positive/Negative)

	### Quantization Approach:
	Dynamic Quantization with PyTorch

	- Weights: INT8 (8-bit integers)
	- Activations: FP32 (computed at runtime)
	- Method: `torch.quantization.quantize_dynamic()`

	### Benchmark Hardware:
	- CPU: Intel Xeon (Colab)
	- Input: 128 tokens average
	- Iterations: 100 runs per test

	---

	## 📊 Detailed Benchmark

	### Model Size:
	```
	Original (FP32): 255.43 MB
	Quantized (INT8): 68.12 MB
	Compression Ratio: 3.75x
	Space Saved: 187.31 MB (73.3%)
	```

	### Inference Speed (CPU):
	```
	Original: 12.34 ± 0.45 ms
	Quantized: 5.78 ± 0.23 ms
	Speedup: 2.13x
	Time Saved: 6.56 ms per inference (53.2%)
	```

	### Accuracy (SST-2 Test Set):
	```
	Original: 91.8% accuracy
	Quantized: 90.2% accuracy
	Difference: -1.6%
	```

	### Memory Usage:
	```
	Original: ~280 MB
	Quantized: ~95 MB
	Reduction: 2.95x
	```

	---

	## 🌟 Features of This Demo

	### 🎯 Quick Prediction
	- Enter any text
	- Toggle between Original/Quantized
	- See prediction + confidence + model info

	### ⚖️ Model Comparison
	- Side-by-side comparison
	- Same input, both models
	- Performance metrics

	### 📚 Documentation
	- Learn about quantization
	- See benchmark results
	- Access notebooks
	- Quick start code

	---

	## 🤝 Contributing

	We welcome contributions! Check out:

	- GitHub Issues: [Report bugs](https://github.com/mtkaya/transformer-edge-optimization/issues)
	- Discussions: [Ask questions](https://github.com/mtkaya/transformer-edge-optimization/discussions)
	- Pull Requests: [Contribute code](https://github.com/mtkaya/transformer-edge-optimization/pulls)

	---

	## 📄 License

	This project is licensed under the MIT License.

	See [LICENSE](https://github.com/mtkaya/transformer-edge-optimization/blob/main/LICENSE) for details.

	---

	## 🙏 Acknowledgments

	Built with:
	- [Hugging Face Transformers](https://github.com/huggingface/transformers)
	- [PyTorch](https://pytorch.org/)
	- [Gradio](https://gradio.app/)

	Inspired by:
	- [DistilBERT paper](https://arxiv.org/abs/1910.01108) (Sanh et al., 2019)
	- [Q8BERT](https://arxiv.org/abs/1910.06188) (Zafrir et al., 2021)

	---

	## 📧 Contact

	- GitHub: [@mtkaya](https://github.com/mtkaya)
	- Issues: [Report here](https://github.com/mtkaya/transformer-edge-optimization/issues)

	---

	<div align="center">

	⭐ Star the repo if you find this useful! ⭐

	[GitHub Repository](https://github.com/mtkaya/transformer-edge-optimization) •
	[Documentation](https://github.com/mtkaya/transformer-edge-optimization#readme) •
	[Notebooks](https://github.com/mtkaya/transformer-edge-optimization/tree/main/notebooks)

	Made with ❤️ for the AI community

	</div>