---
title: Transformer Edge Optimization
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
tags:
  - quantization
  - optimization
  - edge-ai
  - mobile
  - transformers
  - onnx
  - sentiment-analysis
duplicated_from: null
---

🚀 Transformer Edge Optimization Demo


Interactive demo comparing Original vs Quantized transformer models

Try Demo • GitHub Repo • Notebooks


🎯 What Does This Demo Do?

This interactive demo showcases model quantization - a technique to make AI models smaller and faster for mobile/edge devices.

Try It:

  1. Quick Prediction - Test sentiment analysis with quantized model
  2. Model Comparison - Compare Original (FP32) vs Quantized (INT8) side by side
  3. Documentation - Learn about the techniques
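
For context, a tabbed Gradio demo like this is typically wired up along the following lines. This is a minimal sketch, not this Space's actual app.py, and it assumes gradio and transformers are installed:

```python
# Minimal sketch of a tabbed Gradio demo (not this Space's actual app.py).
import gradio as gr
from transformers import pipeline

# Sentiment pipeline backed by the same DistilBERT checkpoint used in the demo
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def predict(text: str) -> str:
    result = classifier(text)[0]
    return f"{result['label']} ({result['score']:.3f})"

with gr.Blocks() as demo:
    with gr.Tab("Quick Prediction"):
        inp = gr.Textbox(label="Text to analyze")
        out = gr.Textbox(label="Prediction")
        gr.Button("Predict").click(predict, inputs=inp, outputs=out)
    with gr.Tab("Documentation"):
        gr.Markdown("Notes on quantization, benchmarks, and links.")

demo.launch()
```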

✨ Key Results

| Metric   | Original | Quantized | Improvement        |
|----------|----------|-----------|--------------------|
| Size     | 255 MB   | 68 MB     | 3.75x smaller ⬇️   |
| Speed    | 12.3 ms  | 5.8 ms    | 2.1x faster ⚡     |
| Accuracy | 91.8%    | 90.2%     | -1.6% 📊           |

Conclusion: a nearly 4x smaller model, roughly 2x faster inference, and only a 1.6-point drop in accuracy!


🧪 What is Quantization?

Quantization reduces model size by converting weights from 32-bit floating point (FP32) to 8-bit integers (INT8).

How It Works:

import torch
from transformers import AutoModelForSequenceClassification

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

# Quantize: FP32 → INT8
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Now 4x smaller! 🎉
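
Continuing from the snippet above (model and quantized are the objects created there), here is a hedged sketch of running a prediction with the quantized model and comparing on-disk sizes:

```python
# Sketch: use the quantized model like the original, then compare file sizes.
import os
import tempfile
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

# Same API as the original model
inputs = tokenizer("This movie was great!", return_tensors="pt")
with torch.no_grad():
    logits = quantized(**inputs).logits
print("POSITIVE" if logits.argmax(-1).item() == 1 else "NEGATIVE")

def size_mb(m: torch.nn.Module) -> float:
    """Serialize the state_dict to a temporary file and report its size in MB."""
    path = os.path.join(tempfile.gettempdir(), "model_size_check.pt")
    torch.save(m.state_dict(), path)
    mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return mb

print(f"Original:  {size_mb(model):.1f} MB")
print(f"Quantized: {size_mb(quantized):.1f} MB")
```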

Why Quantization?

  • ✅ Smaller models - Fit on mobile devices
  • ✅ Faster inference - Better user experience
  • ✅ Lower power - Longer battery life
  • ✅ Easy to implement - Post-training, no retraining

📊 Optimization Techniques

This project demonstrates 3 major techniques:

1. Quantization (This Demo)

  • Compression: 4x
  • Speed: 2-3x faster
  • Difficulty: ⭐ Easy

2. ONNX Runtime

  • Compression: 3.8x
  • Speed: 2.2x faster
  • Difficulty: ⭐⭐ Medium
  • Benefit: Cross-platform deployment

3. Knowledge Distillation

  • Compression: 6-10x
  • Speed: 3x faster
  • Difficulty: ⭐⭐⭐ Advanced
  • Benefit: Student model learns from teacher

🚀 Try The Full Toolkit

Interactive Notebooks (Google Colab):

1. Quantization Basics (15 minutes)


Learn:

  • Dynamic quantization
  • Static quantization
  • Model size comparison
  • Performance benchmarking

2. ONNX Runtime Optimization (20 minutes)


Learn:

  • PyTorch → ONNX conversion
  • Hugging Face Optimum
  • Cross-platform deployment
  • Hardware acceleration
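
The notebook walks through this step by step. As a rough preview (not the notebook's exact code, and Hugging Face Optimum offers a higher-level route), a plain torch.onnx.export of the same DistilBERT checkpoint looks something like this:

```python
# Sketch of the PyTorch -> ONNX export step for the DistilBERT classifier.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(name).eval()
tokenizer = AutoTokenizer.from_pretrained(name)

# Dummy input used to trace the graph
dummy = tokenizer("example text", return_tensors="pt")

torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
    },
    opset_version=14,
)
```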

3. Knowledge Distillation (30 minutes)


Learn:

  • Teacher-student training
  • Distillation loss
  • Creating tiny models
  • BERT → TinyBERT
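
As a taste of the core idea, a standard distillation loss combines a softened teacher/student KL term with ordinary cross-entropy on the true labels. This is a generic sketch; temperature, weighting, and the exact recipe vary:

```python
# Sketch of a standard distillation loss (softened KL + hard-label CE).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets: match the teacher's softened distribution (KL divergence).
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```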

💻 Use Cases

📱 Mobile Apps

// Android with TFLite
val analyzer = SentimentAnalyzer(context)
val result = analyzer.predict("Great app!")

🌐 Web Apps

// Browser with Transformers.js
import { pipeline } from '@xenova/transformers';
const classifier = await pipeline('sentiment-analysis');

🤖 Edge Devices

# Raspberry Pi with ONNX Runtime
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
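
Continuing that snippet, inference is a matter of feeding tokenized NumPy arrays to session.run. A sketch, assuming the model was exported with input_ids/attention_mask inputs as in the export example above:

```python
# Sketch: run the ONNX session on a tokenized input (NumPy tensors).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
inputs = tokenizer("Runs fine on a Pi!", return_tensors="np")

logits = session.run(
    None,  # return all outputs
    {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]},
)[0]
print("POSITIVE" if logits.argmax(-1)[0] == 1 else "NEGATIVE")
```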

📚 Full Documentation

GitHub Repository

mtkaya/transformer-edge-optimization

Contains:

  • ✅ 3 Jupyter notebooks
  • ✅ Example code (Python, Kotlin, JavaScript)
  • ✅ Comprehensive documentation
  • ✅ CI/CD pipeline
  • ✅ Docker support

Quick Links:


🎓 Technical Details

Model Used:

DistilBERT fine-tuned on SST-2 (Stanford Sentiment Treebank)

  • Base Model: distilbert-base-uncased-finetuned-sst-2-english
  • Parameters: 67M
  • Task: Binary sentiment classification (Positive/Negative)
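
A quick way to sanity-check that parameter count (a sketch, not part of the demo code):

```python
# Count parameters of the checkpoint; should land around 67M.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```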

Quantization Approach:

Dynamic Quantization with PyTorch

  • Weights: INT8 (8-bit integers)
  • Activations: FP32 (computed at runtime)
  • Method: torch.quantization.quantize_dynamic()

Benchmark Hardware:

  • CPU: Intel Xeon (Colab)
  • Input: 128 tokens average
  • Iterations: 100 runs per test
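
For reference, per-inference latency numbers like the ones below can be measured with a simple warm-up-then-time loop. This is a sketch of the approach, not the project's exact benchmark script:

```python
# Sketch: measure mean +/- std CPU latency over 100 runs on a ~128-token input.
import statistics
import time
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(name).eval()
tokenizer = AutoTokenizer.from_pretrained(name)

text = "sample " * 200  # long enough to be truncated to 128 tokens
inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")

timings = []
with torch.no_grad():
    for _ in range(10):           # warm-up runs, not timed
        model(**inputs)
    for _ in range(100):          # 100 timed iterations
        start = time.perf_counter()
        model(**inputs)
        timings.append((time.perf_counter() - start) * 1000)

print(f"{statistics.mean(timings):.2f} ± {statistics.stdev(timings):.2f} ms")
```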

📊 Detailed Benchmark

Model Size:

Original (FP32):     255.43 MB
Quantized (INT8):     68.12 MB
Compression Ratio:    3.75x
Space Saved:         187.31 MB (73.3%)

Inference Speed (CPU):

Original:   12.34 ± 0.45 ms
Quantized:   5.78 ± 0.23 ms
Speedup:     2.13x
Time Saved:  6.56 ms per inference (53.2%)

Accuracy (SST-2 Test Set):

Original:   91.8% accuracy
Quantized:  90.2% accuracy
Difference: -1.6%

Memory Usage:

Original:   ~280 MB
Quantized:  ~95 MB
Reduction:  2.95x

🌟 Features of This Demo

🎯 Quick Prediction

  • Enter any text
  • Toggle between Original/Quantized
  • See prediction + confidence + model info

βš–οΈ Model Comparison

  • Side-by-side comparison
  • Same input, both models
  • Performance metrics

📚 Documentation

  • Learn about quantization
  • See benchmark results
  • Access notebooks
  • Quick start code

🤝 Contributing

We welcome contributions! Check out the GitHub repository linked above.


📄 License

This project is licensed under the MIT License.

See LICENSE for details.


πŸ™ Acknowledgments

Built with: PyTorch, Hugging Face Transformers, ONNX Runtime, and Gradio.

Inspired by:


📧 Contact


⭐ Star the repo if you find this useful! ⭐

GitHub Repository • Documentation • Notebooks

Made with ❤️ for the AI community