---
title: Transformer Edge Optimization
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
tags:
  - quantization
  - optimization
  - edge-ai
  - mobile
  - transformers
  - onnx
  - sentiment-analysis
duplicated_from: null
---

🚀 Transformer Edge Optimization Demo


Interactive demo comparing Original vs Quantized transformer models

Try Demo • GitHub Repo • Notebooks


🎯 What Does This Demo Do?

This interactive demo showcases model quantization - a technique to make AI models smaller and faster for mobile/edge devices.

Try It:

  1. Quick Prediction - Test sentiment analysis with quantized model
  2. Model Comparison - Compare Original (FP32) vs Quantized (INT8) side by side
  3. Documentation - Learn about the techniques
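
For context, a tabbed Gradio demo like this is typically wired up along the following lines. This is a minimal sketch, not this Space's actual app.py, and it assumes gradio and transformers are installed:

```python
# Minimal sketch of a tabbed Gradio demo (not this Space's actual app.py).
import gradio as gr
from transformers import pipeline

# Sentiment pipeline backed by the same DistilBERT checkpoint used in the demo
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def predict(text: str) -> str:
    result = classifier(text)[0]
    return f"{result['label']} ({result['score']:.3f})"

with gr.Blocks() as demo:
    with gr.Tab("Quick Prediction"):
        inp = gr.Textbox(label="Text to analyze")
        out = gr.Textbox(label="Prediction")
        gr.Button("Predict").click(predict, inputs=inp, outputs=out)
    with gr.Tab("Documentation"):
        gr.Markdown("Notes on quantization, benchmarks, and links.")

demo.launch()
```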

✨ Key Results

| Metric   | Original | Quantized | Improvement        |
|----------|----------|-----------|--------------------|
| Size     | 255 MB   | 68 MB     | 3.75x smaller ⬇️   |
| Speed    | 12.3 ms  | 5.8 ms    | 2.1x faster ⚡     |
| Accuracy | 91.8%    | 90.2%     | -1.6% 📊           |

Conclusion: a nearly 4x smaller model, roughly 2x faster inference, and only a 1.6-point drop in accuracy!


🧪 What is Quantization?

Quantization reduces model size by converting weights from 32-bit floating point (FP32) to 8-bit integers (INT8).

How It Works:

import torch
from transformers import AutoModelForSequenceClassification

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

# Quantize: FP32 → INT8
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Now 4x smaller! 🎉
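
Continuing from the snippet above (model and quantized are the objects created there), here is a hedged sketch of running a prediction with the quantized model and comparing on-disk sizes:

```python
# Sketch: use the quantized model like the original, then compare file sizes.
import os
import tempfile
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

# Same API as the original model
inputs = tokenizer("This movie was great!", return_tensors="pt")
with torch.no_grad():
    logits = quantized(**inputs).logits
print("POSITIVE" if logits.argmax(-1).item() == 1 else "NEGATIVE")

def size_mb(m: torch.nn.Module) -> float:
    """Serialize the state_dict to a temporary file and report its size in MB."""
    path = os.path.join(tempfile.gettempdir(), "model_size_check.pt")
    torch.save(m.state_dict(), path)
    mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return mb

print(f"Original:  {size_mb(model):.1f} MB")
print(f"Quantized: {size_mb(quantized):.1f} MB")
```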

Why Quantization?

  • ✅ Smaller models - Fit on mobile devices
  • ✅ Faster inference - Better user experience
  • ✅ Lower power - Longer battery life
  • ✅ Easy to implement - Post-training, no retraining

📊 Optimization Techniques

This project demonstrates 3 major techniques:

1. Quantization (This Demo)

  • Compression: 4x
  • Speed: 2-3x faster
  • Difficulty: ⭐ Easy

2. ONNX Runtime

  • Compression: 3.8x
  • Speed: 2.2x faster
  • Difficulty: ⭐⭐ Medium
  • Benefit: Cross-platform deployment

3. Knowledge Distillation

  • Compression: 6-10x
  • Speed: 3x faster
  • Difficulty: ⭐⭐⭐ Advanced
  • Benefit: Student model learns from teacher

🚀 Try The Full Toolkit

Interactive Notebooks (Google Colab):

1. Quantization Basics (15 minutes)


Learn:

  • Dynamic quantization
  • Static quantization
  • Model size comparison
  • Performance benchmarking

2. ONNX Runtime Optimization (20 minutes)


Learn:

  • PyTorch → ONNX conversion
  • Hugging Face Optimum
  • Cross-platform deployment
  • Hardware acceleration
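
The notebook walks through this step by step. As a rough preview (not the notebook's exact code, and Hugging Face Optimum offers a higher-level route), a plain torch.onnx.export of the same DistilBERT checkpoint looks something like this:

```python
# Sketch of the PyTorch -> ONNX export step for the DistilBERT classifier.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(name).eval()
tokenizer = AutoTokenizer.from_pretrained(name)

# Dummy input used to trace the graph
dummy = tokenizer("example text", return_tensors="pt")

torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
    },
    opset_version=14,
)
```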

3. Knowledge Distillation (30 minutes)


Learn:

  • Teacher-student training
  • Distillation loss
  • Creating tiny models
  • BERT → TinyBERT
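
As a taste of the core idea, a standard distillation loss combines a softened teacher/student KL term with ordinary cross-entropy on the true labels. This is a generic sketch; temperature, weighting, and the exact recipe vary:

```python
# Sketch of a standard distillation loss (softened KL + hard-label CE).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets: match the teacher's softened distribution (KL divergence).
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```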

💻 Use Cases

📱 Mobile Apps

// Android with TFLite
val analyzer = SentimentAnalyzer(context)
val result = analyzer.predict("Great app!")

🌐 Web Apps

// Browser with Transformers.js
import { pipeline } from '@xenova/transformers';
const classifier = await pipeline('sentiment-analysis');

🤖 Edge Devices

# Raspberry Pi with ONNX Runtime
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
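
Continuing that snippet, inference is a matter of feeding tokenized NumPy arrays to session.run. A sketch, assuming the model was exported with input_ids/attention_mask inputs as in the export example above:

```python
# Sketch: run the ONNX session on a tokenized input (NumPy tensors).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
inputs = tokenizer("Runs fine on a Pi!", return_tensors="np")

logits = session.run(
    None,  # return all outputs
    {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]},
)[0]
print("POSITIVE" if logits.argmax(-1)[0] == 1 else "NEGATIVE")
```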

📚 Full Documentation

GitHub Repository

mtkaya/transformer-edge-optimization

Contains:

  • ✅ 3 Jupyter notebooks
  • ✅ Example code (Python, Kotlin, JavaScript)
  • ✅ Comprehensive documentation
  • ✅ CI/CD pipeline
  • ✅ Docker support

Quick Links:


🎓 Technical Details

Model Used:

DistilBERT fine-tuned on SST-2 (Stanford Sentiment Treebank)

  • Base Model: distilbert-base-uncased-finetuned-sst-2-english
  • Parameters: 67M
  • Task: Binary sentiment classification (Positive/Negative)
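
A quick way to sanity-check that parameter count (a sketch, not part of the demo code):

```python
# Count parameters of the checkpoint; should land around 67M.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```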

Quantization Approach:

Dynamic Quantization with PyTorch

  • Weights: INT8 (8-bit integers)
  • Activations: FP32 (computed at runtime)
  • Method: torch.quantization.quantize_dynamic()

Benchmark Hardware:

  • CPU: Intel Xeon (Colab)
  • Input: 128 tokens average
  • Iterations: 100 runs per test
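
For reference, per-inference latency numbers like the ones below can be measured with a simple warm-up-then-time loop. This is a sketch of the approach, not the project's exact benchmark script:

```python
# Sketch: measure mean +/- std CPU latency over 100 runs on a ~128-token input.
import statistics
import time
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(name).eval()
tokenizer = AutoTokenizer.from_pretrained(name)

text = "sample " * 200  # long enough to be truncated to 128 tokens
inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")

timings = []
with torch.no_grad():
    for _ in range(10):           # warm-up runs, not timed
        model(**inputs)
    for _ in range(100):          # 100 timed iterations
        start = time.perf_counter()
        model(**inputs)
        timings.append((time.perf_counter() - start) * 1000)

print(f"{statistics.mean(timings):.2f} ± {statistics.stdev(timings):.2f} ms")
```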

📊 Detailed Benchmark

Model Size:

Original (FP32):     255.43 MB
Quantized (INT8):     68.12 MB
Compression Ratio:    3.75x
Space Saved:         187.31 MB (73.3%)

Inference Speed (CPU):

Original:   12.34 ± 0.45 ms
Quantized:   5.78 ± 0.23 ms
Speedup:     2.13x
Time Saved:  6.56 ms per inference (53.2%)

Accuracy (SST-2 Test Set):

Original:   91.8% accuracy
Quantized:  90.2% accuracy
Difference: -1.6%

Memory Usage:

Original:   ~280 MB
Quantized:  ~95 MB
Reduction:  2.95x

🌟 Features of This Demo

🎯 Quick Prediction

  • Enter any text
  • Toggle between Original/Quantized
  • See prediction + confidence + model info

βš–οΈ Model Comparison

  • Side-by-side comparison
  • Same input, both models
  • Performance metrics

📚 Documentation

  • Learn about quantization
  • See benchmark results
  • Access notebooks
  • Quick start code

🤝 Contributing

We welcome contributions! Check out the GitHub repository linked above.


📄 License

This project is licensed under the MIT License.

See LICENSE for details.


πŸ™ Acknowledgments

Built with: PyTorch, Hugging Face Transformers, ONNX Runtime, and Gradio.

Inspired by:


📧 Contact


⭐ Star the repo if you find this useful! ⭐

GitHub Repository • Documentation • Notebooks

Made with ❤️ for the AI community