# Gradio App Summary for Grasp Any Region (GAR)

## ✅ Completion Status

Successfully created a comprehensive Gradio demo for the Grasp Any Region (GAR) project.

## 📁 Files Created/Modified

### 1. **app.py** (NEW)

- Complete Gradio interface with 3 tabs:
  - **Points → Describe**: Interactive point-based segmentation with SAM
  - **Box → Describe**: Bounding-box-based segmentation
  - **Mask → Describe**: Direct mask upload for region description
- Features:
  - ZeroGPU integration with the `@spaces.GPU` decorator
  - Proper import order (`spaces` first, then CUDA packages)
  - SAM (Segment Anything Model) integration for interactive segmentation
  - GAR-1B model for detailed region descriptions
  - Visualization with contours and input annotations
  - Example images and clear instructions
  - Error handling and status messages

### 2. **requirements.txt** (UPDATED)

- Gradio 5.49.1 (required version)
- httpx pinned to >=0.24.1,<1.0 (Gradio compatibility)
- PyTorch 2.8.0 (pinned for FlashAttention compatibility)
- FlashAttention 2.8.3 prebuilt wheel (PyTorch 2.8, Python 3.10, CUDA 12, abiFALSE)
- spaces==0.30.4 for ZeroGPU
- All original dependencies preserved:
  - Segment Anything from GitHub
  - Vision libraries (opencv-python, pillow, pycocotools)
  - Transformers 4.56.2 and supporting ML libraries

## 🎯 Key Features

1. **Three Interaction Modes**:
   - Points: Click or enter coordinates to segment regions
   - Box: Draw or enter bounding boxes
   - Mask: Upload pre-made masks directly

2. **Model Integration**:
   - GAR-1B for region understanding (1 billion parameters)
   - SAM ViT-Huge for automatic segmentation
   - Both models loaded once at startup for efficiency

3. **ZeroGPU Optimization**:
   - Proper `@spaces.GPU(duration=120)` decorator usage
   - 2-minute GPU allocation per function call
   - NVIDIA H200 with 70GB VRAM available
   - Critical import order: `spaces` imported before torch

4. **User Experience**:
   - Clear step-by-step instructions
   - Example images included
   - Real-time visualization with overlays
   - Comprehensive error handling
   - Professional UI with the Gradio 5.x Soft theme

## 🔧 Technical Details

### Import Order (CRITICAL)

```python
# 🚨 spaces MUST be imported FIRST
import spaces

# Then import CUDA packages
import torch
from transformers import AutoModel, AutoProcessor
```

This prevents the "CUDA has been initialized" error.
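Putting these constraints together, the sketch below shows the overall shape of `app.py`: `spaces` imported first, both models loaded once at startup, and all inference confined to `@spaces.GPU`-decorated functions (summarized again under "Model Loading Strategy" below). It is a minimal illustration rather than the actual app code: the SAM checkpoint filename, the `trust_remote_code`/dtype arguments, and the `build_gar_inputs` helper are assumptions, and the real GAR input packing follows the project's evaluation code (`SingleRegionCaptionDataset`).

```python
# Minimal structural sketch (NOT the exact app.py): the SAM checkpoint path,
# the from_pretrained kwargs, and build_gar_inputs are illustrative assumptions.
import spaces  # must be imported before any CUDA-touching package

import numpy as np
import torch
from segment_anything import SamPredictor, sam_model_registry
from transformers import AutoModel, AutoProcessor

# --- Loaded once at startup, outside any @spaces.GPU-decorated function ---
gar_model = AutoModel.from_pretrained(
    "HaochenWang/GAR-1B",        # model repo from the References section
    trust_remote_code=True,      # assumption: GAR ships custom modeling code
    torch_dtype=torch.bfloat16,  # assumption: dtype choice
).cuda().eval()
gar_processor = AutoProcessor.from_pretrained("HaochenWang/GAR-1B", trust_remote_code=True)

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").cuda()  # path is an assumption
sam_predictor = SamPredictor(sam)


def build_gar_inputs(image: np.ndarray, mask: np.ndarray, processor):
    """Hypothetical placeholder: the real prompt and mask packing follow the
    project's evaluation code (SingleRegionCaptionDataset)."""
    raise NotImplementedError


# --- GPU-decorated functions only run inference ---
@spaces.GPU(duration=120)  # 2-minute ZeroGPU allocation per call
def segment_from_points(image: np.ndarray, points: list[tuple[int, int]]) -> np.ndarray:
    """Run SAM with positive point prompts; return the highest-scoring mask."""
    sam_predictor.set_image(image)                 # image: HxWx3 uint8 RGB array
    coords = np.array(points, dtype=np.float32)    # (N, 2) pixel coordinates
    labels = np.ones(len(points), dtype=np.int32)  # 1 = foreground point
    masks, scores, _ = sam_predictor.predict(
        point_coords=coords, point_labels=labels, multimask_output=True
    )
    return masks[int(np.argmax(scores))]


@spaces.GPU(duration=120)
def describe_region(image: np.ndarray, mask: np.ndarray) -> str:
    """Describe the masked region with GAR-1B (call pattern shown schematically)."""
    inputs = build_gar_inputs(image, mask, gar_processor)
    with torch.inference_mode():
        output_ids = gar_model.generate(**inputs, max_new_tokens=1024)
    return gar_processor.decode(output_ids[0], skip_special_tokens=True)
```

In the app itself, the mask and description returned by these functions feed each tab's mask, visualization, and description outputs.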
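Two requirements.txt entries are also worth seeing in literal form, since they are not plain PyPI pins: Segment Anything is installed directly from GitHub, and FlashAttention comes from a prebuilt wheel referenced by URL. A sketch of the relevant lines (the `git+` reference format is an assumption; the version pins and wheel URL are the ones listed in this document):

```text
# Gradio / ZeroGPU
gradio==5.49.1
spaces==0.30.4
httpx>=0.24.1,<1.0

# Segment Anything installed straight from GitHub (exact ref may differ)
git+https://github.com/facebookresearch/segment-anything.git

# Prebuilt FlashAttention wheel: PyTorch 2.8, Python 3.10 (cp310), CUDA 12, abiFALSE
https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```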
### FlashAttention Configuration

- Using the prebuilt wheel for PyTorch 2.8.0
- Python 3.10 (cp310)
- CUDA 12 (cu12)
- abiFALSE (REQUIRED - never use abiTRUE)
- URL: https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

### Model Loading Strategy

- Models loaded once at startup (outside decorated functions)
- Moved to the CUDA device after loading
- GPU-decorated functions only handle inference
- Efficient memory usage

## 📋 Dependencies Highlights

**Core:**
- gradio==5.49.1
- torch==2.8.0
- spaces==0.30.4
- flash-attn (prebuilt wheel)

**AI/ML:**
- transformers==4.56.2
- accelerate>=0.28.0
- timm==1.0.19
- peft==0.15.2

**Vision:**
- opencv-python
- pillow>=9.4.0
- segment-anything (from GitHub)
- pycocotools

## 🎨 UI Structure

```
Grasp Any Region (GAR) Demo
├── Introduction & Links
├── Tab 1: Points → Describe
│   ├── Image upload + points input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 2: Box → Describe
│   ├── Image upload + box input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 3: Mask → Describe
│   ├── Image upload + mask upload
│   ├── Describe Region button
│   └── Outputs: visualization, description
└── Documentation & Citation
```

## 🚀 How to Run

```bash
# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py
```

The app will automatically:

1. Load the GAR-1B and SAM models
2. Launch the Gradio interface
3. Allocate GPU on demand with ZeroGPU

## 📊 Expected Performance

- **Model**: GAR-1B (lightweight, fast inference)
- **GPU**: NVIDIA H200, 70GB VRAM
- **Inference Time**: ~10-30 seconds per region (depending on complexity)
- **Max New Tokens**: 1024 (configurable)

## ⚠️ Important Notes

1. **Import Order**: Always import `spaces` before torch/CUDA packages
2. **Python Version**: Requires Python 3.10 (for the FlashAttention wheel)
3. **FlashAttention**: Uses a prebuilt wheel (no compilation needed)
4. **Asset Files**: The demo expects images in the `assets/` directory
5. **SingleRegionCaptionDataset**: Required from the evaluation module

## 🔗 References

- **Paper**: https://arxiv.org/abs/2510.18876
- **GitHub**: https://github.com/Haochen-Wang409/Grasp-Any-Region
- **Model**: https://huggingface.co/HaochenWang/GAR-1B
- **SAM**: https://github.com/facebookresearch/segment-anything

## 📝 Citation

```bibtex
@article{wang2025grasp,
  title={Grasp Any Region: Prompting MLLM to Understand the Dense World},
  author={Haochen Wang et al.},
  journal={arXiv preprint arXiv:2510.18876},
  year={2025}
}
```

---

**Created**: 2025-10-25
**Status**: ✅ Ready for deployment
**Hardware**: zerogpu (NVIDIA H200, 70GB VRAM)