# Gradio App Summary for Grasp Any Region (GAR)
## ✅ Completion Status
Successfully created a comprehensive Gradio demo for the Grasp Any Region (GAR) project.
## 📁 Files Created/Modified
### 1. **app.py** (NEW)
- Complete Gradio interface with 3 tabs:
  - **Points → Describe**: Interactive point-based segmentation with SAM
  - **Box → Describe**: Bounding box-based segmentation
  - **Mask → Describe**: Direct mask upload for region description
- Features:
  - ZeroGPU integration with `@spaces.GPU` decorator
  - Proper import order (spaces first, then CUDA packages)
  - SAM (Segment Anything Model) integration for interactive segmentation
  - GAR-1B model for detailed region descriptions
  - Visualization with contours and input annotations (see the sketch below)
  - Example images and clear instructions
  - Error handling and status messages
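A rough sketch of the contour-overlay visualization using OpenCV (the helper name and colors are illustrative assumptions, not the actual `app.py` code):
```python
import cv2
import numpy as np

def overlay_mask_contours(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Draw the segmentation outline on top of an RGB uint8 image."""
    vis = image.copy()
    binary = (mask > 0).astype(np.uint8) * 255
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cv2.drawContours(vis, contours, -1, (255, 0, 0), thickness=2)  # red outline (RGB)
    # Translucent fill so the selected region stands out in the overlay
    fill = np.zeros_like(vis)
    fill[..., 0] = 255
    blended = (0.6 * vis + 0.4 * fill).astype(np.uint8)
    return np.where(binary[..., None] > 0, blended, vis)
```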
### 2. **requirements.txt** (UPDATED)
- Gradio 5.49.1 (required version)
- httpx pinned to >=0.24.1,<1.0 (for Gradio compatibility)
- PyTorch 2.8.0 (pinned for FlashAttention compatibility)
- FlashAttention 2.8.3 prebuilt wheel (PyTorch 2.8, Python 3.10, CUDA 12, abiFALSE)
- spaces==0.30.4 for ZeroGPU
- All original dependencies preserved:
  - Segment Anything from GitHub
  - Vision libraries (opencv-python, pillow, pycocotools)
  - Transformers 4.56.2 and supporting ML libraries
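A hedged sketch of the resulting pins in `requirements.txt`, based only on the versions listed above (ordering, the exact Segment Anything install form, and any additional entries in the actual file may differ):
```text
gradio==5.49.1
httpx>=0.24.1,<1.0
torch==2.8.0
spaces==0.30.4
transformers==4.56.2
accelerate>=0.28.0
timm==1.0.19
peft==0.15.2
opencv-python
pillow>=9.4.0
pycocotools
git+https://github.com/facebookresearch/segment-anything.git
https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```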
## 🎯 Key Features
1. **Three Interaction Modes**:
   - Points: Click or enter coordinates to segment regions
   - Box: Draw or enter bounding boxes
   - Mask: Upload pre-made masks directly
2. **Model Integration**:
   - GAR-1B for region understanding (1 billion parameters)
   - SAM ViT-Huge for interactive, prompt-based segmentation (see the sketch after this list)
   - Both models loaded once at startup for efficiency
3. **ZeroGPU Optimization**:
   - Proper `@spaces.GPU(duration=120)` decorator usage
   - 2-minute GPU allocation per function call
   - NVIDIA H200 with 70GB VRAM available
   - Critical import order: `spaces` imported before torch
4. **User Experience**:
   - Clear step-by-step instructions
   - Example images included
   - Real-time visualization with overlays
   - Comprehensive error handling
   - Professional UI with Gradio 5.x Soft theme
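As a sketch of how the point mode drives SAM (the checkpoint filename, function signature, and device handling are assumptions, not the exact `app.py` code):
```python
import numpy as np
import torch
from segment_anything import sam_model_registry, SamPredictor

# Assumed ViT-H checkpoint filename; the Space may fetch it differently.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda" if torch.cuda.is_available() else "cpu")
predictor = SamPredictor(sam)

def mask_from_points(image: np.ndarray, points: list[tuple[int, int]]) -> np.ndarray:
    """Turn user-clicked foreground points into a binary mask with SAM."""
    predictor.set_image(image)                     # RGB uint8, HxWx3
    coords = np.array(points, dtype=np.float32)    # (N, 2) pixel coordinates
    labels = np.ones(len(points), dtype=np.int64)  # 1 = foreground click
    masks, scores, _ = predictor.predict(
        point_coords=coords, point_labels=labels, multimask_output=True
    )
    return masks[np.argmax(scores)]                # keep the highest-scoring proposal
```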
## 🔧 Technical Details
### Import Order (CRITICAL)
```python
# 🚨 spaces MUST be imported FIRST
import spaces
# Then import CUDA packages
import torch
from transformers import AutoModel, AutoProcessor
```
This prevents the "CUDA has been initialized" error.
### FlashAttention Configuration
- Using prebuilt wheel for PyTorch 2.8.0
- Python 3.10 (cp310)
- CUDA 12 (cu12)
- abiFALSE (REQUIRED - never use abiTRUE)
- URL: https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
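A quick runtime sanity check that the installed wheel, PyTorch build, and CUDA version line up with the pins above:
```python
import torch
import flash_attn

print(torch.__version__)       # expect 2.8.0
print(flash_attn.__version__)  # expect 2.8.3
print(torch.version.cuda)      # expect a CUDA 12.x build
```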
### Model Loading Strategy
- Models loaded once at startup (outside decorated functions)
- Moved to CUDA device after loading
- GPU-decorated functions only handle inference
- Efficient memory usage
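A minimal sketch of this pattern. It assumes the model loads through `transformers` with `trust_remote_code` and exposes a standard `generate` method; the real `app.py` preprocessing, in particular how the region mask is injected into the prompt, follows GAR's own code and is not reproduced here:
```python
import spaces                          # must be imported before torch / CUDA packages
import torch
from transformers import AutoModel, AutoProcessor

# Loaded once at startup, outside any GPU-decorated function.
model = AutoModel.from_pretrained(
    "HaochenWang/GAR-1B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # assumption: uses the prebuilt flash-attn wheel
    trust_remote_code=True,                   # assumption: GAR ships custom modeling code
).cuda().eval()
processor = AutoProcessor.from_pretrained("HaochenWang/GAR-1B", trust_remote_code=True)

@spaces.GPU(duration=120)  # ZeroGPU attaches an H200 slice for up to 2 minutes per call
def describe_region(image, prompt):
    """Only inference runs inside the GPU-decorated function."""
    # Hypothetical generic call; GAR's real inputs also encode the selected region mask.
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        output_ids = model.generate(**inputs, max_new_tokens=1024)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```
Keeping the heavy `from_pretrained` calls at import time means each request only pays for inference inside the 120-second GPU window.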
## 📦 Dependencies Highlights
**Core:**
- gradio==5.49.1
- torch==2.8.0
- spaces==0.30.4
- flash-attn (prebuilt wheel)

**AI/ML:**
- transformers==4.56.2
- accelerate>=0.28.0
- timm==1.0.19
- peft==0.15.2

**Vision:**
- opencv-python
- pillow>=9.4.0
- segment-anything (from GitHub)
- pycocotools
## 🎨 UI Structure
```
Grasp Any Region (GAR) Demo
├── Introduction & Links
├── Tab 1: Points → Describe
│   ├── Image upload + points input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 2: Box → Describe
│   ├── Image upload + box input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 3: Mask → Describe
│   ├── Image upload + mask upload
│   ├── Describe Region button
│   └── Outputs: visualization, description
└── Documentation & Citation
```
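A skeletal Gradio Blocks version of that layout (component names are illustrative and the event wiring to the SAM/GAR functions is omitted):
```python
import gradio as gr

with gr.Blocks(theme=gr.themes.Soft(), title="Grasp Any Region (GAR) Demo") as demo:
    gr.Markdown("# Grasp Any Region (GAR) Demo")  # introduction & links

    with gr.Tab("Points → Describe"):
        image_in = gr.Image(label="Input image")
        points_in = gr.Textbox(label="Points (x1,y1; x2,y2; ...)")
        mask_btn = gr.Button("Generate Mask")
        desc_btn = gr.Button("Describe Region")
        mask_out = gr.Image(label="Mask")
        vis_out = gr.Image(label="Visualization")
        desc_out = gr.Textbox(label="Description")
        # mask_btn.click(...) / desc_btn.click(...) wire these to SAM and GAR

    with gr.Tab("Box → Describe"):
        ...  # same pattern with a bounding-box input instead of points

    with gr.Tab("Mask → Describe"):
        ...  # image upload + mask upload, then Describe Region only

    gr.Markdown("Documentation & citation")

demo.launch()
```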
## 🚀 How to Run
```bash
# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py
```
The app will automatically:
1. Load GAR-1B and SAM models
2. Launch the Gradio interface
3. Allocate GPU on-demand with ZeroGPU
## 📊 Expected Performance
- **Model**: GAR-1B (lightweight, fast inference)
- **GPU**: NVIDIA H200, 70GB VRAM
- **Inference Time**: ~10-30 seconds per region (depending on complexity)
- **Max New Tokens**: 1024 (configurable)
## ⚠️ Important Notes
1. **Import Order**: Always import `spaces` before torch/CUDA packages
2. **Python Version**: Requires Python 3.10 (to match the FlashAttention wheel)
3. **FlashAttention**: Uses a prebuilt wheel (no compilation needed)
4. **Asset Files**: The demo expects example images in the `assets/` directory
5. **SingleRegionCaptionDataset**: Required import from the project's evaluation module
## 📚 References
- **Paper**: https://arxiv.org/abs/2510.18876
- **GitHub**: https://github.com/Haochen-Wang409/Grasp-Any-Region
- **Model**: https://huggingface.co/HaochenWang/GAR-1B
- **SAM**: https://github.com/facebookresearch/segment-anything
## 📝 Citation
```bibtex
@article{wang2025grasp,
  title={Grasp Any Region: Prompting MLLM to Understand the Dense World},
  author={Haochen Wang and others},
  journal={arXiv preprint arXiv:2510.18876},
  year={2025}
}
```
---
**Created**: 2025-10-25
**Status**: ✅ Ready for deployment
**Hardware**: ZeroGPU (NVIDIA H200, 70GB VRAM)