Gradio App Summary for Grasp Any Region (GAR)
✅ Completion Status
Successfully created a comprehensive Gradio demo for the Grasp Any Region (GAR) project.
📁 Files Created/Modified
1. app.py (NEW)
- Complete Gradio interface with 3 tabs (a skeleton sketch follows this list):
- Points → Describe: Interactive point-based segmentation with SAM
- Box → Describe: Bounding box-based segmentation
- Mask → Describe: Direct mask upload for region description
- Features:
- ZeroGPU integration with @spaces.GPU decorator
- Proper import order (spaces first, then CUDA packages)
- SAM (Segment Anything Model) integration for interactive segmentation
- GAR-1B model for detailed region descriptions
- Visualization with contours and input annotations
- Example images and clear instructions
- Error handling and status messages
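As a rough sketch of that three-tab layout, the Blocks skeleton presumably looks something like this (component names and labels are illustrative, not the exact app.py contents):

import gradio as gr

with gr.Blocks(theme=gr.themes.Soft(), title="Grasp Any Region (GAR) Demo") as demo:
    gr.Markdown("# Grasp Any Region (GAR) Demo")

    with gr.Tab("Points → Describe"):
        img_pts = gr.Image(label="Input Image", type="pil")
        pts = gr.Textbox(label="Points (one 'x,y' per line)")
        gen_mask_btn = gr.Button("Generate Mask")
        describe_btn = gr.Button("Describe Region")
        mask_out = gr.Image(label="Mask")
        vis_out = gr.Image(label="Visualization")
        desc_out = gr.Textbox(label="Description")
        # gen_mask_btn.click(...) / describe_btn.click(...) wire up the handlers

    with gr.Tab("Box → Describe"):
        pass  # same pattern, with a bounding-box input instead of points

    with gr.Tab("Mask → Describe"):
        pass  # image upload + mask upload, describe only

demo.launch()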
2. requirements.txt (UPDATED)
- Gradio 5.49.1 (required version)
- httpx constrained to >=0.24.1,<1.0 (for Gradio compatibility)
- PyTorch 2.8.0 (pinned for FlashAttention compatibility)
- FlashAttention 2.8.3 prebuilt wheel (PyTorch 2.8, Python 3.10, CUDA 12, abiFALSE)
- spaces==0.30.4 for ZeroGPU
- All original dependencies preserved
- Segment Anything from GitHub
- Vision libraries (opencv-python, pillow, pycocotools)
- Transformers 4.56.2 and supporting ML libraries
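Based on the versions listed above, the core of requirements.txt presumably looks roughly like this (a sketch reconstructed from this summary, not the verbatim file):

gradio==5.49.1
spaces==0.30.4
torch==2.8.0
httpx>=0.24.1,<1.0
transformers==4.56.2
accelerate>=0.28.0
timm==1.0.19
peft==0.15.2
opencv-python
pillow>=9.4.0
pycocotools
segment-anything @ git+https://github.com/facebookresearch/segment-anything.git
flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl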
🎯 Key Features
Three Interaction Modes (sketched below):
- Points: Click or enter coordinates to segment regions
- Box: Draw or enter bounding boxes
- Mask: Upload pre-made masks directly
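All three modes reduce to producing a binary mask for GAR. For points and boxes this goes through SAM's predictor API; a minimal sketch, assuming the standard segment-anything interface (the checkpoint filename and example image path are assumptions):

import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Checkpoint file and example path are assumptions (official ViT-H weights name).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)

image_rgb = np.array(Image.open("assets/example.jpg").convert("RGB"))
predictor.set_image(image_rgb)

# Point prompt: pixel coordinates plus labels (1 = foreground, 0 = background).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=False,
)

# Box prompt: [x1, y1, x2, y2] in pixels.
masks, scores, _ = predictor.predict(
    box=np.array([100, 100, 400, 400]),
    multimask_output=False,
)

mask = masks[0]  # boolean HxW mask; the Mask tab bypasses SAM and uses the upload directly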
Model Integration:
- GAR-1B for region understanding (1 billion parameters)
- SAM ViT-Huge for automatic segmentation
- Both models loaded once at startup for efficiency
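For the GAR side, a sketch of the one-time startup load, assuming a standard transformers entry point with remote code (the actual loading in app.py may differ in dtype or arguments):

import torch
from transformers import AutoModel, AutoProcessor

MODEL_ID = "HaochenWang/GAR-1B"  # model ID from the references section
# trust_remote_code and bfloat16 are assumptions, common for custom MLLM architectures.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = (
    AutoModel.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True)
    .to("cuda")  # moved to the GPU once at startup
    .eval()
)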
ZeroGPU Optimization:
- Proper @spaces.GPU(duration=120) decorator usage
- 2-minute GPU allocation per function call
- NVIDIA H200 with 70GB VRAM available
- Critical import order: spaces imported before torch
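Combined, the decorated inference function presumably follows this pattern (the function name and the GAR preprocessing/generation calls are generic placeholders, since the model's exact interface is custom):

import spaces  # already imported before torch at the top of app.py
import torch

@spaces.GPU(duration=120)  # the GPU is attached only while this function runs
def describe_region(image, mask):
    # `model` and `processor` are the module-level globals loaded at startup.
    # How the mask prompt is injected is GAR-specific and elided here.
    inputs = processor(images=image, return_tensors="pt").to("cuda")
    with torch.inference_mode():
        output_ids = model.generate(**inputs, max_new_tokens=1024)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]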
User Experience:
- Clear step-by-step instructions
- Example images included
- Real-time visualization with overlays
- Comprehensive error handling
- Professional UI with Gradio 5.x Soft theme
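Errors in the event handlers are surfaced with gr.Error, which Gradio renders as a toast message; a typical pattern (the run_sam helper is hypothetical):

import gradio as gr

def generate_mask(image, points_text):
    if image is None:
        raise gr.Error("Please upload an image first.")
    if not points_text.strip():
        raise gr.Error("Enter at least one point as 'x,y'.")
    return run_sam(image, points_text)  # hypothetical helper wrapping SamPredictor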
🔧 Technical Details
Import Order (CRITICAL)
# 🚨 spaces MUST be imported FIRST
import spaces
# Then import CUDA packages
import torch
from transformers import AutoModel, AutoProcessor
This prevents the "CUDA has been initialized" error.
FlashAttention Configuration
- Using prebuilt wheel for PyTorch 2.8.0
- Python 3.10 (cp310)
- CUDA 12 (cu12)
- abiFALSE (REQUIRED - never use abiTRUE)
- URL: https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
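If installing by hand rather than through requirements.txt, the wheel installs directly from that URL:

# Install the prebuilt FlashAttention wheel (no compilation)
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl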
Model Loading Strategy
- Models loaded once at startup (outside decorated functions)
- Moved to CUDA device after loading
- GPU-decorated functions only handle inference
- Efficient memory usage
📦 Dependencies Highlights
Core:
- gradio==5.49.1
- torch==2.8.0
- spaces==0.30.4
- flash-attn (prebuilt wheel)
AI/ML:
- transformers==4.56.2
- accelerate>=0.28.0
- timm==1.0.19
- peft==0.15.2
Vision:
- opencv-python
- pillow>=9.4.0
- segment-anything (from GitHub)
- pycocotools
🎨 UI Structure
Grasp Any Region (GAR) Demo
├── Introduction & Links
├── Tab 1: Points → Describe
│   ├── Image upload + points input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 2: Box → Describe
│   ├── Image upload + box input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 3: Mask → Describe
│   ├── Image upload + mask upload
│   ├── Describe Region button
│   └── Outputs: visualization, description
└── Documentation & Citation
🚀 How to Run
# Install dependencies
pip install -r requirements.txt
# Run the app
python app.py
The app will automatically:
- Load GAR-1B and SAM models
- Launch Gradio interface
- Allocate GPU on-demand with ZeroGPU
📊 Expected Performance
- Model: GAR-1B (lightweight, fast inference)
- GPU: NVIDIA H200, 70GB VRAM
- Inference Time: ~10-30 seconds per region (depending on complexity)
- Max New Tokens: 1024 (configurable)
⚠️ Important Notes
- Import Order: Always import spaces before torch/CUDA packages
- Python Version: Requires Python 3.10 (for the FlashAttention wheel)
- FlashAttention: Uses a prebuilt wheel (no compilation needed)
- Asset Files: Demo expects example images in the assets/ directory
- SingleRegionCaptionDataset: Required from the evaluation module
📚 References
- Paper: https://arxiv.org/abs/2510.18876
- GitHub: https://github.com/Haochen-Wang409/Grasp-Any-Region
- Model: https://huggingface.co/HaochenWang/GAR-1B
- SAM: https://github.com/facebookresearch/segment-anything
📝 Citation
@article{wang2025grasp,
title={Grasp Any Region: Prompting MLLM to Understand the Dense World},
author={Wang, Haochen and others},
journal={arXiv preprint arXiv:2510.18876},
year={2025}
}
Created: 2025-10-25 | Status: ✅ Ready for deployment | Hardware: zerogpu (NVIDIA H200, 70GB VRAM)