
Gradio App Summary for Grasp Any Region (GAR)

✅ Completion Status

Successfully created a comprehensive Gradio demo for the Grasp Any Region (GAR) project.

πŸ“ Files Created/Modified

1. app.py (NEW)

  • Complete Gradio interface with 3 tabs:
    • Points → Describe: Interactive point-based segmentation with SAM
    • Box → Describe: Bounding box-based segmentation
    • Mask → Describe: Direct mask upload for region description
  • Features:
    • ZeroGPU integration with @spaces.GPU decorator
    • Proper import order (spaces first, then CUDA packages)
    • SAM (Segment Anything Model) integration for interactive segmentation
    • GAR-1B model for detailed region descriptions
    • Visualization with contours and input annotations
    • Example images and clear instructions
    • Error handling and status messages

2. requirements.txt (UPDATED)

  • Gradio 5.49.1 (required version)
  • httpx version fixed to >=0.24.1,<1.0 (Gradio compatibility)
  • PyTorch 2.8.0 (pinned for FlashAttention compatibility)
  • FlashAttention 2.8.3 prebuilt wheel (PyTorch 2.8, Python 3.10, CUDA 12, abiFALSE)
  • spaces==0.30.4 for ZeroGPU
  • All original dependencies preserved
  • Segment Anything from GitHub
  • Vision libraries (opencv-python, pillow, pycocotools)
  • Transformers 4.56.2 and supporting ML libraries

🎯 Key Features

  1. Three Interaction Modes:

    • Points: Click or enter coordinates to segment regions
    • Box: Draw or enter bounding boxes
    • Mask: Upload pre-made masks directly
  2. Model Integration:

    • GAR-1B for region understanding (1 billion parameters)
    • SAM ViT-Huge for automatic segmentation
    • Both models loaded once at startup for efficiency
  3. ZeroGPU Optimization:

    • Proper @spaces.GPU(duration=120) decorator usage
    • 2-minute GPU allocation per function call
    • NVIDIA H200 with 70GB VRAM available
    • Critical import order: spaces imported before torch
  4. User Experience:

    • Clear step-by-step instructions
    • Example images included
    • Real-time visualization with overlays
    • Comprehensive error handling
    • Professional UI with Gradio 5.x Soft theme
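The point-entry mode above accepts typed coordinates as well as clicks. A minimal sketch of how such input might be normalized before being passed to SAM, assuming a hypothetical `parse_points` helper and a `"x,y; x,y"` input format (the actual format in app.py may differ):

```python
from typing import List, Tuple

def parse_points(text: str) -> List[Tuple[int, int]]:
    """Parse user-typed coordinates like '120,45; 300,200' into (x, y) pairs.

    Hypothetical helper -- the exact input format used by app.py may differ.
    """
    points = []
    for chunk in text.replace(";", "\n").splitlines():
        chunk = chunk.strip()
        if not chunk:
            continue
        x_str, y_str = chunk.split(",")
        points.append((int(x_str), int(y_str)))
    return points

print(parse_points("120,45; 300,200"))  # -> [(120, 45), (300, 200)]
```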

🔧 Technical Details

Import Order (CRITICAL)

```python
# 🚨 spaces MUST be imported FIRST
import spaces

# Then import CUDA packages
import torch
from transformers import AutoModel, AutoProcessor
```

This prevents the "CUDA has been initialized" error.

FlashAttention Configuration
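The body of this section did not survive transfer. Based on the pins listed under requirements.txt (FlashAttention 2.8.3, PyTorch 2.8, Python 3.10, CUDA 12, abiFALSE), the requirements entry presumably points at a prebuilt wheel from the flash-attention releases page. The URL below follows the project's usual wheel-naming scheme and is an assumption, not copied from the repo:

```
# Prebuilt FlashAttention 2.8.3 wheel -- assumed URL; verify against the
# Dao-AILab/flash-attention releases page before use
flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

Using a prebuilt wheel avoids compiling FlashAttention from source at build time, which is the point of Note 3 below.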

Model Loading Strategy

  • Models loaded once at startup (outside decorated functions)
  • Moved to CUDA device after loading
  • GPU-decorated functions only handle inference
  • Efficient memory usage
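The load-once pattern in the bullets above can be sketched in plain Python, independent of the heavy model classes. The `load_models` body here is a stand-in for the real startup code (which would build the GAR-1B model/processor and the SAM predictor, then move them to CUDA); only the caching structure is the point:

```python
from functools import lru_cache

@lru_cache(maxsize=1)  # the expensive load runs exactly once per process
def load_models():
    """Stand-in for the real startup code (GAR-1B + SAM loading)."""
    return {"gar": "GAR-1B", "sam": "SAM ViT-H"}  # placeholder objects

# Module level: trigger the load at startup, as app.py does.
MODELS = load_models()

def describe_region(image, mask):
    """GPU-decorated in the real app; only runs inference, never loads."""
    models = load_models()  # cache hit -- same object as MODELS
    return f"description from {models['gar']}"

assert load_models() is MODELS  # loaded once, reused everywhere
```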

📋 Dependencies Highlights

Core:

  • gradio==5.49.1
  • torch==2.8.0
  • spaces==0.30.4
  • flash-attn (prebuilt wheel)

AI/ML:

  • transformers==4.56.2
  • accelerate>=0.28.0
  • timm==1.0.19
  • peft==0.15.2

Vision:

  • opencv-python
  • pillow>=9.4.0
  • segment-anything (from GitHub)
  • pycocotools

🎨 UI Structure

```
Grasp Any Region (GAR) Demo
├── Introduction & Links
├── Tab 1: Points → Describe
│   ├── Image upload + points input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 2: Box → Describe
│   ├── Image upload + box input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 3: Mask → Describe
│   ├── Image upload + mask upload
│   ├── Describe Region button
│   └── Outputs: visualization, description
└── Documentation & Citation
```
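For the Mask → Describe tab, an uploaded mask image typically needs to be reduced to a binary array before it is handed to the model. A minimal sketch with NumPy, using a hypothetical `to_binary_mask` helper (the actual preprocessing in app.py may differ, e.g. via PIL or OpenCV):

```python
import numpy as np

def to_binary_mask(mask_img: np.ndarray, threshold: int = 127) -> np.ndarray:
    """Collapse an uploaded RGB/grayscale mask image to a 0/1 uint8 mask.

    Hypothetical helper; the real app's preprocessing may differ.
    """
    if mask_img.ndim == 3:           # RGB upload: reduce to grayscale first
        mask_img = mask_img.mean(axis=-1)
    return (mask_img > threshold).astype(np.uint8)

binary = to_binary_mask(np.array([[0, 200], [255, 10]]))
# binary now holds 1 where the upload was bright, 0 elsewhere
```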

🚀 How to Run

```bash
# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py
```

The app will automatically:

  1. Load GAR-1B and SAM models
  2. Launch Gradio interface
  3. Allocate GPU on-demand with ZeroGPU

📊 Expected Performance

  • Model: GAR-1B (lightweight, fast inference)
  • GPU: NVIDIA H200, 70GB VRAM
  • Inference Time: ~10-30 seconds per region (depending on complexity)
  • Max New Tokens: 1024 (configurable)

⚠️ Important Notes

  1. Import Order: Always import spaces before torch/CUDA packages
  2. Python Version: Requires Python 3.10 (for FlashAttention wheel)
  3. FlashAttention: Uses prebuilt wheel (no compilation needed)
  4. Asset Files: Demo expects images in assets/ directory
  5. SingleRegionCaptionDataset: Required from evaluation module
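Note 4 means the example widgets break quietly if the images are missing, so a small startup check can surface that early. The `assets/` path comes from the note above; the accepted extensions here are illustrative only:

```python
from pathlib import Path

def check_assets(asset_dir: str = "assets") -> list:
    """Return the example images found under asset_dir; warn if none exist.

    The directory name comes from the app's notes; the extension set is an
    assumption -- adjust to whatever the demo actually ships with.
    """
    root = Path(asset_dir)
    if not root.is_dir():
        print(f"warning: {asset_dir}/ not found -- example images will be missing")
        return []
    return sorted(p for p in root.iterdir()
                  if p.suffix.lower() in {".jpg", ".jpeg", ".png"})
```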

πŸ“ Citation

@article{wang2025grasp,
  title={Grasp Any Region: Prompting MLLM to Understand the Dense World},
  author={Haochen Wang et al.},
  journal={arXiv preprint arXiv:2510.18876},
  year={2025}
}

Created: 2025-10-25
Status: ✅ Ready for deployment
Hardware: ZeroGPU (NVIDIA H200, 70GB VRAM)