
Gradio App Summary for Grasp Any Region (GAR)

✅ Completion Status

Successfully created a comprehensive Gradio demo for the Grasp Any Region (GAR) project.

πŸ“ Files Created/Modified

1. app.py (NEW)

  • Complete Gradio interface with 3 tabs:
    • Points → Describe: Interactive point-based segmentation with SAM
    • Box → Describe: Bounding box-based segmentation
    • Mask → Describe: Direct mask upload for region description
  • Features:
    • ZeroGPU integration with @spaces.GPU decorator
    • Proper import order (spaces first, then CUDA packages)
    • SAM (Segment Anything Model) integration for interactive segmentation
    • GAR-1B model for detailed region descriptions
    • Visualization with contours and input annotations
    • Example images and clear instructions
    • Error handling and status messages

2. requirements.txt (UPDATED)

  • Gradio 5.49.1 (required version)
  • httpx version fixed to >=0.24.1,<1.0 (Gradio compatibility)
  • PyTorch 2.8.0 (pinned for FlashAttention compatibility)
  • FlashAttention 2.8.3 prebuilt wheel (PyTorch 2.8, Python 3.10, CUDA 12, abiFALSE)
  • spaces==0.30.4 for ZeroGPU
  • All original dependencies preserved
  • Segment Anything from GitHub
  • Vision libraries (opencv-python, pillow, pycocotools)
  • Transformers 4.56.2 and supporting ML libraries

🎯 Key Features

  1. Three Interaction Modes:

    • Points: Click or enter coordinates to segment regions
    • Box: Draw or enter bounding boxes
    • Mask: Upload pre-made masks directly
  2. Model Integration:

    • GAR-1B for region understanding (1 billion parameters)
    • SAM ViT-Huge for automatic segmentation
    • Both models loaded once at startup for efficiency
  3. ZeroGPU Optimization:

    • Proper @spaces.GPU(duration=120) decorator usage
    • 2-minute GPU allocation per function call
    • NVIDIA H200 with 70GB VRAM available
    • Critical import order: spaces imported before torch
  4. User Experience:

    • Clear step-by-step instructions
    • Example images included
    • Real-time visualization with overlays
    • Comprehensive error handling
    • Professional UI with Gradio 5.x Soft theme
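The point-entry mode above accepts typed coordinates as well as clicks. A minimal sketch of how such input might be normalized before being passed to SAM, assuming a hypothetical `parse_points` helper and a `"x,y; x,y"` input format (the actual format in app.py may differ):

```python
from typing import List, Tuple

def parse_points(text: str) -> List[Tuple[int, int]]:
    """Parse user-typed coordinates like '120,45; 300,200' into (x, y) pairs.

    Hypothetical helper -- the exact input format used by app.py may differ.
    """
    points = []
    for chunk in text.replace(";", "\n").splitlines():
        chunk = chunk.strip()
        if not chunk:
            continue
        x_str, y_str = chunk.split(",")
        points.append((int(x_str), int(y_str)))
    return points

print(parse_points("120,45; 300,200"))  # -> [(120, 45), (300, 200)]
```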

🔧 Technical Details

Import Order (CRITICAL)

```python
# 🚨 spaces MUST be imported FIRST
import spaces

# Then import CUDA packages
import torch
from transformers import AutoModel, AutoProcessor
```

This prevents the "CUDA has been initialized" error.

FlashAttention Configuration
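The body of this section did not survive transfer. Based on the pins listed under requirements.txt (FlashAttention 2.8.3, PyTorch 2.8, Python 3.10, CUDA 12, abiFALSE), the requirements entry presumably points at a prebuilt wheel from the flash-attention releases page. The URL below follows the project's usual wheel-naming scheme and is an assumption, not copied from the repo:

```
# Prebuilt FlashAttention 2.8.3 wheel -- assumed URL; verify against the
# Dao-AILab/flash-attention releases page before use
flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

Using a prebuilt wheel avoids compiling FlashAttention from source at build time, which is the point of Note 3 below.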

Model Loading Strategy

  • Models loaded once at startup (outside decorated functions)
  • Moved to CUDA device after loading
  • GPU-decorated functions only handle inference
  • Efficient memory usage
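The load-once pattern in the bullets above can be sketched in plain Python, independent of the heavy model classes. The `load_models` body here is a stand-in for the real startup code (which would build the GAR-1B model/processor and the SAM predictor, then move them to CUDA); only the caching structure is the point:

```python
from functools import lru_cache

@lru_cache(maxsize=1)  # the expensive load runs exactly once per process
def load_models():
    """Stand-in for the real startup code (GAR-1B + SAM loading)."""
    return {"gar": "GAR-1B", "sam": "SAM ViT-H"}  # placeholder objects

# Module level: trigger the load at startup, as app.py does.
MODELS = load_models()

def describe_region(image, mask):
    """GPU-decorated in the real app; only runs inference, never loads."""
    models = load_models()  # cache hit -- same object as MODELS
    return f"description from {models['gar']}"

assert load_models() is MODELS  # loaded once, reused everywhere
```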

📋 Dependencies Highlights

Core:

  • gradio==5.49.1
  • torch==2.8.0
  • spaces==0.30.4
  • flash-attn (prebuilt wheel)

AI/ML:

  • transformers==4.56.2
  • accelerate>=0.28.0
  • timm==1.0.19
  • peft==0.15.2

Vision:

  • opencv-python
  • pillow>=9.4.0
  • segment-anything (from GitHub)
  • pycocotools

🎨 UI Structure

```
Grasp Any Region (GAR) Demo
├── Introduction & Links
├── Tab 1: Points → Describe
│   ├── Image upload + points input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 2: Box → Describe
│   ├── Image upload + box input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 3: Mask → Describe
│   ├── Image upload + mask upload
│   ├── Describe Region button
│   └── Outputs: visualization, description
└── Documentation & Citation
```
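For the Mask → Describe tab, an uploaded mask image typically needs to be reduced to a binary array before it is handed to the model. A minimal sketch with NumPy, using a hypothetical `to_binary_mask` helper (the actual preprocessing in app.py may differ, e.g. via PIL or OpenCV):

```python
import numpy as np

def to_binary_mask(mask_img: np.ndarray, threshold: int = 127) -> np.ndarray:
    """Collapse an uploaded RGB/grayscale mask image to a 0/1 uint8 mask.

    Hypothetical helper; the real app's preprocessing may differ.
    """
    if mask_img.ndim == 3:           # RGB upload: reduce to grayscale first
        mask_img = mask_img.mean(axis=-1)
    return (mask_img > threshold).astype(np.uint8)

binary = to_binary_mask(np.array([[0, 200], [255, 10]]))
# binary now holds 1 where the upload was bright, 0 elsewhere
```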

🚀 How to Run

```bash
# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py
```

The app will automatically:

  1. Load GAR-1B and SAM models
  2. Launch Gradio interface
  3. Allocate GPU on-demand with ZeroGPU

📊 Expected Performance

  • Model: GAR-1B (lightweight, fast inference)
  • GPU: NVIDIA H200, 70GB VRAM
  • Inference Time: ~10-30 seconds per region (depending on complexity)
  • Max New Tokens: 1024 (configurable)

⚠️ Important Notes

  1. Import Order: Always import spaces before torch/CUDA packages
  2. Python Version: Requires Python 3.10 (for FlashAttention wheel)
  3. FlashAttention: Uses prebuilt wheel (no compilation needed)
  4. Asset Files: Demo expects images in assets/ directory
  5. SingleRegionCaptionDataset: Required from evaluation module
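Note 4 means the example widgets break quietly if the images are missing, so a small startup check can surface that early. The `assets/` path comes from the note above; the accepted extensions here are illustrative only:

```python
from pathlib import Path

def check_assets(asset_dir: str = "assets") -> list:
    """Return the example images found under asset_dir; warn if none exist.

    The directory name comes from the app's notes; the extension set is an
    assumption -- adjust to whatever the demo actually ships with.
    """
    root = Path(asset_dir)
    if not root.is_dir():
        print(f"warning: {asset_dir}/ not found -- example images will be missing")
        return []
    return sorted(p for p in root.iterdir()
                  if p.suffix.lower() in {".jpg", ".jpeg", ".png"})
```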

πŸ“ Citation

@article{wang2025grasp,
  title={Grasp Any Region: Prompting MLLM to Understand the Dense World},
  author={Haochen Wang et al.},
  journal={arXiv preprint arXiv:2510.18876},
  year={2025}
}

Created: 2025-10-25
Status: ✅ Ready for deployment
Hardware: ZeroGPU (NVIDIA H200, 70GB VRAM)