# Gradio App Summary for Grasp Any Region (GAR)
## ✅ Completion Status
Successfully created a comprehensive Gradio demo for the Grasp Any Region (GAR) project.
## πŸ“ Files Created/Modified
### 1. **app.py** (NEW)
- Complete Gradio interface with 3 tabs:
- **Points → Describe**: Interactive point-based segmentation with SAM
- **Box → Describe**: Bounding box-based segmentation
- **Mask → Describe**: Direct mask upload for region description
- Features:
- ZeroGPU integration with `@spaces.GPU` decorator
- Proper import order (spaces first, then CUDA packages)
- SAM (Segment Anything Model) integration for interactive segmentation (see the point-prompt sketch after this list)
- GAR-1B model for detailed region descriptions
- Visualization with contours and input annotations
- Example images and clear instructions
- Error handling and status messages
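To make the SAM integration concrete (as referenced in the list above), here is a minimal sketch of a point-prompted call; the checkpoint filename, function name, and mask-selection logic are illustrative assumptions rather than the exact code in `app.py`:
```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

def segment_from_point(image_rgb: np.ndarray, x: int, y: int) -> np.ndarray:
    """Return SAM's best mask for a single foreground click at (x, y)."""
    # Checkpoint name is the standard ViT-H release; assumed, not verified
    # against the Space's actual file layout.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)  # HxWx3 uint8 RGB array
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),  # clicked pixel coordinates
        point_labels=np.array([1]),       # 1 = foreground point
        multimask_output=True,
    )
    return masks[scores.argmax()]         # keep the highest-scoring proposal
```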
### 2. **requirements.txt** (UPDATED)
- Gradio 5.49.1 (required version)
- httpx pinned to >=0.24.1,<1.0 for Gradio compatibility
- PyTorch 2.8.0 (pinned for FlashAttention compatibility)
- FlashAttention 2.8.3 prebuilt wheel (PyTorch 2.8, Python 3.10, CUDA 12, abiFALSE)
- spaces==0.30.4 for ZeroGPU
- All original dependencies preserved
- Segment Anything from GitHub
- Vision libraries (opencv-python, pillow, pycocotools)
- Transformers 4.56.2 and supporting ML libraries
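For reference, an illustrative excerpt of the resulting file (pins taken from this list and the Dependencies section; the FlashAttention wheel URL appears under Technical Details; the actual file may contain additional entries or a different order):
```text
gradio==5.49.1
httpx>=0.24.1,<1.0
torch==2.8.0
spaces==0.30.4
transformers==4.56.2
accelerate>=0.28.0
timm==1.0.19
peft==0.15.2
opencv-python
pillow>=9.4.0
pycocotools
git+https://github.com/facebookresearch/segment-anything.git
flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```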
## 🎯 Key Features
1. **Three Interaction Modes**:
- Points: Click or enter coordinates to segment regions
- Box: Draw or enter bounding boxes
- Mask: Upload pre-made masks directly
2. **Model Integration**:
- GAR-1B for region understanding (1 billion parameters)
- SAM ViT-Huge for automatic segmentation
- Both models loaded once at startup for efficiency
3. **ZeroGPU Optimization**:
- Proper `@spaces.GPU(duration=120)` decorator usage
- 2-minute GPU allocation per function call
- NVIDIA H200 with 70GB VRAM available
- Critical import order: `spaces` imported before torch
4. **User Experience**:
- Clear step-by-step instructions
- Example images included
- Real-time visualization with overlays (a contour-overlay sketch follows this list)
- Comprehensive error handling
- Professional UI with Gradio 5.x Soft theme
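As noted above, a minimal sketch of the contour-overlay style of visualization; the tint color, blend weight, and line thickness are illustrative assumptions, not the app's exact rendering:
```python
import cv2
import numpy as np

def overlay_mask(image_rgb: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Tint the masked region red and trace its outer contour in white."""
    vis = image_rgb.copy()
    tint = np.array([255, 0, 0], dtype=np.float32)
    # Blend the mask region 50/50 with the tint color (assumed weights).
    vis[mask > 0] = (0.5 * vis[mask > 0] + 0.5 * tint).astype(np.uint8)
    # Trace the mask boundary so the region stays visible over any image.
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    cv2.drawContours(vis, contours, -1, (255, 255, 255), 2)
    return vis
```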
## 🔧 Technical Details
### Import Order (CRITICAL)
```python
# 🚨 spaces MUST be imported FIRST
import spaces
# Then import CUDA packages
import torch
from transformers import AutoModel, AutoProcessor
```
Importing `spaces` before any CUDA package prevents ZeroGPU's "CUDA has been initialized" startup error.
### FlashAttention Configuration
- Using prebuilt wheel for PyTorch 2.8.0
- Python 3.10 (cp310)
- CUDA 12 (cu12)
- abiFALSE (REQUIRED - never use abiTRUE)
- URL: https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
### Model Loading Strategy
- Models loaded once at startup (outside decorated functions)
- Moved to CUDA device after loading
- GPU-decorated functions only handle inference
- Efficient memory usage
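A minimal sketch of this pattern, assuming `HaochenWang/GAR-1B` loads via `AutoModel` with `trust_remote_code` in bfloat16 and the standard SAM ViT-H checkpoint; the real `app.py` may organize this differently:
```python
import spaces  # must precede any CUDA-touching import (see Import Order above)
import torch
from transformers import AutoModel, AutoProcessor
from segment_anything import SamPredictor, sam_model_registry

# Loaded once in the main process; ZeroGPU attaches the GPU only while a
# decorated function runs. dtype and trust_remote_code are assumptions.
model = AutoModel.from_pretrained(
    "HaochenWang/GAR-1B", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda").eval()
processor = AutoProcessor.from_pretrained("HaochenWang/GAR-1B", trust_remote_code=True)

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)

@spaces.GPU(duration=120)  # 2-minute GPU allocation per call
def describe_region(image, mask, prompt):
    # Only inference happens here; input packing via the GAR processor
    # (and SingleRegionCaptionDataset) is elided in this sketch.
    ...
```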
## 📋 Dependencies Highlights
**Core:**
- gradio==5.49.1
- torch==2.8.0
- spaces==0.30.4
- flash-attn (prebuilt wheel)
**AI/ML:**
- transformers==4.56.2
- accelerate>=0.28.0
- timm==1.0.19
- peft==0.15.2
**Vision:**
- opencv-python
- pillow>=9.4.0
- segment-anything (from GitHub)
- pycocotools
## 🎨 UI Structure
```
Grasp Any Region (GAR) Demo
├── Introduction & Links
├── Tab 1: Points → Describe
│   ├── Image upload + points input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 2: Box → Describe
│   ├── Image upload + box input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 3: Mask → Describe
│   ├── Image upload + mask upload
│   ├── Describe Region button
│   └── Outputs: visualization, description
└── Documentation & Citation
```
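A hedged skeleton of how this tree maps onto Gradio 5.x `Blocks`; component labels and handler wiring are illustrative assumptions, and only Tab 1 is spelled out:
```python
import gradio as gr

with gr.Blocks(theme=gr.themes.Soft(), title="Grasp Any Region (GAR) Demo") as demo:
    gr.Markdown("# Grasp Any Region (GAR) Demo")  # introduction & links

    with gr.Tab("Points → Describe"):
        image_in = gr.Image(label="Image", type="numpy")
        points_in = gr.Textbox(label="Points (x,y per line)")
        gen_btn = gr.Button("Generate Mask")
        desc_btn = gr.Button("Describe Region")
        mask_out = gr.Image(label="Mask")
        vis_out = gr.Image(label="Visualization")
        text_out = gr.Textbox(label="Description")
        # Hypothetical handlers; the real app wires its own functions:
        # gen_btn.click(segment_fn, [image_in, points_in], [mask_out, vis_out])
        # desc_btn.click(describe_fn, [image_in, mask_out], text_out)

    # Tabs 2 (Box → Describe) and 3 (Mask → Describe) follow the same pattern.

demo.launch()
```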
## 🚀 How to Run
```bash
# Install dependencies
pip install -r requirements.txt
# Run the app
python app.py
```
The app will automatically:
1. Load GAR-1B and SAM models
2. Launch Gradio interface
3. Allocate GPU on-demand with ZeroGPU
## 📊 Expected Performance
- **Model**: GAR-1B (lightweight, fast inference)
- **GPU**: NVIDIA H200, 70GB VRAM
- **Inference Time**: ~10-30 seconds per region (depending on complexity)
- **Max New Tokens**: 1024 (configurable)
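A hedged helper illustrating the configurable token budget; `model`, `tokenizer`, and `inputs` are assumed to come from the GAR pipeline and are not the app's exact objects:
```python
import torch

@torch.inference_mode()
def generate_caption(model, tokenizer, inputs: dict, max_new_tokens: int = 1024) -> str:
    """Greedy decoding with the token budget noted above (default 1024)."""
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```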
## ⚠️ Important Notes
1. **Import Order**: Always import `spaces` before torch/CUDA packages
2. **Python Version**: Requires Python 3.10 (for FlashAttention wheel)
3. **FlashAttention**: Uses prebuilt wheel (no compilation needed)
4. **Asset Files**: Demo expects images in `assets/` directory
5. **SingleRegionCaptionDataset**: the demo imports this class from the project's evaluation module
## 🔗 References
- **Paper**: https://arxiv.org/abs/2510.18876
- **GitHub**: https://github.com/Haochen-Wang409/Grasp-Any-Region
- **Model**: https://huggingface.co/HaochenWang/GAR-1B
- **SAM**: https://github.com/facebookresearch/segment-anything
## πŸ“ Citation
```bibtex
@article{wang2025grasp,
  title={Grasp Any Region: Prompting MLLM to Understand the Dense World},
  author={Wang, Haochen and others},
  journal={arXiv preprint arXiv:2510.18876},
  year={2025}
}
```
---
**Created**: 2025-10-25
**Status**: ✅ Ready for deployment
**Hardware**: zerogpu (NVIDIA H200, 70GB VRAM)