# Gradio App Summary for Grasp Any Region (GAR)

## ✅ Completion Status

Successfully created a comprehensive Gradio demo for the Grasp Any Region (GAR) project.

## 📁 Files Created/Modified

### 1. **app.py** (NEW)

- Complete Gradio interface with 3 tabs:
  - **Points → Describe**: Interactive point-based segmentation with SAM
  - **Box → Describe**: Bounding-box-based segmentation
  - **Mask → Describe**: Direct mask upload for region description
- Features:
  - ZeroGPU integration with the `@spaces.GPU` decorator
  - Proper import order (`spaces` first, then CUDA packages)
  - SAM (Segment Anything Model) integration for interactive segmentation
  - GAR-1B model for detailed region descriptions
  - Visualization with contours and input annotations
  - Example images and clear instructions
  - Error handling and status messages

### 2. **requirements.txt** (UPDATED)

- Gradio 5.49.1 (required version)
- httpx pinned to >=0.24.1,<1.0 (Gradio compatibility)
- PyTorch 2.8.0 (pinned for FlashAttention compatibility)
- FlashAttention 2.8.3 prebuilt wheel (PyTorch 2.8, Python 3.10, CUDA 12, abiFALSE)
- spaces==0.30.4 for ZeroGPU
- All original dependencies preserved:
  - Segment Anything from GitHub
  - Vision libraries (opencv-python, pillow, pycocotools)
  - Transformers 4.56.2 and supporting ML libraries

## 🎯 Key Features

1. **Three Interaction Modes**:
   - Points: Click or enter coordinates to segment regions
   - Box: Draw or enter bounding boxes
   - Mask: Upload pre-made masks directly

2. **Model Integration**:
   - GAR-1B for region understanding (1 billion parameters)
   - SAM ViT-Huge for automatic segmentation
   - Both models loaded once at startup for efficiency

3. **ZeroGPU Optimization**:
   - Proper `@spaces.GPU(duration=120)` decorator usage
   - 2-minute GPU allocation per function call
   - NVIDIA H200 with 70GB VRAM available
   - Critical import order: `spaces` imported before torch

4. **User Experience**:
   - Clear step-by-step instructions
   - Example images included
   - Real-time visualization with overlays
   - Comprehensive error handling
   - Professional UI with the Gradio 5.x Soft theme

## 🔧 Technical Details

### Import Order (CRITICAL)

```python
# 🚨 spaces MUST be imported FIRST
import spaces

# Then import CUDA packages
import torch
from transformers import AutoModel, AutoProcessor
```

This prevents the "CUDA has been initialized" error.
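Putting these constraints together, the sketch below shows the overall shape of `app.py`: `spaces` imported first, both models loaded once at startup, and all inference confined to `@spaces.GPU`-decorated functions (summarized again under "Model Loading Strategy" below). It is a minimal illustration rather than the actual app code: the SAM checkpoint filename, the `trust_remote_code`/dtype arguments, and the `build_gar_inputs` helper are assumptions, and the real GAR input packing follows the project's evaluation code (`SingleRegionCaptionDataset`).

```python
# Minimal structural sketch (NOT the exact app.py): the SAM checkpoint path,
# the from_pretrained kwargs, and build_gar_inputs are illustrative assumptions.
import spaces  # must be imported before any CUDA-touching package

import numpy as np
import torch
from segment_anything import SamPredictor, sam_model_registry
from transformers import AutoModel, AutoProcessor

# --- Loaded once at startup, outside any @spaces.GPU-decorated function ---
gar_model = AutoModel.from_pretrained(
    "HaochenWang/GAR-1B",        # model repo from the References section
    trust_remote_code=True,      # assumption: GAR ships custom modeling code
    torch_dtype=torch.bfloat16,  # assumption: dtype choice
).cuda().eval()
gar_processor = AutoProcessor.from_pretrained("HaochenWang/GAR-1B", trust_remote_code=True)

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").cuda()  # path is an assumption
sam_predictor = SamPredictor(sam)


def build_gar_inputs(image: np.ndarray, mask: np.ndarray, processor):
    """Hypothetical placeholder: the real prompt and mask packing follow the
    project's evaluation code (SingleRegionCaptionDataset)."""
    raise NotImplementedError


# --- GPU-decorated functions only run inference ---
@spaces.GPU(duration=120)  # 2-minute ZeroGPU allocation per call
def segment_from_points(image: np.ndarray, points: list[tuple[int, int]]) -> np.ndarray:
    """Run SAM with positive point prompts; return the highest-scoring mask."""
    sam_predictor.set_image(image)                 # image: HxWx3 uint8 RGB array
    coords = np.array(points, dtype=np.float32)    # (N, 2) pixel coordinates
    labels = np.ones(len(points), dtype=np.int32)  # 1 = foreground point
    masks, scores, _ = sam_predictor.predict(
        point_coords=coords, point_labels=labels, multimask_output=True
    )
    return masks[int(np.argmax(scores))]


@spaces.GPU(duration=120)
def describe_region(image: np.ndarray, mask: np.ndarray) -> str:
    """Describe the masked region with GAR-1B (call pattern shown schematically)."""
    inputs = build_gar_inputs(image, mask, gar_processor)
    with torch.inference_mode():
        output_ids = gar_model.generate(**inputs, max_new_tokens=1024)
    return gar_processor.decode(output_ids[0], skip_special_tokens=True)
```

In the app itself, the mask and description returned by these functions feed each tab's mask, visualization, and description outputs.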
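Two requirements.txt entries are also worth seeing in literal form, since they are not plain PyPI pins: Segment Anything is installed directly from GitHub, and FlashAttention comes from a prebuilt wheel referenced by URL. A sketch of the relevant lines (the `git+` reference format is an assumption; the version pins and wheel URL are the ones listed in this document):

```text
# Gradio / ZeroGPU
gradio==5.49.1
spaces==0.30.4
httpx>=0.24.1,<1.0

# Segment Anything installed straight from GitHub (exact ref may differ)
git+https://github.com/facebookresearch/segment-anything.git

# Prebuilt FlashAttention wheel: PyTorch 2.8, Python 3.10 (cp310), CUDA 12, abiFALSE
https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```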
### FlashAttention Configuration

- Using the prebuilt wheel for PyTorch 2.8.0
- Python 3.10 (cp310)
- CUDA 12 (cu12)
- abiFALSE (REQUIRED - never use abiTRUE)
- URL: https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

### Model Loading Strategy

- Models loaded once at startup (outside decorated functions)
- Moved to the CUDA device after loading
- GPU-decorated functions only handle inference
- Efficient memory usage

## 📋 Dependencies Highlights

**Core:**
- gradio==5.49.1
- torch==2.8.0
- spaces==0.30.4
- flash-attn (prebuilt wheel)

**AI/ML:**
- transformers==4.56.2
- accelerate>=0.28.0
- timm==1.0.19
- peft==0.15.2

**Vision:**
- opencv-python
- pillow>=9.4.0
- segment-anything (from GitHub)
- pycocotools

## 🎨 UI Structure

```
Grasp Any Region (GAR) Demo
├── Introduction & Links
├── Tab 1: Points → Describe
│   ├── Image upload + points input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 2: Box → Describe
│   ├── Image upload + box input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 3: Mask → Describe
│   ├── Image upload + mask upload
│   ├── Describe Region button
│   └── Outputs: visualization, description
└── Documentation & Citation
```

## 🚀 How to Run

```bash
# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py
```

The app will automatically:

1. Load the GAR-1B and SAM models
2. Launch the Gradio interface
3. Allocate GPU on demand with ZeroGPU

## 📊 Expected Performance

- **Model**: GAR-1B (lightweight, fast inference)
- **GPU**: NVIDIA H200, 70GB VRAM
- **Inference Time**: ~10-30 seconds per region (depending on complexity)
- **Max New Tokens**: 1024 (configurable)

## ⚠️ Important Notes

1. **Import Order**: Always import `spaces` before torch/CUDA packages
2. **Python Version**: Requires Python 3.10 (for the FlashAttention wheel)
3. **FlashAttention**: Uses a prebuilt wheel (no compilation needed)
4. **Asset Files**: The demo expects images in the `assets/` directory
5. **SingleRegionCaptionDataset**: Required from the evaluation module

## 🔗 References

- **Paper**: https://arxiv.org/abs/2510.18876
- **GitHub**: https://github.com/Haochen-Wang409/Grasp-Any-Region
- **Model**: https://huggingface.co/HaochenWang/GAR-1B
- **SAM**: https://github.com/facebookresearch/segment-anything

## 📝 Citation

```bibtex
@article{wang2025grasp,
  title={Grasp Any Region: Prompting MLLM to Understand the Dense World},
  author={Haochen Wang et al.},
  journal={arXiv preprint arXiv:2510.18876},
  year={2025}
}
```

---

**Created**: 2025-10-25
**Status**: ✅ Ready for deployment
**Hardware**: zerogpu (NVIDIA H200, 70GB VRAM)