# Gradio App Summary for Grasp Any Region (GAR)
## ✅ Completion Status
Successfully created a comprehensive Gradio demo for the Grasp Any Region (GAR) project.
## 📁 Files Created/Modified
### 1. **app.py** (NEW)
- Complete Gradio interface with 3 tabs:
  - **Points → Describe**: Interactive point-based segmentation with SAM
  - **Box → Describe**: Bounding box-based segmentation
  - **Mask → Describe**: Direct mask upload for region description
- Features:
- ZeroGPU integration with `@spaces.GPU` decorator
- Proper import order (spaces first, then CUDA packages)
- SAM (Segment Anything Model) integration for interactive segmentation
- GAR-1B model for detailed region descriptions
  - Visualization with contours and input annotations (sketched after this list)
- Example images and clear instructions
- Error handling and status messages
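The contour-overlay visualization can be done with plain OpenCV; here is a minimal sketch (the function name and styling are illustrative, not the exact app.py code):

```python
import cv2
import numpy as np

def draw_mask_contours(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Tint a binary mask region on an RGB image and outline its boundary."""
    overlay = image.copy()
    region = mask > 0
    # Alpha-blend a highlight color over the masked pixels.
    highlight = np.array([255, 0, 0], dtype=np.float32)  # illustrative color choice
    overlay[region] = (0.5 * overlay[region] + 0.5 * highlight).astype(np.uint8)
    # Trace the mask boundary and draw it on top.
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    cv2.drawContours(overlay, contours, -1, (255, 255, 255), thickness=2)
    return overlay
```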
### 2. **requirements.txt** (UPDATED)
- Gradio 5.49.1 (required version)
- httpx constrained to >=0.24.1,<1.0 (Gradio compatibility)
- PyTorch 2.8.0 (pinned for FlashAttention compatibility)
- FlashAttention 2.8.3 prebuilt wheel (PyTorch 2.8, Python 3.10, CUDA 12, abiFALSE)
- spaces==0.30.4 for ZeroGPU
- All original dependencies preserved
- Segment Anything from GitHub
- Vision libraries (opencv-python, pillow, pycocotools)
- Transformers 4.56.2 and supporting ML libraries
## 🎯 Key Features
1. **Three Interaction Modes** (tied together in the sketch after this list):
- Points: Click or enter coordinates to segment regions
- Box: Draw or enter bounding boxes
- Mask: Upload pre-made masks directly
2. **Model Integration**:
- GAR-1B for region understanding (1 billion parameters)
- SAM ViT-Huge for automatic segmentation
- Both models loaded once at startup for efficiency
3. **ZeroGPU Optimization**:
- Proper `@spaces.GPU(duration=120)` decorator usage
- 2-minute GPU allocation per function call
- NVIDIA H200 with 70GB VRAM available
- Critical import order: `spaces` imported before torch
4. **User Experience**:
- Clear step-by-step instructions
- Example images included
- Real-time visualization with overlays
- Comprehensive error handling
- Professional UI with Gradio 5.x Soft theme
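A minimal sketch tying these features together: a ZeroGPU-decorated function that segments from click points with SAM and hands the mask to GAR. It assumes a module-level SAM `predictor` (see the loading sketch under Technical Details) and a hypothetical `describe_with_gar` helper; the real app.py wiring may differ.

```python
import numpy as np
import spaces  # must be imported before torch/CUDA packages

@spaces.GPU(duration=120)  # GPU is attached for up to 120 s per call
def points_to_description(image: np.ndarray, points: list) -> str:
    # SAM: treat every click as a positive prompt (label 1).
    predictor.set_image(image)
    masks, _, _ = predictor.predict(
        point_coords=np.array(points, dtype=np.float32),
        point_labels=np.ones(len(points), dtype=np.int32),
        multimask_output=False,
    )
    # GAR: turn the segmented region into a detailed description
    # (describe_with_gar is a hypothetical helper, not app.py code).
    return describe_with_gar(image, masks[0])
```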
## 🔧 Technical Details
### Import Order (CRITICAL)
```python
# 🚨 spaces MUST be imported FIRST
import spaces
# Then import CUDA packages
import torch
from transformers import AutoModel, AutoProcessor
```
This prevents the "CUDA has been initialized" error.
### FlashAttention Configuration
- Using prebuilt wheel for PyTorch 2.8.0
- Python 3.10 (cp310)
- CUDA 12 (cu12)
- abiFALSE (REQUIRED - never use abiTRUE)
- Wheel URL (pinned in requirements.txt as shown below): https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
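In requirements.txt, this wheel can be pinned as a PEP 508 direct reference (one plausible form; a bare URL line also works with pip):

```text
flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```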
### Model Loading Strategy
- Models loaded once at startup (outside decorated functions; sketched below)
- Moved to CUDA device after loading
- GPU-decorated functions only handle inference
- Efficient memory usage
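A sketch of this load-once pattern. The model IDs match the references below, but `trust_remote_code`, the dtype, and the SAM checkpoint path are assumptions rather than confirmed app.py code:

```python
import spaces  # imported first, before any CUDA package
import torch
from transformers import AutoModel, AutoProcessor
from segment_anything import SamPredictor, sam_model_registry

# Load once at module level; only inference runs inside @spaces.GPU functions.
gar_model = AutoModel.from_pretrained(
    "HaochenWang/GAR-1B",
    torch_dtype=torch.bfloat16,  # assumed dtype
    trust_remote_code=True,      # assumed: GAR ships custom modeling code
).cuda().eval()
gar_processor = AutoProcessor.from_pretrained(
    "HaochenWang/GAR-1B", trust_remote_code=True
)

# SAM ViT-Huge; the checkpoint filename is the standard release name.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").cuda()
predictor = SamPredictor(sam)
```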
## 📦 Dependencies Highlights
**Core:**
- gradio==5.49.1
- torch==2.8.0
- spaces==0.30.4
- flash-attn (prebuilt wheel)
**AI/ML:**
- transformers==4.56.2
- accelerate>=0.28.0
- timm==1.0.19
- peft==0.15.2
**Vision:**
- opencv-python
- pillow>=9.4.0
- segment-anything (from GitHub)
- pycocotools
## 🎨 UI Structure
```
Grasp Any Region (GAR) Demo
├── Introduction & Links
├── Tab 1: Points → Describe
│   ├── Image upload + points input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 2: Box → Describe
│   ├── Image upload + box input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 3: Mask → Describe
│   ├── Image upload + mask upload
│   ├── Describe Region button
│   └── Outputs: visualization, description
└── Documentation & Citation
```
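The same structure as a Gradio Blocks skeleton (event wiring and the remaining tabs are elided; a sketch, not the actual app.py):

```python
import gradio as gr

with gr.Blocks(theme=gr.themes.Soft(), title="Grasp Any Region (GAR) Demo") as demo:
    gr.Markdown("# Grasp Any Region (GAR) Demo")  # introduction & links
    with gr.Tab("Points → Describe"):
        image_in = gr.Image(label="Input image")
        points_in = gr.Textbox(label="Points (x,y per line)")
        mask_btn = gr.Button("Generate Mask")
        describe_btn = gr.Button("Describe Region")
        mask_out = gr.Image(label="Mask")
        vis_out = gr.Image(label="Visualization")
        desc_out = gr.Textbox(label="Description")
    # Tab 2 (Box → Describe) and Tab 3 (Mask → Describe) follow the same pattern.
    gr.Markdown("Documentation & citation go here.")

demo.launch()
```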
## 🚀 How to Run
```bash
# Install dependencies
pip install -r requirements.txt
# Run the app
python app.py
```
The app will automatically:
1. Load GAR-1B and SAM models
2. Launch Gradio interface
3. Allocate GPU on-demand with ZeroGPU
## 📊 Expected Performance
- **Model**: GAR-1B (lightweight, fast inference)
- **GPU**: NVIDIA H200, 70GB VRAM
- **Inference Time**: ~10-30 seconds per region (depending on complexity)
- **Max New Tokens**: 1024 (configurable; see the snippet below)
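Assuming GAR exposes a standard transformers-style generate call (not confirmed from app.py), the token budget would be set like this:

```python
# max_new_tokens caps the description length; 1024 is the default used here.
output_ids = gar_model.generate(**inputs, max_new_tokens=1024, do_sample=False)
```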
## ⚠️ Important Notes
1. **Import Order**: Always import `spaces` before torch/CUDA packages
2. **Python Version**: Requires Python 3.10 (for FlashAttention wheel)
3. **FlashAttention**: Uses prebuilt wheel (no compilation needed)
4. **Asset Files**: Demo expects images in `assets/` directory
5. **SingleRegionCaptionDataset**: Required from evaluation module
## 🔗 References
- **Paper**: https://arxiv.org/abs/2510.18876
- **GitHub**: https://github.com/Haochen-Wang409/Grasp-Any-Region
- **Model**: https://huggingface.co/HaochenWang/GAR-1B
- **SAM**: https://github.com/facebookresearch/segment-anything
## 📝 Citation
```bibtex
@article{wang2025grasp,
  title={Grasp Any Region: Prompting MLLM to Understand the Dense World},
  author={Wang, Haochen and others},
  journal={arXiv preprint arXiv:2510.18876},
  year={2025}
}
```
---
**Created**: 2025-10-25
**Status**: ✅ Ready for deployment
**Hardware**: zerogpu (NVIDIA H200, 70GB VRAM)