# Gradio App Summary for Grasp Any Region (GAR)

## βœ… Completion Status

Successfully created a comprehensive Gradio demo for the Grasp Any Region (GAR) project.

## πŸ“ Files Created/Modified

### 1. **app.py** (NEW)
- Complete Gradio interface with 3 tabs:
  - **Points β†’ Describe**: Interactive point-based segmentation with SAM
  - **Box β†’ Describe**: Bounding box-based segmentation
  - **Mask β†’ Describe**: Direct mask upload for region description
- Features:
  - ZeroGPU integration with `@spaces.GPU` decorator
  - Proper import order (spaces first, then CUDA packages)
  - SAM (Segment Anything Model) integration for interactive segmentation
  - GAR-1B model for detailed region descriptions
  - Visualization with contours and input annotations (see the OpenCV sketch after this list)
  - Example images and clear instructions
  - Error handling and status messages
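
The contour visualization can be built from standard OpenCV calls. A minimal sketch, assuming a binary mask and an RGB numpy image (function name, colors, and blending factor are illustrative, not taken from `app.py`):

```python
import cv2
import numpy as np

def overlay_mask(image_rgb, mask, color=(255, 0, 0), alpha=0.4):
    """Blend a binary mask onto the image and trace its contour."""
    overlay = image_rgb.copy()
    overlay[mask.astype(bool)] = color            # solid fill on the masked pixels
    blended = cv2.addWeighted(image_rgb, 1 - alpha, overlay, alpha, 0)
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    cv2.drawContours(blended, contours, -1, color, thickness=2)
    return blended
```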

### 2. **requirements.txt** (UPDATED)
- Gradio 5.49.1 (required version)
- httpx constrained to >=0.24.1,<1.0 for Gradio compatibility
- PyTorch 2.8.0 (pinned for FlashAttention compatibility)
- FlashAttention 2.8.3 prebuilt wheel (PyTorch 2.8, Python 3.10, CUDA 12, abiFALSE)
- spaces==0.30.4 for ZeroGPU
- All original dependencies preserved
- Segment Anything from GitHub
- Vision libraries (opencv-python, pillow, pycocotools)
- Transformers 4.56.2 and supporting ML libraries

## 🎯 Key Features

1. **Three Interaction Modes**:
   - Points: Click or enter coordinates to segment regions
   - Box: Draw or enter bounding boxes
   - Mask: Upload pre-made masks directly

2. **Model Integration**:
   - GAR-1B for region understanding (1 billion parameters)
   - SAM ViT-Huge for automatic segmentation (see the sketch after this feature list)
   - Both models loaded once at startup for efficiency

3. **ZeroGPU Optimization**:
   - Proper `@spaces.GPU(duration=120)` decorator usage
   - 2-minute GPU allocation per function call
   - NVIDIA H200 with 70GB VRAM available
   - Critical import order: `spaces` imported before torch

4. **User Experience**:
   - Clear step-by-step instructions
   - Example images included
   - Real-time visualization with overlays
   - Comprehensive error handling
   - Professional UI with Gradio 5.x Soft theme
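
As referenced above, a minimal sketch of how the point- and box-based modes can drive SAM (checkpoint filename, device handling, and function names are assumptions, not the exact `app.py` code):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM ViT-Huge once at startup (checkpoint path is an assumption).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)

def mask_from_points(image_rgb, points, labels):
    """Segment the region indicated by clicked points (label 1 = foreground)."""
    predictor.set_image(image_rgb)                       # HxWx3 uint8 RGB array
    masks, scores, _ = predictor.predict(
        point_coords=np.array(points, dtype=np.float32),
        point_labels=np.array(labels, dtype=np.int32),
        multimask_output=False,
    )
    return masks[0]                                      # boolean HxW mask

def mask_from_box(image_rgb, box_xyxy):
    """Segment the region inside a bounding box given as [x1, y1, x2, y2]."""
    predictor.set_image(image_rgb)
    masks, _, _ = predictor.predict(box=np.array(box_xyxy), multimask_output=False)
    return masks[0]
```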

## πŸ”§ Technical Details

### Import Order (CRITICAL)
```python
# 🚨 spaces MUST be imported FIRST
import spaces

# Then import CUDA packages
import torch
from transformers import AutoModel, AutoProcessor
```

Importing `spaces` first lets it set up ZeroGPU before any CUDA context is created, which prevents the "CUDA has been initialized" error.

### FlashAttention Configuration
- Using prebuilt wheel for PyTorch 2.8.0
- Python 3.10 (cp310)
- CUDA 12 (cu12)
- abiFALSE (REQUIRED - never use abiTRUE)
- URL: https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
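
In requirements-file form this typically becomes a direct wheel reference, roughly as follows (a sketch; the actual `requirements.txt` may order and pin entries differently):

```
torch==2.8.0
flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```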

### Model Loading Strategy
- Models loaded once at startup (outside decorated functions)
- Moved to CUDA device after loading
- GPU-decorated functions only handle inference
- Efficient memory usage
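
A minimal sketch of this pattern, assuming the GAR-1B checkpoint loads through the `transformers` Auto classes and exposes a standard `generate` interface (the dtype, `trust_remote_code`, and the way image/mask inputs are prepared are assumptions; the real `app.py` may differ):

```python
import spaces                                   # must be imported before torch (see above)
import torch
from transformers import AutoModel, AutoProcessor

MODEL_ID = "HaochenWang/GAR-1B"                 # model repo listed in the references

# Load once at startup, outside any GPU-decorated function.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda").eval()

@spaces.GPU(duration=120)                       # 2-minute ZeroGPU allocation per call
def describe_region(inputs, max_new_tokens=1024):
    """Inference only; model loading already happened above.

    `inputs` is assumed to be the dict of tensors produced by the processor
    from the image and region mask (preparation is model-specific and omitted).
    """
    inputs = {k: v.to("cuda") for k, v in inputs.items()}
    with torch.inference_mode():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```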

## πŸ“‹ Dependencies Highlights

**Core:**
- gradio==5.49.1
- torch==2.8.0
- spaces==0.30.4
- flash-attn (prebuilt wheel)

**AI/ML:**
- transformers==4.56.2
- accelerate>=0.28.0
- timm==1.0.19
- peft==0.15.2

**Vision:**
- opencv-python
- pillow>=9.4.0
- segment-anything (from GitHub)
- pycocotools

## 🎨 UI Structure

```
Grasp Any Region (GAR) Demo
β”œβ”€β”€ Introduction & Links
β”œβ”€β”€ Tab 1: Points β†’ Describe
β”‚   β”œβ”€β”€ Image upload + points input
β”‚   β”œβ”€β”€ Generate Mask button
β”‚   β”œβ”€β”€ Describe Region button
β”‚   └── Outputs: mask, visualization, description
β”œβ”€β”€ Tab 2: Box β†’ Describe
β”‚   β”œβ”€β”€ Image upload + box input
β”‚   β”œβ”€β”€ Generate Mask button
β”‚   β”œβ”€β”€ Describe Region button
β”‚   └── Outputs: mask, visualization, description
β”œβ”€β”€ Tab 3: Mask β†’ Describe
β”‚   β”œβ”€β”€ Image upload + mask upload
β”‚   β”œβ”€β”€ Describe Region button
β”‚   └── Outputs: visualization, description
└── Documentation & Citation
```
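
A skeleton of how this layout maps onto Gradio Blocks (component and handler names are illustrative, and only the first tab is shown):

```python
import gradio as gr

def generate_mask_ui(image, points_text):
    """Hypothetical handler: would call SAM and return (mask, visualization)."""
    return image, image

def describe_region_ui(image, mask):
    """Hypothetical handler: would call GAR-1B and return the description."""
    return "description goes here"

with gr.Blocks(theme=gr.themes.Soft(), title="Grasp Any Region (GAR) Demo") as demo:
    gr.Markdown("# Grasp Any Region (GAR) Demo")        # introduction & links
    with gr.Tab("Points β†’ Describe"):
        image_in = gr.Image(label="Input image", type="numpy")
        points_in = gr.Textbox(label="Points (one 'x,y' pair per line)")
        mask_btn = gr.Button("Generate Mask")
        desc_btn = gr.Button("Describe Region")
        mask_out = gr.Image(label="Mask")
        vis_out = gr.Image(label="Visualization")
        text_out = gr.Textbox(label="Description")
        mask_btn.click(generate_mask_ui, [image_in, points_in], [mask_out, vis_out])
        desc_btn.click(describe_region_ui, [image_in, mask_out], text_out)
    # Tabs 2 (Box β†’ Describe) and 3 (Mask β†’ Describe) follow the same pattern.

if __name__ == "__main__":
    demo.launch()
```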

## πŸš€ How to Run

```bash
# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py
```

The app will automatically:
1. Load the GAR-1B and SAM models
2. Launch the Gradio interface
3. Allocate a GPU on demand via ZeroGPU

## πŸ“Š Expected Performance

- **Model**: GAR-1B (lightweight, fast inference)
- **GPU**: NVIDIA H200, 70GB VRAM
- **Inference Time**: ~10-30 seconds per region (depending on complexity)
- **Max New Tokens**: 1024 (configurable)

## ⚠️ Important Notes

1. **Import Order**: Always import `spaces` before torch/CUDA packages
2. **Python Version**: Requires Python 3.10 (for FlashAttention wheel)
3. **FlashAttention**: Uses prebuilt wheel (no compilation needed)
4. **Asset Files**: Demo expects images in `assets/` directory
5. **SingleRegionCaptionDataset**: required; imported from the project's evaluation module

## πŸ”— References

- **Paper**: https://arxiv.org/abs/2510.18876
- **GitHub**: https://github.com/Haochen-Wang409/Grasp-Any-Region
- **Model**: https://huggingface.co/HaochenWang/GAR-1B
- **SAM**: https://github.com/facebookresearch/segment-anything

## πŸ“ Citation

```bibtex
@article{wang2025grasp,
  title={Grasp Any Region: Prompting MLLM to Understand the Dense World},
  author={Wang, Haochen and others},
  journal={arXiv preprint arXiv:2510.18876},
  year={2025}
}
```

---

**Created**: 2025-10-25
**Status**: βœ… Ready for deployment
**Hardware**: zerogpu (NVIDIA H200, 70GB VRAM)