# Gradio App Summary for Grasp Any Region (GAR)
## ✅ Completion Status
Successfully created a comprehensive Gradio demo for the Grasp Any Region (GAR) project.
## πŸ“ Files Created/Modified
### 1. **app.py** (NEW)
- Complete Gradio interface with 3 tabs:
- **Points → Describe**: Interactive point-based segmentation with SAM
- **Box → Describe**: Bounding box-based segmentation
- **Mask → Describe**: Direct mask upload for region description
- Features:
- ZeroGPU integration with `@spaces.GPU` decorator
- Proper import order (spaces first, then CUDA packages)
- SAM (Segment Anything Model) integration for interactive segmentation (see the point-prompt sketch after this list)
- GAR-1B model for detailed region descriptions
- Visualization with contours and input annotations
- Example images and clear instructions
- Error handling and status messages
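To make the SAM integration concrete (as referenced in the list above), here is a minimal sketch of a point-prompted call; the checkpoint filename, function name, and mask-selection logic are illustrative assumptions rather than the exact code in `app.py`:
```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

def segment_from_point(image_rgb: np.ndarray, x: int, y: int) -> np.ndarray:
    """Return SAM's best mask for a single foreground click at (x, y)."""
    # Checkpoint name is the standard ViT-H release; assumed, not verified
    # against the Space's actual file layout.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)  # HxWx3 uint8 RGB array
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),  # clicked pixel coordinates
        point_labels=np.array([1]),       # 1 = foreground point
        multimask_output=True,
    )
    return masks[scores.argmax()]         # keep the highest-scoring proposal
```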
### 2. **requirements.txt** (UPDATED)
- Gradio 5.49.1 (required version)
- httpx pinned to >=0.24.1,<1.0 for Gradio compatibility
- PyTorch 2.8.0 (pinned for FlashAttention compatibility)
- FlashAttention 2.8.3 prebuilt wheel (PyTorch 2.8, Python 3.10, CUDA 12, abiFALSE)
- spaces==0.30.4 for ZeroGPU
- All original dependencies preserved
- Segment Anything from GitHub
- Vision libraries (opencv-python, pillow, pycocotools)
- Transformers 4.56.2 and supporting ML libraries
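For reference, an illustrative excerpt of the resulting file (pins taken from this list and the Dependencies section; the FlashAttention wheel URL appears under Technical Details; the actual file may contain additional entries or a different order):
```text
gradio==5.49.1
httpx>=0.24.1,<1.0
torch==2.8.0
spaces==0.30.4
transformers==4.56.2
accelerate>=0.28.0
timm==1.0.19
peft==0.15.2
opencv-python
pillow>=9.4.0
pycocotools
git+https://github.com/facebookresearch/segment-anything.git
flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```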
## 🎯 Key Features
1. **Three Interaction Modes**:
- Points: Click or enter coordinates to segment regions
- Box: Draw or enter bounding boxes
- Mask: Upload pre-made masks directly
2. **Model Integration**:
- GAR-1B for region understanding (1 billion parameters)
- SAM ViT-Huge for automatic segmentation
- Both models loaded once at startup for efficiency
3. **ZeroGPU Optimization**:
- Proper `@spaces.GPU(duration=120)` decorator usage
- 2-minute GPU allocation per function call
- NVIDIA H200 with 70GB VRAM available
- Critical import order: `spaces` imported before torch
4. **User Experience**:
- Clear step-by-step instructions
- Example images included
- Real-time visualization with overlays (a contour-overlay sketch follows this list)
- Comprehensive error handling
- Professional UI with Gradio 5.x Soft theme
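As noted above, a minimal sketch of the contour-overlay style of visualization; the tint color, blend weight, and line thickness are illustrative assumptions, not the app's exact rendering:
```python
import cv2
import numpy as np

def overlay_mask(image_rgb: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Tint the masked region red and trace its outer contour in white."""
    vis = image_rgb.copy()
    tint = np.array([255, 0, 0], dtype=np.float32)
    # Blend the mask region 50/50 with the tint color (assumed weights).
    vis[mask > 0] = (0.5 * vis[mask > 0] + 0.5 * tint).astype(np.uint8)
    # Trace the mask boundary so the region stays visible over any image.
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    cv2.drawContours(vis, contours, -1, (255, 255, 255), 2)
    return vis
```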
## 🔧 Technical Details
### Import Order (CRITICAL)
```python
# 🚨 spaces MUST be imported FIRST
import spaces
# Then import CUDA packages
import torch
from transformers import AutoModel, AutoProcessor
```
Importing `spaces` before any CUDA package prevents ZeroGPU's "CUDA has been initialized" startup error.
### FlashAttention Configuration
- Using prebuilt wheel for PyTorch 2.8.0
- Python 3.10 (cp310)
- CUDA 12 (cu12)
- abiFALSE (REQUIRED - never use abiTRUE)
- URL: https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
### Model Loading Strategy
- Models loaded once at startup (outside decorated functions)
- Moved to CUDA device after loading
- GPU-decorated functions only handle inference
- Efficient memory usage
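A minimal sketch of this pattern, assuming `HaochenWang/GAR-1B` loads via `AutoModel` with `trust_remote_code` in bfloat16 and the standard SAM ViT-H checkpoint; the real `app.py` may organize this differently:
```python
import spaces  # must precede any CUDA-touching import (see Import Order above)
import torch
from transformers import AutoModel, AutoProcessor
from segment_anything import SamPredictor, sam_model_registry

# Loaded once in the main process; ZeroGPU attaches the GPU only while a
# decorated function runs. dtype and trust_remote_code are assumptions.
model = AutoModel.from_pretrained(
    "HaochenWang/GAR-1B", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda").eval()
processor = AutoProcessor.from_pretrained("HaochenWang/GAR-1B", trust_remote_code=True)

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)

@spaces.GPU(duration=120)  # 2-minute GPU allocation per call
def describe_region(image, mask, prompt):
    # Only inference happens here; input packing via the GAR processor
    # (and SingleRegionCaptionDataset) is elided in this sketch.
    ...
```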
## 📋 Dependencies Highlights
**Core:**
- gradio==5.49.1
- torch==2.8.0
- spaces==0.30.4
- flash-attn (prebuilt wheel)
**AI/ML:**
- transformers==4.56.2
- accelerate>=0.28.0
- timm==1.0.19
- peft==0.15.2
**Vision:**
- opencv-python
- pillow>=9.4.0
- segment-anything (from GitHub)
- pycocotools
## 🎨 UI Structure
```
Grasp Any Region (GAR) Demo
├── Introduction & Links
├── Tab 1: Points → Describe
│   ├── Image upload + points input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 2: Box → Describe
│   ├── Image upload + box input
│   ├── Generate Mask button
│   ├── Describe Region button
│   └── Outputs: mask, visualization, description
├── Tab 3: Mask → Describe
│   ├── Image upload + mask upload
│   ├── Describe Region button
│   └── Outputs: visualization, description
└── Documentation & Citation
```
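A hedged skeleton of how this tree maps onto Gradio 5.x `Blocks`; component labels and handler wiring are illustrative assumptions, and only Tab 1 is spelled out:
```python
import gradio as gr

with gr.Blocks(theme=gr.themes.Soft(), title="Grasp Any Region (GAR) Demo") as demo:
    gr.Markdown("# Grasp Any Region (GAR) Demo")  # introduction & links

    with gr.Tab("Points → Describe"):
        image_in = gr.Image(label="Image", type="numpy")
        points_in = gr.Textbox(label="Points (x,y per line)")
        gen_btn = gr.Button("Generate Mask")
        desc_btn = gr.Button("Describe Region")
        mask_out = gr.Image(label="Mask")
        vis_out = gr.Image(label="Visualization")
        text_out = gr.Textbox(label="Description")
        # Hypothetical handlers; the real app wires its own functions:
        # gen_btn.click(segment_fn, [image_in, points_in], [mask_out, vis_out])
        # desc_btn.click(describe_fn, [image_in, mask_out], text_out)

    # Tabs 2 (Box → Describe) and 3 (Mask → Describe) follow the same pattern.

demo.launch()
```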
## 🚀 How to Run
```bash
# Install dependencies
pip install -r requirements.txt
# Run the app
python app.py
```
The app will automatically:
1. Load GAR-1B and SAM models
2. Launch Gradio interface
3. Allocate GPU on-demand with ZeroGPU
## 📊 Expected Performance
- **Model**: GAR-1B (lightweight, fast inference)
- **GPU**: NVIDIA H200, 70GB VRAM
- **Inference Time**: ~10-30 seconds per region (depending on complexity)
- **Max New Tokens**: 1024 (configurable)
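A hedged helper illustrating the configurable token budget; `model`, `tokenizer`, and `inputs` are assumed to come from the GAR pipeline and are not the app's exact objects:
```python
import torch

@torch.inference_mode()
def generate_caption(model, tokenizer, inputs: dict, max_new_tokens: int = 1024) -> str:
    """Greedy decoding with the token budget noted above (default 1024)."""
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```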
## ⚠️ Important Notes
1. **Import Order**: Always import `spaces` before torch/CUDA packages
2. **Python Version**: Requires Python 3.10 (for FlashAttention wheel)
3. **FlashAttention**: Uses prebuilt wheel (no compilation needed)
4. **Asset Files**: Demo expects images in `assets/` directory
5. **SingleRegionCaptionDataset**: the demo imports this class from the project's evaluation module
## 🔗 References
- **Paper**: https://arxiv.org/abs/2510.18876
- **GitHub**: https://github.com/Haochen-Wang409/Grasp-Any-Region
- **Model**: https://huggingface.co/HaochenWang/GAR-1B
- **SAM**: https://github.com/facebookresearch/segment-anything
## πŸ“ Citation
```bibtex
@article{wang2025grasp,
  title={Grasp Any Region: Prompting MLLM to Understand the Dense World},
  author={Wang, Haochen and others},
  journal={arXiv preprint arXiv:2510.18876},
  year={2025}
}
```
---
**Created**: 2025-10-25
**Status**: ✅ Ready for deployment
**Hardware**: zerogpu (NVIDIA H200, 70GB VRAM)