# WARP.md
This file provides guidance to WARP (warp.dev) when working with code in this repository.
## Project Overview
SHARP (Single-image 3D Gaussian scene prediction) Gradio demo. Wraps Apple's SHARP model to predict 3D Gaussian scenes from single images, export `.ply` files, and optionally render camera trajectory videos.
Optimized for local CUDA GPUs (e.g., RTX 4090/3090/3070 Ti) or HuggingFace Spaces GPUs. Includes an MCP server for programmatic access.
## Development Commands
```bash
# Install dependencies (uses uv package manager)
uv sync
# Run the Gradio app (port 49200 by default)
uv run python app.py
# Run MCP server (stdio transport)
uv run python mcp_server.py
# Lint with ruff
uv run ruff check .
uv run ruff format .
```
## Codebase Map
```
ml-sharp/
├── app.py                    # Gradio UI (tabs: Run, Examples, About, Settings)
│   ├── build_demo()          # Main UI builder
│   ├── run_sharp()           # Inference entrypoint called by UI
│   └── discover_examples()   # Load precompiled examples
├── model_utils.py            # Core inference + rendering
│   ├── ModelWrapper          # Checkpoint loading, predictor caching
│   │   ├── predict_to_ply()  # Image → Gaussians → PLY
│   │   └── render_video()    # Gaussians → MP4 trajectory
│   ├── PredictionOutputs     # Dataclass for inference results
│   ├── configure_gpu_mode()  # Switch between local/Spaces GPU
│   └── predict_and_maybe_render_gpu  # Module-level entrypoint
├── hardware_config.py        # GPU hardware selection & persistence
│   ├── HardwareConfig        # Dataclass with mode, hardware, duration
│   ├── get_hardware_choices()   # Dropdown options
│   └── SPACES_HARDWARE_SPECS    # HF Spaces GPU specs & pricing
├── mcp_server.py             # MCP server for programmatic access
│   ├── sharp_predict         # Tool: image → PLY + video
│   ├── list_outputs          # Tool: list generated files
│   └── sharp://info          # Resource: GPU status, config
├── assets/examples/          # Precompiled example outputs
├── outputs/                  # Runtime outputs (PLY, MP4)
├── .hardware_config.json     # Persisted hardware settings
├── pyproject.toml            # Dependencies (uv)
└── WARP.md                   # This file
```
### Data Flow
```
Image → load_rgb() → predict_image() → Gaussians3D → save_ply() → PLY
                                            ↓
                                      render_video() → MP4
```
## Architecture
### Core Files
- `app.py` – Gradio UI with tabs for Run/Examples/About/Settings. Handles example discovery from `assets/examples/` via `manifest.json` or filename conventions.
- `model_utils.py` – SHARP model wrapper with checkpoint loading (HF Hub → CDN fallback), inference via `predict_to_ply()`, and CUDA video rendering via `render_video()`.
- `hardware_config.py` – GPU hardware selection between local CUDA and HuggingFace Spaces. Persists to `.hardware_config.json`.
- `mcp_server.py` – MCP server exposing the `sharp_predict` tool and `sharp://info` resource.
### Key Patterns
**Local CUDA mode**: Model kept on GPU by default (`SHARP_KEEP_MODEL_ON_DEVICE=1`) for better performance on dedicated GPUs.
**Spaces GPU mode**: Uses `@spaces.GPU` decorator for dynamic GPU allocation on HuggingFace Spaces. Configurable via Settings tab.
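A minimal sketch of how the two modes could be switched, with illustrative names only; the actual wiring lives in `configure_gpu_mode()` and `predict_and_maybe_render_gpu` in `model_utils.py`:
```python
import os

try:
    import spaces  # only available when running on HuggingFace Spaces
except ImportError:
    spaces = None

def _predict_and_maybe_render(image_path: str, render_video: bool = True):
    """Inference body omitted; runs on whichever device holds the model."""
    ...

if spaces is not None and os.environ.get("SPACE_ID"):
    # Spaces mode: a GPU is attached only for the duration of each call
    predict_entrypoint = spaces.GPU(_predict_and_maybe_render)
else:
    # Local CUDA mode: the model stays resident on the GPU between calls
    # (set SHARP_KEEP_MODEL_ON_DEVICE=0 to free VRAM after each run)
    predict_entrypoint = _predict_and_maybe_render
```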
**Checkpoint resolution order** (sketched below):
1. `SHARP_CHECKPOINT_PATH` env var
2. HF Hub cache
3. HF Hub download
4. Upstream CDN via `torch.hub`
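A hedged sketch of that order; the helper name and CDN URL are placeholders, and the real logic is in `model_utils.py`:
```python
import os
from pathlib import Path

def resolve_checkpoint() -> Path:
    """Illustrative only: walk the resolution order listed above."""
    # 1. Explicit override via environment variable
    override = os.environ.get("SHARP_CHECKPOINT_PATH")
    if override and Path(override).is_file():
        return Path(override)

    # 2./3. HF Hub cache, then download (hf_hub_download reuses a cached copy)
    try:
        from huggingface_hub import hf_hub_download
        return Path(hf_hub_download(
            repo_id=os.environ.get("SHARP_HF_REPO_ID", "apple/Sharp"),
            filename=os.environ.get("SHARP_HF_FILENAME", "sharp_2572gikvuh.pt"),
        ))
    except Exception:
        pass

    # 4. Upstream CDN via torch.hub (placeholder URL)
    import torch
    dst = Path.home() / ".cache" / "sharp" / "sharp_checkpoint.pt"
    dst.parent.mkdir(parents=True, exist_ok=True)
    torch.hub.download_url_to_file("https://example.com/sharp_checkpoint.pt", str(dst))
    return dst
```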
**Video rendering**: Requires CUDA (gsplat). Falls back gracefully on CPU-only systems by returning `None` for video path.
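For illustration, the fallback can be as simple as the following sketch (helper name is hypothetical; the real check lives in `model_utils.py`):
```python
import torch

def maybe_render_video(gaussians, out_path: str) -> str | None:
    """Return the MP4 path, or None when no CUDA device is available."""
    if not torch.cuda.is_available():
        # gsplat rasterization requires CUDA; skip rendering on CPU-only hosts
        return None
    # render_video(gaussians, out_path) would run here
    ...
    return out_path
```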
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `SHARP_PORT` | `49200` | Gradio server port |
| `SHARP_MCP_PORT` | `49201` | MCP server port |
| `SHARP_CHECKPOINT_PATH` | – | Override local checkpoint path |
| `SHARP_HF_REPO_ID` | `apple/Sharp` | HuggingFace repo |
| `SHARP_HF_FILENAME` | `sharp_2572gikvuh.pt` | Checkpoint filename |
| `SHARP_KEEP_MODEL_ON_DEVICE` | `1` | Keep model on GPU (set `0` to free VRAM) |
| `CUDA_VISIBLE_DEVICES` | – | GPU selection (e.g., `0` or `0,1`) |
## Gradio API
API is enabled by default. Access at `http://localhost:49200/?view=api`.
### Endpoint: `/api/run_sharp`
```python
import requests

response = requests.post(
    "http://localhost:49200/api/run_sharp",
    json={
        "data": [
            "/path/to/image.jpg",  # image_path
            "rotate_forward",      # trajectory_type
            0,                     # output_long_side (0 = match input)
            60,                    # num_frames
            30,                    # fps
            True,                  # render_video
        ]
    },
)
result = response.json()["data"]
video_path, ply_path, status = result
```
## MCP Server
Run the MCP server for integration with AI agents:
```bash
uv run python mcp_server.py
```
### MCP Config (for clients like Warp)
```json
{
  "mcpServers": {
    "sharp": {
      "command": "uv",
      "args": ["run", "python", "mcp_server.py"],
      "cwd": "/home/robin/CascadeProjects/ml-sharp"
    }
  }
}
```
### Tools
- `sharp_predict(image_path, render_video=True, trajectory_type="rotate_forward", ...)` – Run inference
- `list_outputs()` – List generated PLY/MP4 files
### Resources
- `sharp://info` β€” GPU status, configuration
- `sharp://help` β€” Usage documentation
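For reference, a minimal stdio client using the MCP Python SDK; the tool arguments here are illustrative, so check `sharp://help` for the authoritative schema:
```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Assumes it is launched from the repo root so mcp_server.py resolves
    server = StdioServerParameters(command="uv", args=["run", "python", "mcp_server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "sharp_predict",
                arguments={"image_path": "/path/to/image.jpg", "render_video": True},
            )
            print(result.content)

asyncio.run(main())
```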
## Multi-GPU Configuration
Select GPU via environment variable:
```bash
# Use GPU 0 (e.g., 4090)
CUDA_VISIBLE_DEVICES=0 uv run python app.py
# Use GPU 1 (e.g., 3090)
CUDA_VISIBLE_DEVICES=1 uv run python app.py
```
## HuggingFace Spaces GPU
The app supports HuggingFace Spaces paid GPUs for faster inference or larger models. Configure via the **Settings** tab.
### Available Hardware
| Hardware | VRAM | Price/hr | Best For |
|----------|------|----------|----------|
| ZeroGPU (H200) | 70GB | Free (PRO) | Demos, dynamic allocation |
| T4 small | 16GB | $0.40 | Light workloads |
| T4 medium | 16GB | $0.60 | Standard workloads |
| L4x1 | 24GB | $0.80 | Standard inference |
| L4x4 | 96GB | $3.80 | Multi-GPU |
| L40Sx1 | 48GB | $1.80 | Large models |
| L40Sx4 | 192GB | $8.30 | Very large models |
| A10G small | 24GB | $1.00 | Balanced |
| A10G large | 24GB | $1.50 | More CPU/RAM |
| A100 large | 80GB | $2.50 | Maximum VRAM |
### Deploying to Spaces
1. Push to HuggingFace Space
2. Set hardware in Space settings (or use `suggested_hardware` in README.md)
3. The app auto-detects Spaces environment via `SPACE_ID` env var
### README.md Metadata for Spaces
```yaml
---
title: SHARP - 3D Gaussian Scene Prediction
emoji: 🔪
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 6.2.0
python_version: 3.13.11
app_file: app.py
suggested_hardware: l4x1 # or zero-gpu, a100-large, etc.
startup_duration_timeout: 1h
preload_from_hub:
- apple/Sharp sharp_2572gikvuh.pt
---
```
## Examples System
Place precompiled outputs in `assets/examples/`:
- `<name>.{jpg,png,webp}` + `<name>.mp4` + `<name>.ply`
- Or define `assets/examples/manifest.json` with `{label, image, video, ply}` entries
## Multi-Image Stacking Roadmap
SHARP predicts 3D Gaussians from a single image. To "stack" multiple images into a unified scene:
### Required Components
1. **Pose Estimation** (`multi_view.py`)
- Estimate relative camera poses between images
- Options: COLMAP, hloc, or PnP-based
- Transform each prediction to common world frame
2. **Gaussian Merging** (`gaussian_merge.py`)
- Concatenate Gaussian parameters (means, covariances, colors, opacities)
- Deduplicate overlapping regions via density-based filtering
- Optional: fine-tune merged scene with photometric loss
3. **UI Changes**
- Multi-upload widget
- Alignment preview/validation
- Progress indicator for multi-image processing
### Data Structures
```python
from dataclasses import dataclass
from pathlib import Path

import torch

@dataclass
class AlignedGaussians:
    gaussians: Gaussians3D         # SHARP's Gaussian container type
    world_transform: torch.Tensor  # 4x4 SE(3)
    source_image: Path

def merge_gaussians(aligned: list[AlignedGaussians]) -> Gaussians3D:
    # 1. Transform each Gaussian's means by world_transform
    # 2. Concatenate all parameters
    # 3. Density-based pruning in overlapping regions
    ...
```
### Dependencies to Add
- `pycolmap` or `hloc` for pose estimation
- `open3d` for point cloud operations (optional)
### Implementation Phases
#### Phase 1: Basic Multi-Image Pipeline
- [ ] Add `multi_view.py` with `estimate_relative_pose(img1, img2)` using feature matching
- [ ] Add `gaussian_merge.py` with naive concatenation (no dedup)
- [ ] UI: Multi-file upload in new "Stack" tab
- [ ] Export merged PLY
#### Phase 2: Pose Estimation Options
- [ ] Integrate COLMAP sparse reconstruction for >2 images
- [ ] Add hloc (Hierarchical Localization) as lightweight alternative
- [ ] Fallback: manual pose input for known camera rigs
#### Phase 3: Gaussian Deduplication
- [ ] Implement KD-tree based nearest-neighbor pruning (see the sketch after this list)
- [ ] Merge overlapping Gaussians by averaging parameters
- [ ] Add confidence weighting based on view angle
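A rough sketch of the Phase 3 pruning step, assuming `scipy` is added as a dependency; a real implementation would also merge the parameters of the Gaussians it drops rather than discard them:
```python
import numpy as np
from scipy.spatial import cKDTree

def prune_indices(means: np.ndarray, radius: float = 0.01) -> np.ndarray:
    """Indices of Gaussians to keep after greedy radius-based pruning (sketch)."""
    tree = cKDTree(means)
    drop: set[int] = set()
    for i, j in sorted(tree.query_pairs(radius)):
        # For each close pair, keep the lower index and drop the higher one
        if i not in drop and j not in drop:
            drop.add(j)
    return np.array([k for k in range(len(means)) if k not in drop], dtype=int)
```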
#### Phase 4: Refinement (Optional)
- [ ] Photometric loss optimization on merged scene
- [ ] Iterative alignment refinement
- [ ] Support for depth priors from stereo/MVS
### API Design
```python
from pathlib import Path
from typing import Literal

import numpy as np

# multi_view.py
def estimate_poses(
    images: list[Path],
    method: Literal["colmap", "hloc", "pnp"] = "hloc",
) -> list[np.ndarray]:  # list of 4x4 world-to-camera transforms
    ...

# gaussian_merge.py
def merge_scenes(
    predictions: list[PredictionOutputs],
    poses: list[np.ndarray],
    deduplicate: bool = True,
    dedup_radius: float = 0.01,  # meters
) -> Gaussians3D:
    ...

# app.py (Stack tab)
def run_stack(
    images: list[str],  # Gradio multi-file upload
    pose_method: str,
    deduplicate: bool,
) -> tuple[str | None, str | None, str]:  # video, ply, status
    ...
```
### MCP Extension
```python
# mcp_server.py additions
@mcp.tool()
def sharp_stack(
    image_paths: list[str],
    pose_method: str = "hloc",
    deduplicate: bool = True,
    render_video: bool = True,
) -> dict:
    """Stack multiple images into a unified 3D Gaussian scene."""
    ...
```
### Technical Considerations
**Coordinate Systems**:
- SHARP outputs Gaussians in camera-centric coordinates
- Need to transform to world frame using estimated poses
- Convention: Y-up, -Z forward (OpenGL style)
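As a concrete example, mapping predicted means into the world frame with a 4x4 camera-to-world matrix (the inverse of a world-to-camera pose) might look like this sketch; rotations and covariances would need the same rotation applied:
```python
import torch

def means_to_world(means_cam: torch.Tensor, cam_to_world: torch.Tensor) -> torch.Tensor:
    """Transform (N, 3) camera-frame Gaussian means into the world frame."""
    R = cam_to_world[:3, :3]
    t = cam_to_world[:3, 3]
    return means_cam @ R.T + t
```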
**Memory Management**:
- Each SHARP prediction ~50-200MB GPU memory
- Batch processing with model unload between predictions
- Consider streaming merge for >10 images
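A small sketch of the unload step between predictions (helper name is hypothetical):
```python
import gc

import torch

def free_gpu(model: torch.nn.Module) -> None:
    """Move the model to CPU and release cached VRAM between predictions."""
    model.to("cpu")
    gc.collect()
    torch.cuda.empty_cache()
```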
**Quality Metrics**:
- Reprojection error for pose validation
- Gaussian density histogram for coverage analysis
- Visual comparison with ground truth (if available)
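For the reprojection-error check, a pinhole-model sketch (no distortion; names are illustrative):
```python
import numpy as np

def reprojection_error(
    pts_world: np.ndarray,     # (N, 3) triangulated points
    pts_image: np.ndarray,     # (N, 2) observed pixel coordinates
    K: np.ndarray,             # (3, 3) camera intrinsics
    world_to_cam: np.ndarray,  # (4, 4) pose being validated
) -> float:
    """Mean pixel distance between observed and reprojected points (sketch)."""
    R, t = world_to_cam[:3, :3], world_to_cam[:3, 3]
    cam = pts_world @ R.T + t          # world -> camera frame
    proj = cam @ K.T                   # apply intrinsics
    proj = proj[:, :2] / proj[:, 2:3]  # perspective divide (assumes z > 0)
    return float(np.linalg.norm(proj - pts_image, axis=1).mean())
```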