# WARP.md

This file provides guidance to WARP (warp.dev) when working with code in this repository.

## Project Overview

A Gradio demo for SHARP (single-image 3D Gaussian scene prediction). It wraps Apple's SHARP model to predict 3D Gaussian scenes from single images, export `.ply` files, and optionally render camera-trajectory videos.

Optimized for local CUDA GPUs (e.g., RTX 4090/3090/3070 Ti) or HuggingFace Spaces GPUs. Includes an MCP server for programmatic access.

## Development Commands

```bash
# Install dependencies (uses uv package manager)
uv sync

# Run the Gradio app (port 49200 by default)
uv run python app.py

# Run MCP server (stdio transport)
uv run python mcp_server.py

# Lint with ruff
uv run ruff check .
uv run ruff format .
```

## Codebase Map

```
ml-sharp/
β”œβ”€β”€ app.py              # Gradio UI (tabs: Run, Examples, About, Settings)
β”‚   β”œβ”€β”€ build_demo()    # Main UI builder
β”‚   β”œβ”€β”€ run_sharp()     # Inference entrypoint called by UI
β”‚   └── discover_examples()  # Load precompiled examples
β”œβ”€β”€ model_utils.py      # Core inference + rendering
β”‚   β”œβ”€β”€ ModelWrapper    # Checkpoint loading, predictor caching
β”‚   β”‚   β”œβ”€β”€ predict_to_ply()   # Image β†’ Gaussians β†’ PLY
β”‚   β”‚   └── render_video()     # Gaussians β†’ MP4 trajectory
β”‚   β”œβ”€β”€ PredictionOutputs      # Dataclass for inference results
β”‚   β”œβ”€β”€ configure_gpu_mode()   # Switch between local/Spaces GPU
β”‚   └── predict_and_maybe_render_gpu  # Module-level entrypoint
β”œβ”€β”€ hardware_config.py  # GPU hardware selection & persistence
β”‚   β”œβ”€β”€ HardwareConfig  # Dataclass with mode, hardware, duration
β”‚   β”œβ”€β”€ get_hardware_choices()  # Dropdown options
β”‚   └── SPACES_HARDWARE_SPECS   # HF Spaces GPU specs & pricing
β”œβ”€β”€ mcp_server.py       # MCP server for programmatic access
β”‚   β”œβ”€β”€ sharp_predict   # Tool: image β†’ PLY + video
β”‚   β”œβ”€β”€ list_outputs    # Tool: list generated files
β”‚   └── sharp://info    # Resource: GPU status, config
β”œβ”€β”€ assets/examples/    # Precompiled example outputs
β”œβ”€β”€ outputs/            # Runtime outputs (PLY, MP4)
β”œβ”€β”€ .hardware_config.json  # Persisted hardware settings
β”œβ”€β”€ pyproject.toml      # Dependencies (uv)
└── WARP.md             # This file
```

### Data Flow

```
Image β†’ load_rgb() β†’ predict_image() β†’ Gaussians3D β†’ save_ply() β†’ PLY
                                              ↓
                                      render_video() β†’ MP4
```

## Architecture

### Core Files

- `app.py` β€” Gradio UI with tabs for Run/Examples/About/Settings. Handles example discovery from `assets/examples/` via manifest.json or filename conventions.
- `model_utils.py` β€” SHARP model wrapper with checkpoint loading (HF Hub β†’ CDN fallback), inference via `predict_to_ply()`, and CUDA video rendering via `render_video()`.
- `hardware_config.py` β€” GPU hardware selection between local CUDA and HuggingFace Spaces. Persists to `.hardware_config.json`.
- `mcp_server.py` β€” MCP server exposing `sharp_predict` tool and `sharp://info` resource.

### Key Patterns

**Local CUDA mode**: Model kept on GPU by default (`SHARP_KEEP_MODEL_ON_DEVICE=1`) for better performance on dedicated GPUs.

**Spaces GPU mode**: Uses the `@spaces.GPU` decorator for dynamic GPU allocation on HuggingFace Spaces. Configurable via the Settings tab. The wrapping looks roughly like the sketch below.
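
A minimal sketch, assuming the module-level entrypoint from `model_utils.py` is the decorated function (the `duration` value is illustrative):

```python
import spaces  # the `spaces` package is available on HuggingFace Spaces

@spaces.GPU(duration=120)  # seconds of GPU time reserved per call (illustrative)
def predict_and_maybe_render_gpu(image_path: str, render_video: bool = True):
    ...
```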

**Checkpoint resolution order** (see the sketch after this list):
1. `SHARP_CHECKPOINT_PATH` env var
2. HF Hub cache
3. HF Hub download
4. Upstream CDN via `torch.hub`
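
A hedged sketch of that chain (the helper name and CDN URL are placeholders; the real logic lives in `model_utils.py`):

```python
import os
from pathlib import Path

import torch
from huggingface_hub import hf_hub_download

def resolve_checkpoint(repo_id: str, filename: str) -> Path:
    # 1. An explicit override always wins.
    override = os.environ.get("SHARP_CHECKPOINT_PATH")
    if override:
        return Path(override)
    # 2 + 3. hf_hub_download returns the cached file if present,
    # otherwise it downloads from the Hub.
    try:
        return Path(hf_hub_download(repo_id=repo_id, filename=filename))
    except Exception:
        # 4. Last resort: the upstream CDN via torch.hub utilities.
        dst = Path(torch.hub.get_dir()) / "checkpoints" / filename
        dst.parent.mkdir(parents=True, exist_ok=True)
        torch.hub.download_url_to_file("https://<upstream-cdn>/" + filename, str(dst))
        return dst
```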

**Video rendering**: Requires CUDA (gsplat). Falls back gracefully on CPU-only systems by returning `None` for the video path (guard sketched below).
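
A minimal guard, assuming a thin wrapper around `ModelWrapper.render_video()` (names and keyword pass-through are illustrative):

```python
import torch

def maybe_render_video(wrapper, gaussians, **kwargs) -> str | None:
    if not torch.cuda.is_available():
        return None  # gsplat needs CUDA; the UI then shows only the PLY
    return wrapper.render_video(gaussians, **kwargs)
```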

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `SHARP_PORT` | `49200` | Gradio server port |
| `SHARP_MCP_PORT` | `49201` | MCP server port |
| `SHARP_CHECKPOINT_PATH` | β€” | Override local checkpoint path |
| `SHARP_HF_REPO_ID` | `apple/Sharp` | HuggingFace repo |
| `SHARP_HF_FILENAME` | `sharp_2572gikvuh.pt` | Checkpoint filename |
| `SHARP_KEEP_MODEL_ON_DEVICE` | `1` | Keep model on GPU (set `0` to free VRAM) |
| `CUDA_VISIBLE_DEVICES` | β€” | GPU selection (e.g., `0` or `0,1`) |

## Gradio API

API is enabled by default. Access at `http://localhost:49200/?view=api`.

### Endpoint: `/run_sharp`

The example below uses the official `gradio_client` package, which handles the underlying HTTP protocol across Gradio versions (in Gradio 4+ the raw REST interface is a two-step call, so a plain `requests.post` to `/api/run_sharp` no longer works). Argument order matches `run_sharp()` in `app.py`.

```python
from gradio_client import Client, handle_file

client = Client("http://localhost:49200/")
video_path, ply_path, status = client.predict(
    handle_file("/path/to/image.jpg"),  # image_path (plain str if the input is a textbox)
    "rotate_forward",                   # trajectory_type
    0,                                  # output_long_side (0 = match input)
    60,                                 # num_frames
    30,                                 # fps
    True,                               # render_video
    api_name="/run_sharp",
)
```

## MCP Server

Run the MCP server for integration with AI agents:

```bash
uv run python mcp_server.py
```

### MCP Config (for clients like Warp)

```json
{
  "mcpServers": {
    "sharp": {
      "command": "uv",
      "args": ["run", "python", "mcp_server.py"],
      "cwd": "/home/robin/CascadeProjects/ml-sharp"
    }
  }
}
```

### Tools

- `sharp_predict(image_path, render_video=True, trajectory_type="rotate_forward", ...)` β€” Run inference
- `list_outputs()` β€” List generated PLY/MP4 files

### Resources

- `sharp://info` β€” GPU status, configuration
- `sharp://help` β€” Usage documentation

## Multi-GPU Configuration

Select GPU via environment variable:

```bash
# Use GPU 0 (e.g., 4090)
CUDA_VISIBLE_DEVICES=0 uv run python app.py

# Use GPU 1 (e.g., 3090)
CUDA_VISIBLE_DEVICES=1 uv run python app.py
```

## HuggingFace Spaces GPU

The app supports HuggingFace Spaces paid GPUs for faster inference or larger models. Configure via the **Settings** tab.

### Available Hardware

| Hardware | VRAM | Price/hr | Best For |
|----------|------|----------|----------|
| ZeroGPU (H200) | 70GB | Free (PRO) | Demos, dynamic allocation |
| T4 small | 16GB | $0.40 | Light workloads |
| T4 medium | 16GB | $0.60 | Standard workloads |
| L4x1 | 24GB | $0.80 | Standard inference |
| L4x4 | 96GB | $3.80 | Multi-GPU |
| L40Sx1 | 48GB | $1.80 | Large models |
| L40Sx4 | 192GB | $8.30 | Very large models |
| A10G small | 24GB | $1.00 | Balanced |
| A10G large | 24GB | $1.50 | More CPU/RAM |
| A100 large | 80GB | $2.50 | Maximum VRAM |

### Deploying to Spaces

1. Push to HuggingFace Space
2. Set hardware in Space settings (or use `suggested_hardware` in README.md)
3. The app auto-detects the Spaces environment via the `SPACE_ID` env var (sketch below)
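
Detection can be as simple as the following (the `ON_SPACES` name is illustrative):

```python
import os

ON_SPACES = os.environ.get("SPACE_ID") is not None  # Spaces sets SPACE_ID automatically
```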

### README.md Metadata for Spaces

```yaml
---
title: SHARP - 3D Gaussian Scene Prediction
emoji: πŸ”ͺ
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 6.2.0
python_version: 3.13.11
app_file: app.py
suggested_hardware: l4x1  # or zero-gpu, a100-large, etc.
startup_duration_timeout: 1h
preload_from_hub:
- apple/Sharp sharp_2572gikvuh.pt
---
```

## Examples System

Place precompiled outputs in `assets/examples/` (discovery sketch below):
- `<name>.{jpg,png,webp}` + `<name>.mp4` + `<name>.ply`
- Or define `assets/examples/manifest.json` with `{label, image, video, ply}` entries
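
The filename-convention path could be discovered roughly like this (a hedged sketch; the real `discover_examples()` in `app.py` also reads `manifest.json` and may differ in detail):

```python
from pathlib import Path

def discover_examples(root: Path = Path("assets/examples")) -> list[dict]:
    """Pair each image with its same-stem .mp4/.ply outputs."""
    examples = []
    for img in sorted(root.iterdir()):
        if img.suffix.lower() not in {".jpg", ".png", ".webp"}:
            continue
        video, ply = img.with_suffix(".mp4"), img.with_suffix(".ply")
        if video.exists() and ply.exists():
            examples.append(
                {"label": img.stem, "image": img, "video": video, "ply": ply}
            )
    return examples
```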

## Multi-Image Stacking Roadmap

SHARP predicts 3D Gaussians from a single image. To "stack" multiple images into a unified scene:

### Required Components

1. **Pose Estimation** (`multi_view.py`)
   - Estimate relative camera poses between images
   - Options: COLMAP, hloc, or PnP-based
   - Transform each prediction to common world frame

2. **Gaussian Merging** (`gaussian_merge.py`)
   - Concatenate Gaussian parameters (means, covariances, colors, opacities)
   - Deduplicate overlapping regions via density-based filtering
   - Optional: fine-tune merged scene with photometric loss

3. **UI Changes**
   - Multi-upload widget
   - Alignment preview/validation
   - Progress indicator for multi-image processing

### Data Structures

```python
from dataclasses import dataclass
from pathlib import Path

import torch

# Gaussians3D is the scene container from the upstream SHARP package.

@dataclass
class AlignedGaussians:
    gaussians: Gaussians3D
    world_transform: torch.Tensor  # 4x4 SE(3)
    source_image: Path

def merge_gaussians(aligned: list[AlignedGaussians]) -> Gaussians3D:
    # 1. Transform each Gaussian's means by world_transform
    # 2. Concatenate all parameters
    # 3. Density-based pruning in overlapping regions
    ...
```
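
A minimal sketch of steps 1 and 2 above, assuming `Gaussians3D` exposes a `means` tensor of shape (N, 3) (the real field names in the SHARP package may differ); composing rotations/covariances is noted but elided:

```python
import torch

def transform_means(means: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    """Apply a 4x4 SE(3) transform to (N, 3) Gaussian centers."""
    ones = torch.ones(means.shape[0], 1, dtype=means.dtype, device=means.device)
    homogeneous = torch.cat([means, ones], dim=1)  # (N, 4)
    return (homogeneous @ T.T)[:, :3]

def merge_means(aligned: list[AlignedGaussians]) -> torch.Tensor:
    # Step 1: move every prediction into the shared world frame.
    world_means = [
        transform_means(a.gaussians.means, a.world_transform) for a in aligned
    ]
    # Step 2: naive concatenation (Phase 1 -- no deduplication yet).
    # NOTE: quaternions/covariances must also be rotated by the pose's
    # rotation block; that step is omitted here.
    return torch.cat(world_means, dim=0)
```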

### Dependencies to Add

- `pycolmap` or `hloc` for pose estimation
- `open3d` for point cloud operations (optional)

### Implementation Phases

#### Phase 1: Basic Multi-Image Pipeline
- [ ] Add `multi_view.py` with `estimate_relative_pose(img1, img2)` using feature matching
- [ ] Add `gaussian_merge.py` with naive concatenation (no dedup)
- [ ] UI: Multi-file upload in new "Stack" tab
- [ ] Export merged PLY

#### Phase 2: Pose Estimation Options
- [ ] Integrate COLMAP sparse reconstruction for >2 images
- [ ] Add hloc (Hierarchical Localization) as lightweight alternative
- [ ] Fallback: manual pose input for known camera rigs

#### Phase 3: Gaussian Deduplication
- [ ] Implement KD-tree-based nearest-neighbor pruning (see the sketch after this list)
- [ ] Merge overlapping Gaussians by averaging parameters
- [ ] Add confidence weighting based on view angle
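
A possible starting point for the pruning item, using SciPy's `cKDTree` (a library choice assumed here, not mandated by the roadmap); the greedy keep-first policy is illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree

def prune_duplicates(means: np.ndarray, dedup_radius: float = 0.01) -> np.ndarray:
    """Return indices of Gaussians to keep, dropping near-coincident centers."""
    tree = cKDTree(means)
    keep = np.ones(len(means), dtype=bool)
    for i, neighbors in enumerate(tree.query_ball_point(means, r=dedup_radius)):
        if not keep[i]:
            continue  # already pruned by an earlier Gaussian
        for j in neighbors:
            if j != i:
                keep[j] = False  # greedy: the first-seen Gaussian wins
    return np.flatnonzero(keep)
```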

#### Phase 4: Refinement (Optional)
- [ ] Photometric loss optimization on merged scene
- [ ] Iterative alignment refinement
- [ ] Support for depth priors from stereo/MVS

### API Design

```python
from pathlib import Path
from typing import Literal

import numpy as np

# PredictionOutputs comes from model_utils; Gaussians3D from the SHARP package.
# multi_view.py
def estimate_poses(
    images: list[Path],
    method: Literal["colmap", "hloc", "pnp"] = "hloc",
) -> list[np.ndarray]:  # List of 4x4 world-to-camera transforms
    ...

# gaussian_merge.py
def merge_scenes(
    predictions: list[PredictionOutputs],
    poses: list[np.ndarray],
    deduplicate: bool = True,
    dedup_radius: float = 0.01,  # meters
) -> Gaussians3D:
    ...

# app.py (Stack tab)
def run_stack(
    images: list[str],  # Gradio multi-file upload
    pose_method: str,
    deduplicate: bool,
) -> tuple[str | None, str | None, str]:  # video, ply, status
    ...
```
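
How these pieces could compose end to end (hypothetical glue code; `predict_scene` is a stand-in for whatever produces a `PredictionOutputs` per image):

```python
from pathlib import Path

image_paths = [Path("a.jpg"), Path("b.jpg")]
predictions = [predict_scene(p) for p in image_paths]  # hypothetical per-image call
poses = estimate_poses(image_paths, method="hloc")     # world-to-camera, per the API above
merged = merge_scenes(predictions, poses, deduplicate=True, dedup_radius=0.01)
```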

### MCP Extension

```python
# mcp_server.py additions
@mcp.tool()
def sharp_stack(
    image_paths: list[str],
    pose_method: str = "hloc",
    deduplicate: bool = True,
    render_video: bool = True,
) -> dict:
    """Stack multiple images into unified 3D Gaussian scene."""
    ...
```

### Technical Considerations

**Coordinate Systems**:
- SHARP outputs Gaussians in camera-centric coordinates
- Need to transform to world frame using estimated poses
- Convention: Y-up, -Z forward (OpenGL style)

**Memory Management**:
- Each SHARP prediction uses roughly 50-200 MB of GPU memory
- Batch process with the model unloaded between predictions (loop sketch below)
- Consider a streaming merge for >10 images
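
One way the batching could look (hedged: `wrapper.model` as an attribute and moving it to CPU mirror `SHARP_KEEP_MODEL_ON_DEVICE=0`, but are assumptions about `ModelWrapper`):

```python
import gc

import torch

results = []
for path in image_paths:
    results.append(wrapper.predict_to_ply(path))  # existing entrypoint
    wrapper.model.to("cpu")    # assumed unload hook between predictions
    torch.cuda.empty_cache()   # release cached CUDA allocations
gc.collect()
```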

**Quality Metrics**:
- Reprojection error for pose validation
- Gaussian density histogram for coverage analysis
- Visual comparison with ground truth (if available)