
WARP.md

This file provides guidance to WARP (warp.dev) when working with code in this repository.

Project Overview

SHARP (Single-image 3D Gaussian scene prediction) Gradio demo. Wraps Apple's SHARP model to predict 3D Gaussian scenes from single images, export .ply files, and optionally render camera trajectory videos.

Optimized for local CUDA (4090/3090/3070ti) or HuggingFace Spaces GPU. Includes MCP server for programmatic access.

Development Commands

# Install dependencies (uses uv package manager)
uv sync

# Run the Gradio app (port 49200 by default)
uv run python app.py

# Run MCP server (stdio transport)
uv run python mcp_server.py

# Lint with ruff
uv run ruff check .
uv run ruff format .

Codebase Map

ml-sharp/
├── app.py              # Gradio UI (tabs: Run, Examples, About, Settings)
│   ├── build_demo()    # Main UI builder
│   ├── run_sharp()     # Inference entrypoint called by UI
│   └── discover_examples()  # Load precompiled examples
├── model_utils.py      # Core inference + rendering
│   ├── ModelWrapper    # Checkpoint loading, predictor caching
│   │   ├── predict_to_ply()   # Image → Gaussians → PLY
│   │   └── render_video()     # Gaussians → MP4 trajectory
│   ├── PredictionOutputs      # Dataclass for inference results
│   ├── configure_gpu_mode()   # Switch between local/Spaces GPU
│   └── predict_and_maybe_render_gpu  # Module-level entrypoint
├── hardware_config.py  # GPU hardware selection & persistence
│   ├── HardwareConfig  # Dataclass with mode, hardware, duration
│   ├── get_hardware_choices()  # Dropdown options
│   └── SPACES_HARDWARE_SPECS   # HF Spaces GPU specs & pricing
├── mcp_server.py       # MCP server for programmatic access
│   ├── sharp_predict   # Tool: image → PLY + video
│   ├── list_outputs    # Tool: list generated files
│   └── sharp://info    # Resource: GPU status, config
├── assets/examples/    # Precompiled example outputs
├── outputs/            # Runtime outputs (PLY, MP4)
├── .hardware_config.json  # Persisted hardware settings
├── pyproject.toml      # Dependencies (uv)
└── WARP.md             # This file

Data Flow

Image → load_rgb() → predict_image() → Gaussians3D → save_ply() → PLY
                                              ↓
                                      render_video() → MP4
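
A minimal sketch of driving this flow from Python, using the names from the codebase map above. The argument lists are assumptions; check model_utils.py for the actual signatures.

from pathlib import Path

from model_utils import ModelWrapper

wrapper = ModelWrapper()  # resolves and caches the checkpoint on first use

# Image → Gaussians3D → PLY (argument names/order are illustrative assumptions)
outputs = wrapper.predict_to_ply(Path("input.jpg"), Path("outputs/scene.ply"))

# Gaussians3D → MP4 trajectory; requires CUDA, returns None on CPU-only systems
video_path = wrapper.render_video(outputs, Path("outputs/scene.mp4"))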

Architecture

Core Files

  • app.py – Gradio UI with tabs for Run/Examples/About/Settings. Handles example discovery from assets/examples/ via manifest.json or filename conventions.
  • model_utils.py – SHARP model wrapper with checkpoint loading (HF Hub → CDN fallback), inference via predict_to_ply(), and CUDA video rendering via render_video().
  • hardware_config.py – GPU hardware selection between local CUDA and HuggingFace Spaces. Persists to .hardware_config.json.
  • mcp_server.py – MCP server exposing sharp_predict tool and sharp://info resource.

Key Patterns

Local CUDA mode: Model kept on GPU by default (SHARP_KEEP_MODEL_ON_DEVICE=1) for better performance on dedicated GPUs.

Spaces GPU mode: Uses @spaces.GPU decorator for dynamic GPU allocation on HuggingFace Spaces. Configurable via Settings tab.
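
For reference, the Spaces pattern looks roughly like the sketch below. The decorator comes from the HuggingFace spaces package; the function itself is illustrative, while the repo's real GPU entrypoint is predict_and_maybe_render_gpu in model_utils.py.

import spaces  # only meaningful when running on HuggingFace Spaces

@spaces.GPU(duration=120)  # seconds of GPU time requested per call
def predict_on_spaces(image_path: str) -> str:
    # Illustrative body; on Spaces the decorated call runs on a dynamically
    # allocated GPU and releases it when the call returns.
    ...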

Checkpoint resolution order (a hedged sketch follows the list):

  1. SHARP_CHECKPOINT_PATH env var
  2. HF Hub cache
  3. HF Hub download
  4. Upstream CDN via torch.hub
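
A sketch of this order; hf_hub_download and torch.hub are real APIs, but the actual logic lives in model_utils.py and may differ.

import os

from huggingface_hub import hf_hub_download

def resolve_checkpoint() -> str:
    # 1. Explicit override via env var
    override = os.environ.get("SHARP_CHECKPOINT_PATH")
    if override:
        return override
    # 2./3. HF Hub cache, then download (hf_hub_download returns the cached
    #       file if present, otherwise downloads it)
    try:
        return hf_hub_download(
            repo_id=os.environ.get("SHARP_HF_REPO_ID", "apple/Sharp"),
            filename=os.environ.get("SHARP_HF_FILENAME", "sharp_2572gikvuh.pt"),
        )
    except Exception:
        pass
    # 4. Upstream CDN fallback via torch.hub.load_state_dict_from_url; the
    #    actual URL lives in model_utils.py, so it is only noted here.
    raise FileNotFoundError("checkpoint not found; see model_utils.py for the CDN fallback")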

Video rendering: Requires CUDA (gsplat). Falls back gracefully on CPU-only systems by returning None for video path.

Environment Variables

| Variable | Default | Description |
|---|---|---|
| SHARP_PORT | 49200 | Gradio server port |
| SHARP_MCP_PORT | 49201 | MCP server port |
| SHARP_CHECKPOINT_PATH | – | Override local checkpoint path |
| SHARP_HF_REPO_ID | apple/Sharp | HuggingFace repo |
| SHARP_HF_FILENAME | sharp_2572gikvuh.pt | Checkpoint filename |
| SHARP_KEEP_MODEL_ON_DEVICE | 1 | Keep model on GPU (set 0 to free VRAM) |
| CUDA_VISIBLE_DEVICES | – | GPU selection (e.g., 0 or 0,1) |

Gradio API

API is enabled by default. Access at http://localhost:49200/?view=api.

Endpoint: /api/run_sharp

import requests

response = requests.post(
    "http://localhost:49200/api/run_sharp",
    json={
        "data": [
            "/path/to/image.jpg",  # image_path
            "rotate_forward",       # trajectory_type
            0,                       # output_long_side (0 = match input)
            60,                      # num_frames
            30,                      # fps
            True,                    # render_video
        ]
    }
)
result = response.json()["data"]
video_path, ply_path, status = result
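
Alternatively, the gradio_client package handles upload and polling. The api_name below assumes the endpoint name shown above; handle_file wraps local files for upload (pass a plain string instead if the input is a text path).

from gradio_client import Client, handle_file

client = Client("http://localhost:49200")
video_path, ply_path, status = client.predict(
    handle_file("/path/to/image.jpg"),  # image_path
    "rotate_forward",                   # trajectory_type
    0,                                  # output_long_side (0 = match input)
    60,                                 # num_frames
    30,                                 # fps
    True,                               # render_video
    api_name="/run_sharp",
)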

MCP Server

Run the MCP server for integration with AI agents:

uv run python mcp_server.py

MCP Config (for clients like Warp)

{
  "mcpServers": {
    "sharp": {
      "command": "uv",
      "args": ["run", "python", "mcp_server.py"],
      "cwd": "/home/robin/CascadeProjects/ml-sharp"
    }
  }
}

Tools

  • sharp_predict(image_path, render_video=True, trajectory_type="rotate_forward", ...) – Run inference
  • list_outputs() – List generated PLY/MP4 files

Resources

  • sharp://info – GPU status, configuration
  • sharp://help – Usage documentation
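
A sketch of calling the sharp_predict tool from Python with the official mcp client package (working directory and arguments are illustrative):

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    server = StdioServerParameters(command="uv", args=["run", "python", "mcp_server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "sharp_predict",
                {"image_path": "/path/to/image.jpg", "render_video": True},
            )
            print(result)

asyncio.run(main())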

Multi-GPU Configuration

Select GPU via environment variable:

# Use GPU 0 (e.g., 4090)
CUDA_VISIBLE_DEVICES=0 uv run python app.py

# Use GPU 1 (e.g., 3090)
CUDA_VISIBLE_DEVICES=1 uv run python app.py

HuggingFace Spaces GPU

The app supports HuggingFace Spaces paid GPUs for faster inference or larger models. Configure via the Settings tab.

Available Hardware

| Hardware | VRAM | Price/hr | Best For |
|---|---|---|---|
| ZeroGPU (H200) | 70GB | Free (PRO) | Demos, dynamic allocation |
| T4 small | 16GB | $0.40 | Light workloads |
| T4 medium | 16GB | $0.60 | Standard workloads |
| L4x1 | 24GB | $0.80 | Standard inference |
| L4x4 | 96GB | $3.80 | Multi-GPU |
| L40Sx1 | 48GB | $1.80 | Large models |
| L40Sx4 | 192GB | $8.30 | Very large models |
| A10G small | 24GB | $1.00 | Balanced |
| A10G large | 24GB | $1.50 | More CPU/RAM |
| A100 large | 80GB | $2.50 | Maximum VRAM |

Deploying to Spaces

  1. Push to HuggingFace Space
  2. Set hardware in Space settings (or use suggested_hardware in README.md)
  3. The app auto-detects Spaces environment via SPACE_ID env var

README.md Metadata for Spaces

---
title: SHARP - 3D Gaussian Scene Prediction
emoji: 🔪
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 6.2.0
python_version: 3.13.11
app_file: app.py
suggested_hardware: l4x1  # or zero-gpu, a100-large, etc.
startup_duration_timeout: 1h
preload_from_hub:
- apple/Sharp sharp_2572gikvuh.pt
---

Examples System

Place precompiled outputs in assets/examples/:

  • <name>.{jpg,png,webp} + <name>.mp4 + <name>.ply
  • Or define assets/examples/manifest.json with {label, image, video, ply} entries
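
A hypothetical manifest.json entry following the convention above (filenames are placeholders; whether entries sit in a top-level list or under a key is defined by discover_examples() in app.py):

[
  {
    "label": "Living room",
    "image": "living_room.jpg",
    "video": "living_room.mp4",
    "ply": "living_room.ply"
  }
]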

Multi-Image Stacking Roadmap

SHARP predicts 3D Gaussians from a single image. To "stack" multiple images into a unified scene:

Required Components

  1. Pose Estimation (multi_view.py)

    • Estimate relative camera poses between images
    • Options: COLMAP, hloc, or PnP-based
    • Transform each prediction to common world frame
  2. Gaussian Merging (gaussian_merge.py)

    • Concatenate Gaussian parameters (means, covariances, colors, opacities)
    • Deduplicate overlapping regions via density-based filtering
    • Optional: fine-tune merged scene with photometric loss
  3. UI Changes

    • Multi-upload widget
    • Alignment preview/validation
    • Progress indicator for multi-image processing

Data Structures

from dataclasses import dataclass
from pathlib import Path

import torch

# Gaussians3D is provided by the SHARP package; the exact import path depends
# on the upstream library and is omitted here.

@dataclass
class AlignedGaussians:
    gaussians: Gaussians3D
    world_transform: torch.Tensor  # 4x4 SE(3), camera-to-world
    source_image: Path

def merge_gaussians(aligned: list[AlignedGaussians]) -> Gaussians3D:
    # 1. Transform each Gaussian's means by world_transform
    # 2. Concatenate all parameters
    # 3. Density-based pruning in overlapping regions
    ...
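
A sketch of the naive concatenation described in Phase 1 below, assuming Gaussians3D exposes per-splat tensors named means, scales, rotations, colors, and opacities (these field names are assumptions, not the upstream API):

import torch

def naive_merge(aligned: list[AlignedGaussians]) -> dict[str, torch.Tensor]:
    """Transform means into the world frame and concatenate everything else.

    Returns a plain dict of tensors; wrapping it back into Gaussians3D depends
    on the upstream constructor and is omitted here.
    """
    means = []
    rest: dict[str, list[torch.Tensor]] = {"scales": [], "rotations": [], "colors": [], "opacities": []}
    for item in aligned:
        g = item.gaussians
        T = item.world_transform  # 4x4 camera-to-world
        pts = g.means  # (N, 3); field name is an assumption
        pts_h = torch.cat([pts, torch.ones_like(pts[:, :1])], dim=1)  # (N, 4) homogeneous
        means.append((pts_h @ T.T)[:, :3])
        for key in rest:
            rest[key].append(getattr(g, key))
    # NOTE: a correct merge would also rotate rotations/covariances by T's
    # rotation; this naive version only moves the means.
    merged = {k: torch.cat(v, dim=0) for k, v in rest.items()}
    merged["means"] = torch.cat(means, dim=0)
    return merged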

Dependencies to Add

  • pycolmap or hloc for pose estimation
  • open3d for point cloud operations (optional)

Implementation Phases

Phase 1: Basic Multi-Image Pipeline

  • Add multi_view.py with estimate_relative_pose(img1, img2) using feature matching
  • Add gaussian_merge.py with naive concatenation (no dedup)
  • UI: Multi-file upload in new "Stack" tab
  • Export merged PLY

Phase 2: Pose Estimation Options

  • Integrate COLMAP sparse reconstruction for >2 images
  • Add hloc (Hierarchical Localization) as lightweight alternative
  • Fallback: manual pose input for known camera rigs

Phase 3: Gaussian Deduplication

  • Implement KD-tree based nearest-neighbor pruning (see the sketch after this list)
  • Merge overlapping Gaussians by averaging parameters
  • Add confidence weighting based on view angle
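
A sketch of the KD-tree pruning idea using scipy.spatial.cKDTree (scipy would be a new dependency; dedup_radius mirrors the merge_scenes parameter in the API design below):

import numpy as np
from scipy.spatial import cKDTree

def prune_duplicates(means: np.ndarray, dedup_radius: float = 0.01) -> np.ndarray:
    """Return indices of Gaussians to keep, dropping near-duplicates.

    Greedy pass: a Gaussian is dropped if an earlier, kept Gaussian lies
    within dedup_radius of its mean.
    """
    tree = cKDTree(means)
    keep = np.ones(len(means), dtype=bool)
    for i, neighbors in enumerate(tree.query_ball_point(means, r=dedup_radius)):
        if not keep[i]:
            continue
        for j in neighbors:
            if j > i and keep[j]:
                keep[j] = False  # later duplicate within the radius
    return np.flatnonzero(keep)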

Phase 4: Refinement (Optional)

  • Photometric loss optimization on merged scene
  • Iterative alignment refinement
  • Support for depth priors from stereo/MVS

API Design

# multi_view.py
from pathlib import Path
from typing import Literal

import numpy as np

def estimate_poses(
    images: list[Path],
    method: Literal["colmap", "hloc", "pnp"] = "hloc",
) -> list[np.ndarray]:  # List of 4x4 world-to-camera transforms
    ...

# gaussian_merge.py
def merge_scenes(
    predictions: list[PredictionOutputs],
    poses: list[np.ndarray],
    deduplicate: bool = True,
    dedup_radius: float = 0.01,  # meters
) -> Gaussians3D:
    ...

# app.py (Stack tab)
def run_stack(
    images: list[str],  # Gradio multi-file upload
    pose_method: str,
    deduplicate: bool,
) -> tuple[str | None, str | None, str]:  # video, ply, status
    ...

MCP Extension

# mcp_server.py additions
@mcp.tool()
def sharp_stack(
    image_paths: list[str],
    pose_method: str = "hloc",
    deduplicate: bool = True,
    render_video: bool = True,
) -> dict:
    """Stack multiple images into unified 3D Gaussian scene."""
    ...

Technical Considerations

Coordinate Systems:

  • SHARP outputs Gaussians in camera-centric coordinates
  • Need to transform to world frame using estimated poses (see the sketch after this list)
  • Convention: Y-up, -Z forward (OpenGL style)
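
If poses come from COLMAP/OpenCV-style tooling (+Z forward, Y down), converting a camera-to-world pose to the OpenGL-style convention above amounts to flipping the camera's Y and Z axes; a small sketch:

import numpy as np

# Flips Y and Z of the camera frame: OpenCV-style (+Z forward, Y down)
# <-> OpenGL-style (-Z forward, Y up).
CV_TO_GL = np.diag([1.0, -1.0, -1.0, 1.0])

def opencv_pose_to_opengl(T_cam_to_world: np.ndarray) -> np.ndarray:
    """Convert a 4x4 camera-to-world pose from the OpenCV to the OpenGL convention."""
    return T_cam_to_world @ CV_TO_GL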

Memory Management:

  • Each SHARP prediction ~50-200MB GPU memory
  • Batch processing with model unload between predictions (see the sketch after this list)
  • Consider streaming merge for >10 images
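
A hedged sketch of the unload pattern; the wrapper method and its signature are assumptions, and setting SHARP_KEEP_MODEL_ON_DEVICE=0 already makes the existing code free the model after each call:

import gc
from pathlib import Path

import torch

def predict_batch(wrapper, image_paths: list[Path], out_dir: Path):
    """Run predictions one at a time, releasing GPU memory between images."""
    results = []
    for path in image_paths:
        # predict_to_ply signature is an illustrative assumption
        results.append(wrapper.predict_to_ply(path, out_dir / f"{path.stem}.ply"))
        gc.collect()
        torch.cuda.empty_cache()  # return unused cached blocks to the driver
    return results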

Quality Metrics:

  • Reprojection error for pose validation (see the sketch after this list)
  • Gaussian density histogram for coverage analysis
  • Visual comparison with ground truth (if available)
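
For reference, mean reprojection error for one view can be computed with a standard pinhole projection (variable names are illustrative):

import numpy as np

def reprojection_error(points_w: np.ndarray, K: np.ndarray, T_wc: np.ndarray, pixels: np.ndarray) -> float:
    """Mean pixel error of world points projected into one view.

    points_w: (N, 3) world points, K: (3, 3) intrinsics,
    T_wc: (4, 4) world-to-camera transform, pixels: (N, 2) observed pixels.
    """
    pts_h = np.concatenate([points_w, np.ones((len(points_w), 1))], axis=1)  # homogeneous
    pts_c = (T_wc @ pts_h.T).T[:, :3]          # camera-frame points
    proj = (K @ pts_c.T).T                     # perspective projection
    uv = proj[:, :2] / proj[:, 2:3]            # normalize by depth
    return float(np.linalg.norm(uv - pixels, axis=1).mean())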