
WARP.md

This file provides guidance to WARP (warp.dev) when working with code in this repository.

Project Overview

SHARP (Single-image 3D Gaussian scene prediction) Gradio demo. Wraps Apple's SHARP model to predict 3D Gaussian scenes from single images, export .ply files, and optionally render camera trajectory videos.

Optimized for local CUDA (4090/3090/3070ti) or HuggingFace Spaces GPU. Includes MCP server for programmatic access.

Development Commands

# Install dependencies (uses uv package manager)
uv sync

# Run the Gradio app (port 49200 by default)
uv run python app.py

# Run MCP server (stdio transport)
uv run python mcp_server.py

# Lint with ruff
uv run ruff check .
uv run ruff format .

Codebase Map

ml-sharp/
├── app.py              # Gradio UI (tabs: Run, Examples, About, Settings)
│   ├── build_demo()    # Main UI builder
│   ├── run_sharp()     # Inference entrypoint called by UI
│   └── discover_examples()  # Load precompiled examples
├── model_utils.py      # Core inference + rendering
│   ├── ModelWrapper    # Checkpoint loading, predictor caching
│   │   ├── predict_to_ply()   # Image → Gaussians → PLY
│   │   └── render_video()     # Gaussians → MP4 trajectory
│   ├── PredictionOutputs      # Dataclass for inference results
│   ├── configure_gpu_mode()   # Switch between local/Spaces GPU
│   └── predict_and_maybe_render_gpu  # Module-level entrypoint
├── hardware_config.py  # GPU hardware selection & persistence
│   ├── HardwareConfig  # Dataclass with mode, hardware, duration
│   ├── get_hardware_choices()  # Dropdown options
│   └── SPACES_HARDWARE_SPECS   # HF Spaces GPU specs & pricing
├── mcp_server.py       # MCP server for programmatic access
│   ├── sharp_predict   # Tool: image → PLY + video
│   ├── list_outputs    # Tool: list generated files
│   └── sharp://info    # Resource: GPU status, config
├── assets/examples/    # Precompiled example outputs
├── outputs/            # Runtime outputs (PLY, MP4)
├── .hardware_config.json  # Persisted hardware settings
├── pyproject.toml      # Dependencies (uv)
└── WARP.md             # This file

Data Flow

Image → load_rgb() → predict_image() → Gaussians3D → save_ply() → PLY
                                              ↓
                                      render_video() → MP4
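
A minimal sketch of driving this flow from Python, using the names from the codebase map above. The argument lists are assumptions; check model_utils.py for the actual signatures.

from pathlib import Path

from model_utils import ModelWrapper

wrapper = ModelWrapper()  # resolves and caches the checkpoint on first use

# Image → Gaussians3D → PLY (argument names/order are illustrative assumptions)
outputs = wrapper.predict_to_ply(Path("input.jpg"), Path("outputs/scene.ply"))

# Gaussians3D → MP4 trajectory; requires CUDA, returns None on CPU-only systems
video_path = wrapper.render_video(outputs, Path("outputs/scene.mp4"))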

Architecture

Core Files

  • app.py – Gradio UI with tabs for Run/Examples/About/Settings. Handles example discovery from assets/examples/ via manifest.json or filename conventions.
  • model_utils.py – SHARP model wrapper with checkpoint loading (HF Hub → CDN fallback), inference via predict_to_ply(), and CUDA video rendering via render_video().
  • hardware_config.py – GPU hardware selection between local CUDA and HuggingFace Spaces. Persists to .hardware_config.json.
  • mcp_server.py – MCP server exposing sharp_predict tool and sharp://info resource.

Key Patterns

Local CUDA mode: Model kept on GPU by default (SHARP_KEEP_MODEL_ON_DEVICE=1) for better performance on dedicated GPUs.

Spaces GPU mode: Uses @spaces.GPU decorator for dynamic GPU allocation on HuggingFace Spaces. Configurable via Settings tab.
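
For reference, the Spaces pattern looks roughly like the sketch below. The decorator comes from the HuggingFace spaces package; the function itself is illustrative, while the repo's real GPU entrypoint is predict_and_maybe_render_gpu in model_utils.py.

import spaces  # only meaningful when running on HuggingFace Spaces

@spaces.GPU(duration=120)  # seconds of GPU time requested per call
def predict_on_spaces(image_path: str) -> str:
    # Illustrative body; on Spaces the decorated call runs on a dynamically
    # allocated GPU and releases it when the call returns.
    ...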

Checkpoint resolution order (a hedged sketch follows the list):

  1. SHARP_CHECKPOINT_PATH env var
  2. HF Hub cache
  3. HF Hub download
  4. Upstream CDN via torch.hub
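
A sketch of this order; hf_hub_download and torch.hub are real APIs, but the actual logic lives in model_utils.py and may differ.

import os

from huggingface_hub import hf_hub_download

def resolve_checkpoint() -> str:
    # 1. Explicit override via env var
    override = os.environ.get("SHARP_CHECKPOINT_PATH")
    if override:
        return override
    # 2./3. HF Hub cache, then download (hf_hub_download returns the cached
    #       file if present, otherwise downloads it)
    try:
        return hf_hub_download(
            repo_id=os.environ.get("SHARP_HF_REPO_ID", "apple/Sharp"),
            filename=os.environ.get("SHARP_HF_FILENAME", "sharp_2572gikvuh.pt"),
        )
    except Exception:
        pass
    # 4. Upstream CDN fallback via torch.hub.load_state_dict_from_url; the
    #    actual URL lives in model_utils.py, so it is only noted here.
    raise FileNotFoundError("checkpoint not found; see model_utils.py for the CDN fallback")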

Video rendering: Requires CUDA (gsplat). Falls back gracefully on CPU-only systems by returning None for video path.

Environment Variables

| Variable | Default | Description |
|---|---|---|
| SHARP_PORT | 49200 | Gradio server port |
| SHARP_MCP_PORT | 49201 | MCP server port |
| SHARP_CHECKPOINT_PATH | – | Override local checkpoint path |
| SHARP_HF_REPO_ID | apple/Sharp | HuggingFace repo |
| SHARP_HF_FILENAME | sharp_2572gikvuh.pt | Checkpoint filename |
| SHARP_KEEP_MODEL_ON_DEVICE | 1 | Keep model on GPU (set 0 to free VRAM) |
| CUDA_VISIBLE_DEVICES | – | GPU selection (e.g., 0 or 0,1) |

Gradio API

API is enabled by default. Access at http://localhost:49200/?view=api.

Endpoint: /api/run_sharp

import requests

response = requests.post(
    "http://localhost:49200/api/run_sharp",
    json={
        "data": [
            "/path/to/image.jpg",  # image_path
            "rotate_forward",       # trajectory_type
            0,                       # output_long_side (0 = match input)
            60,                      # num_frames
            30,                      # fps
            True,                    # render_video
        ]
    }
)
result = response.json()["data"]
video_path, ply_path, status = result
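
Alternatively, the gradio_client package handles upload and polling. The api_name below assumes the endpoint name shown above; handle_file wraps local files for upload (pass a plain string instead if the input is a text path).

from gradio_client import Client, handle_file

client = Client("http://localhost:49200")
video_path, ply_path, status = client.predict(
    handle_file("/path/to/image.jpg"),  # image_path
    "rotate_forward",                   # trajectory_type
    0,                                  # output_long_side (0 = match input)
    60,                                 # num_frames
    30,                                 # fps
    True,                               # render_video
    api_name="/run_sharp",
)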

MCP Server

Run the MCP server for integration with AI agents:

uv run python mcp_server.py

MCP Config (for clients like Warp)

{
  "mcpServers": {
    "sharp": {
      "command": "uv",
      "args": ["run", "python", "mcp_server.py"],
      "cwd": "/home/robin/CascadeProjects/ml-sharp"
    }
  }
}

Tools

  • sharp_predict(image_path, render_video=True, trajectory_type="rotate_forward", ...) – Run inference
  • list_outputs() – List generated PLY/MP4 files

Resources

  • sharp://info – GPU status, configuration
  • sharp://help – Usage documentation
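
A sketch of calling the sharp_predict tool from Python with the official mcp client package (working directory and arguments are illustrative):

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    server = StdioServerParameters(command="uv", args=["run", "python", "mcp_server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "sharp_predict",
                {"image_path": "/path/to/image.jpg", "render_video": True},
            )
            print(result)

asyncio.run(main())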

Multi-GPU Configuration

Select GPU via environment variable:

# Use GPU 0 (e.g., 4090)
CUDA_VISIBLE_DEVICES=0 uv run python app.py

# Use GPU 1 (e.g., 3090)
CUDA_VISIBLE_DEVICES=1 uv run python app.py

HuggingFace Spaces GPU

The app supports HuggingFace Spaces paid GPUs for faster inference or larger models. Configure via the Settings tab.

Available Hardware

| Hardware | VRAM | Price/hr | Best For |
|---|---|---|---|
| ZeroGPU (H200) | 70GB | Free (PRO) | Demos, dynamic allocation |
| T4 small | 16GB | $0.40 | Light workloads |
| T4 medium | 16GB | $0.60 | Standard workloads |
| L4x1 | 24GB | $0.80 | Standard inference |
| L4x4 | 96GB | $3.80 | Multi-GPU |
| L40Sx1 | 48GB | $1.80 | Large models |
| L40Sx4 | 192GB | $8.30 | Very large models |
| A10G small | 24GB | $1.00 | Balanced |
| A10G large | 24GB | $1.50 | More CPU/RAM |
| A100 large | 80GB | $2.50 | Maximum VRAM |

Deploying to Spaces

  1. Push to HuggingFace Space
  2. Set hardware in Space settings (or use suggested_hardware in README.md)
  3. The app auto-detects Spaces environment via SPACE_ID env var

README.md Metadata for Spaces

---
title: SHARP - 3D Gaussian Scene Prediction
emoji: 🔪
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 6.2.0
python_version: 3.13.11
app_file: app.py
suggested_hardware: l4x1  # or zero-gpu, a100-large, etc.
startup_duration_timeout: 1h
preload_from_hub:
- apple/Sharp sharp_2572gikvuh.pt
---

Examples System

Place precompiled outputs in assets/examples/:

  • <name>.{jpg,png,webp} + <name>.mp4 + <name>.ply
  • Or define assets/examples/manifest.json with {label, image, video, ply} entries
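
A hypothetical manifest.json entry following the convention above (filenames are placeholders; whether entries sit in a top-level list or under a key is defined by discover_examples() in app.py):

[
  {
    "label": "Living room",
    "image": "living_room.jpg",
    "video": "living_room.mp4",
    "ply": "living_room.ply"
  }
]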

Multi-Image Stacking Roadmap

SHARP predicts 3D Gaussians from a single image. To "stack" multiple images into a unified scene:

Required Components

  1. Pose Estimation (multi_view.py)

    • Estimate relative camera poses between images
    • Options: COLMAP, hloc, or PnP-based
    • Transform each prediction to common world frame
  2. Gaussian Merging (gaussian_merge.py)

    • Concatenate Gaussian parameters (means, covariances, colors, opacities)
    • Deduplicate overlapping regions via density-based filtering
    • Optional: fine-tune merged scene with photometric loss
  3. UI Changes

    • Multi-upload widget
    • Alignment preview/validation
    • Progress indicator for multi-image processing

Data Structures

from dataclasses import dataclass
from pathlib import Path

import torch

# Gaussians3D is provided by the SHARP package; the exact import path depends
# on the upstream library and is omitted here.

@dataclass
class AlignedGaussians:
    gaussians: Gaussians3D
    world_transform: torch.Tensor  # 4x4 SE(3), camera-to-world
    source_image: Path

def merge_gaussians(aligned: list[AlignedGaussians]) -> Gaussians3D:
    # 1. Transform each Gaussian's means by world_transform
    # 2. Concatenate all parameters
    # 3. Density-based pruning in overlapping regions
    ...
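
A sketch of the naive concatenation described in Phase 1 below, assuming Gaussians3D exposes per-splat tensors named means, scales, rotations, colors, and opacities (these field names are assumptions, not the upstream API):

import torch

def naive_merge(aligned: list[AlignedGaussians]) -> dict[str, torch.Tensor]:
    """Transform means into the world frame and concatenate everything else.

    Returns a plain dict of tensors; wrapping it back into Gaussians3D depends
    on the upstream constructor and is omitted here.
    """
    means = []
    rest: dict[str, list[torch.Tensor]] = {"scales": [], "rotations": [], "colors": [], "opacities": []}
    for item in aligned:
        g = item.gaussians
        T = item.world_transform  # 4x4 camera-to-world
        pts = g.means  # (N, 3); field name is an assumption
        pts_h = torch.cat([pts, torch.ones_like(pts[:, :1])], dim=1)  # (N, 4) homogeneous
        means.append((pts_h @ T.T)[:, :3])
        for key in rest:
            rest[key].append(getattr(g, key))
    # NOTE: a correct merge would also rotate rotations/covariances by T's
    # rotation; this naive version only moves the means.
    merged = {k: torch.cat(v, dim=0) for k, v in rest.items()}
    merged["means"] = torch.cat(means, dim=0)
    return merged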

Dependencies to Add

  • pycolmap or hloc for pose estimation
  • open3d for point cloud operations (optional)

Implementation Phases

Phase 1: Basic Multi-Image Pipeline

  • Add multi_view.py with estimate_relative_pose(img1, img2) using feature matching
  • Add gaussian_merge.py with naive concatenation (no dedup)
  • UI: Multi-file upload in new "Stack" tab
  • Export merged PLY

Phase 2: Pose Estimation Options

  • Integrate COLMAP sparse reconstruction for >2 images
  • Add hloc (Hierarchical Localization) as lightweight alternative
  • Fallback: manual pose input for known camera rigs

Phase 3: Gaussian Deduplication

  • Implement KD-tree based nearest-neighbor pruning (see the sketch after this list)
  • Merge overlapping Gaussians by averaging parameters
  • Add confidence weighting based on view angle
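
A sketch of the KD-tree pruning idea using scipy.spatial.cKDTree (scipy would be a new dependency; dedup_radius mirrors the merge_scenes parameter in the API design below):

import numpy as np
from scipy.spatial import cKDTree

def prune_duplicates(means: np.ndarray, dedup_radius: float = 0.01) -> np.ndarray:
    """Return indices of Gaussians to keep, dropping near-duplicates.

    Greedy pass: a Gaussian is dropped if an earlier, kept Gaussian lies
    within dedup_radius of its mean.
    """
    tree = cKDTree(means)
    keep = np.ones(len(means), dtype=bool)
    for i, neighbors in enumerate(tree.query_ball_point(means, r=dedup_radius)):
        if not keep[i]:
            continue
        for j in neighbors:
            if j > i and keep[j]:
                keep[j] = False  # later duplicate within the radius
    return np.flatnonzero(keep)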

Phase 4: Refinement (Optional)

  • Photometric loss optimization on merged scene
  • Iterative alignment refinement
  • Support for depth priors from stereo/MVS

API Design

# multi_view.py
from pathlib import Path
from typing import Literal

import numpy as np

def estimate_poses(
    images: list[Path],
    method: Literal["colmap", "hloc", "pnp"] = "hloc",
) -> list[np.ndarray]:  # List of 4x4 world-to-camera transforms
    ...

# gaussian_merge.py
def merge_scenes(
    predictions: list[PredictionOutputs],
    poses: list[np.ndarray],
    deduplicate: bool = True,
    dedup_radius: float = 0.01,  # meters
) -> Gaussians3D:
    ...

# app.py (Stack tab)
def run_stack(
    images: list[str],  # Gradio multi-file upload
    pose_method: str,
    deduplicate: bool,
) -> tuple[str | None, str | None, str]:  # video, ply, status
    ...

MCP Extension

# mcp_server.py additions
@mcp.tool()
def sharp_stack(
    image_paths: list[str],
    pose_method: str = "hloc",
    deduplicate: bool = True,
    render_video: bool = True,
) -> dict:
    """Stack multiple images into unified 3D Gaussian scene."""
    ...

Technical Considerations

Coordinate Systems:

  • SHARP outputs Gaussians in camera-centric coordinates
  • Need to transform to world frame using estimated poses (see the sketch after this list)
  • Convention: Y-up, -Z forward (OpenGL style)
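
If poses come from COLMAP/OpenCV-style tooling (+Z forward, Y down), converting a camera-to-world pose to the OpenGL-style convention above amounts to flipping the camera's Y and Z axes; a small sketch:

import numpy as np

# Flips Y and Z of the camera frame: OpenCV-style (+Z forward, Y down)
# <-> OpenGL-style (-Z forward, Y up).
CV_TO_GL = np.diag([1.0, -1.0, -1.0, 1.0])

def opencv_pose_to_opengl(T_cam_to_world: np.ndarray) -> np.ndarray:
    """Convert a 4x4 camera-to-world pose from the OpenCV to the OpenGL convention."""
    return T_cam_to_world @ CV_TO_GL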

Memory Management:

  • Each SHARP prediction ~50-200MB GPU memory
  • Batch processing with model unload between predictions (see the sketch after this list)
  • Consider streaming merge for >10 images
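
A hedged sketch of the unload pattern; the wrapper method and its signature are assumptions, and setting SHARP_KEEP_MODEL_ON_DEVICE=0 already makes the existing code free the model after each call:

import gc
from pathlib import Path

import torch

def predict_batch(wrapper, image_paths: list[Path], out_dir: Path):
    """Run predictions one at a time, releasing GPU memory between images."""
    results = []
    for path in image_paths:
        # predict_to_ply signature is an illustrative assumption
        results.append(wrapper.predict_to_ply(path, out_dir / f"{path.stem}.ply"))
        gc.collect()
        torch.cuda.empty_cache()  # return unused cached blocks to the driver
    return results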

Quality Metrics:

  • Reprojection error for pose validation (see the sketch after this list)
  • Gaussian density histogram for coverage analysis
  • Visual comparison with ground truth (if available)
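
For reference, mean reprojection error for one view can be computed with a standard pinhole projection (variable names are illustrative):

import numpy as np

def reprojection_error(points_w: np.ndarray, K: np.ndarray, T_wc: np.ndarray, pixels: np.ndarray) -> float:
    """Mean pixel error of world points projected into one view.

    points_w: (N, 3) world points, K: (3, 3) intrinsics,
    T_wc: (4, 4) world-to-camera transform, pixels: (N, 2) observed pixels.
    """
    pts_h = np.concatenate([points_w, np.ones((len(points_w), 1))], axis=1)  # homogeneous
    pts_c = (T_wc @ pts_h.T).T[:, :3]          # camera-frame points
    proj = (K @ pts_c.T).T                     # perspective projection
    uv = proj[:, :2] / proj[:, 2:3]            # normalize by depth
    return float(np.linalg.norm(uv - pixels, axis=1).mean())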