# WARP.md
This file provides guidance to WARP (warp.dev) when working with code in this repository.
## Project Overview
SHARP (Single-image 3D Gaussian scene prediction) Gradio demo. Wraps Apple's SHARP model to predict 3D Gaussian scenes from single images, export .ply files, and optionally render camera trajectory videos.
Optimized for local CUDA (4090/3090/3070ti) or HuggingFace Spaces GPU. Includes MCP server for programmatic access.
## Development Commands

```bash
# Install dependencies (uses uv package manager)
uv sync

# Run the Gradio app (port 49200 by default)
uv run python app.py

# Run MCP server (stdio transport)
uv run python mcp_server.py

# Lint with ruff
uv run ruff check .
uv run ruff format .
```
## Codebase Map

```
ml-sharp/
├── app.py                    # Gradio UI (tabs: Run, Examples, About, Settings)
│   ├── build_demo()          # Main UI builder
│   ├── run_sharp()           # Inference entrypoint called by UI
│   └── discover_examples()   # Load precompiled examples
├── model_utils.py            # Core inference + rendering
│   ├── ModelWrapper          # Checkpoint loading, predictor caching
│   │   ├── predict_to_ply()  # Image → Gaussians → PLY
│   │   └── render_video()    # Gaussians → MP4 trajectory
│   ├── PredictionOutputs     # Dataclass for inference results
│   ├── configure_gpu_mode()  # Switch between local/Spaces GPU
│   └── predict_and_maybe_render_gpu  # Module-level entrypoint
├── hardware_config.py        # GPU hardware selection & persistence
│   ├── HardwareConfig        # Dataclass with mode, hardware, duration
│   ├── get_hardware_choices()   # Dropdown options
│   └── SPACES_HARDWARE_SPECS    # HF Spaces GPU specs & pricing
├── mcp_server.py             # MCP server for programmatic access
│   ├── sharp_predict         # Tool: image → PLY + video
│   ├── list_outputs          # Tool: list generated files
│   └── sharp://info          # Resource: GPU status, config
├── assets/examples/          # Precompiled example outputs
├── outputs/                  # Runtime outputs (PLY, MP4)
├── .hardware_config.json     # Persisted hardware settings
├── pyproject.toml            # Dependencies (uv)
└── WARP.md                   # This file
```
## Data Flow

```
Image → load_rgb() → predict_image() → Gaussians3D → save_ply() → PLY
                                           │
                                           └──→ render_video() → MP4
```
## Architecture

### Core Files

- `app.py` – Gradio UI with tabs for Run/Examples/About/Settings. Handles example discovery from `assets/examples/` via `manifest.json` or filename conventions.
- `model_utils.py` – SHARP model wrapper with checkpoint loading (HF Hub → CDN fallback), inference via `predict_to_ply()`, and CUDA video rendering via `render_video()`.
- `hardware_config.py` – GPU hardware selection between local CUDA and HuggingFace Spaces. Persists to `.hardware_config.json`.
- `mcp_server.py` – MCP server exposing the `sharp_predict` tool and the `sharp://info` resource.
### Key Patterns

**Local CUDA mode:** Model kept on GPU by default (`SHARP_KEEP_MODEL_ON_DEVICE=1`) for better performance on dedicated GPUs.

**Spaces GPU mode:** Uses the `@spaces.GPU` decorator for dynamic GPU allocation on HuggingFace Spaces. Configurable via the Settings tab (see the sketch below).
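A minimal sketch of how the two modes can be selected; the function and variable names here are illustrative, and the real wiring lives in `configure_gpu_mode()` and `predict_and_maybe_render_gpu` in `model_utils.py`:

```python
import os


def _predict_and_maybe_render(image_path: str) -> str:
    """Illustrative stand-in for the real GPU entrypoint in model_utils.py."""
    ...


try:
    import spaces  # only available on HuggingFace Spaces

    # Spaces mode: request a GPU just for the duration of each call.
    predict_entrypoint = spaces.GPU(duration=120)(_predict_and_maybe_render)
except ImportError:
    # Local CUDA mode: no decorator; the model stays resident on the GPU
    # between calls when SHARP_KEEP_MODEL_ON_DEVICE=1 (the default).
    keep_on_device = os.environ.get("SHARP_KEEP_MODEL_ON_DEVICE", "1") == "1"
    predict_entrypoint = _predict_and_maybe_render
```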
**Checkpoint resolution order** (sketched below):
1. `SHARP_CHECKPOINT_PATH` env var
2. HF Hub cache
3. HF Hub download
4. Upstream CDN via `torch.hub`
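A hedged sketch of that resolution order, using the env vars from the table below; the actual logic lives in `model_utils.ModelWrapper`, and the CDN URL is omitted here:

```python
import os
from pathlib import Path

import torch
from huggingface_hub import hf_hub_download


def resolve_checkpoint(cdn_url: str | None = None) -> Path:
    # 1. Explicit override via env var
    override = os.environ.get("SHARP_CHECKPOINT_PATH")
    if override and Path(override).exists():
        return Path(override)

    repo_id = os.environ.get("SHARP_HF_REPO_ID", "apple/Sharp")
    filename = os.environ.get("SHARP_HF_FILENAME", "sharp_2572gikvuh.pt")

    # 2-3. HF Hub cache, then download (hf_hub_download checks the local cache first)
    try:
        return Path(hf_hub_download(repo_id=repo_id, filename=filename))
    except Exception:
        pass

    # 4. Upstream CDN via torch.hub
    if cdn_url:
        dst = Path("checkpoints") / filename
        dst.parent.mkdir(parents=True, exist_ok=True)
        torch.hub.download_url_to_file(cdn_url, str(dst))
        return dst

    raise FileNotFoundError("Could not resolve a SHARP checkpoint")
```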
**Video rendering:** Requires CUDA (gsplat). Falls back gracefully on CPU-only systems by returning `None` for the video path.
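The guard is essentially a capability check, along these lines (illustrative, not the exact code):

```python
import torch


def maybe_render_video(render: bool) -> str | None:
    """Return a path to an MP4, or None when video rendering is unavailable."""
    if not render or not torch.cuda.is_available():
        # gsplat rasterization needs CUDA; CPU-only hosts skip the video step.
        return None
    ...  # render the camera trajectory with gsplat and return the MP4 path
```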
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `SHARP_PORT` | `49200` | Gradio server port |
| `SHARP_MCP_PORT` | `49201` | MCP server port |
| `SHARP_CHECKPOINT_PATH` | – | Override local checkpoint path |
| `SHARP_HF_REPO_ID` | `apple/Sharp` | HuggingFace repo |
| `SHARP_HF_FILENAME` | `sharp_2572gikvuh.pt` | Checkpoint filename |
| `SHARP_KEEP_MODEL_ON_DEVICE` | `1` | Keep model on GPU (set `0` to free VRAM) |
| `CUDA_VISIBLE_DEVICES` | – | GPU selection (e.g., `0` or `0,1`) |
## Gradio API

API is enabled by default. Access at `http://localhost:49200/?view=api`.

Endpoint: `/api/run_sharp`
```python
import requests

response = requests.post(
    "http://localhost:49200/api/run_sharp",
    json={
        "data": [
            "/path/to/image.jpg",  # image_path
            "rotate_forward",      # trajectory_type
            0,                     # output_long_side (0 = match input)
            60,                    # num_frames
            30,                    # fps
            True,                  # render_video
        ]
    },
)
result = response.json()["data"]
video_path, ply_path, status = result
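```

Equivalently, the `gradio_client` package can call the endpoint by name; the `api_name` value below is an assumption derived from the `/api/run_sharp` route shown above, and `handle_file` is used on the assumption that the first input is an image/file component:

```python
from gradio_client import Client, handle_file

client = Client("http://localhost:49200/")
video_path, ply_path, status = client.predict(
    handle_file("/path/to/image.jpg"),  # image_path
    "rotate_forward",                   # trajectory_type
    0,                                  # output_long_side (0 = match input)
    60,                                 # num_frames
    30,                                 # fps
    True,                               # render_video
    api_name="/run_sharp",
)
```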
## MCP Server

Run the MCP server for integration with AI agents:

```bash
uv run python mcp_server.py
```
### MCP Config (for clients like Warp)

```json
{
  "mcpServers": {
    "sharp": {
      "command": "uv",
      "args": ["run", "python", "mcp_server.py"],
      "cwd": "/home/robin/CascadeProjects/ml-sharp"
    }
  }
}
```
### Tools

- `sharp_predict(image_path, render_video=True, trajectory_type="rotate_forward", ...)` – Run inference
- `list_outputs()` – List generated PLY/MP4 files
### Resources

- `sharp://info` – GPU status, configuration
- `sharp://help` – Usage documentation
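For a programmatic (non-Warp) client, the official `mcp` Python SDK can drive the server over stdio. A sketch assuming the SDK's standard stdio client API, run from the repo root:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch mcp_server.py the same way the Warp config above does.
    params = StdioServerParameters(command="uv", args=["run", "python", "mcp_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "sharp_predict",
                arguments={"image_path": "/path/to/image.jpg", "render_video": True},
            )
            info = await session.read_resource("sharp://info")
            print(result, info)


asyncio.run(main())
```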
## Multi-GPU Configuration

Select GPU via environment variable:

```bash
# Use GPU 0 (e.g., 4090)
CUDA_VISIBLE_DEVICES=0 uv run python app.py

# Use GPU 1 (e.g., 3090)
CUDA_VISIBLE_DEVICES=1 uv run python app.py
```
## HuggingFace Spaces GPU
The app supports HuggingFace Spaces paid GPUs for faster inference or larger models. Configure via the Settings tab.
### Available Hardware
| Hardware | VRAM | Price/hr | Best For |
|---|---|---|---|
| ZeroGPU (H200) | 70GB | Free (PRO) | Demos, dynamic allocation |
| T4 small | 16GB | $0.40 | Light workloads |
| T4 medium | 16GB | $0.60 | Standard workloads |
| L4x1 | 24GB | $0.80 | Standard inference |
| L4x4 | 96GB | $3.80 | Multi-GPU |
| L40Sx1 | 48GB | $1.80 | Large models |
| L40Sx4 | 192GB | $8.30 | Very large models |
| A10G small | 24GB | $1.00 | Balanced |
| A10G large | 24GB | $1.50 | More CPU/RAM |
| A100 large | 80GB | $2.50 | Maximum VRAM |
### Deploying to Spaces

1. Push to a HuggingFace Space
2. Set hardware in the Space settings (or use `suggested_hardware` in README.md)
3. The app auto-detects the Spaces environment via the `SPACE_ID` env var (see the snippet below)
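The detection in step 3 is just an environment check, along the lines of:

```python
import os

# HuggingFace Spaces sets SPACE_ID (e.g. "user/space-name") inside the container.
IS_SPACES = "SPACE_ID" in os.environ
```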
### README.md Metadata for Spaces

```yaml
---
title: SHARP - 3D Gaussian Scene Prediction
emoji: πͺ
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 6.2.0
python_version: 3.13.11
app_file: app.py
suggested_hardware: l4x1 # or zero-gpu, a100-large, etc.
startup_duration_timeout: 1h
preload_from_hub:
  - apple/Sharp sharp_2572gikvuh.pt
---
```
## Examples System

Place precompiled outputs in `assets/examples/` (see the discovery sketch below):
- `<name>.{jpg,png,webp}` + `<name>.mp4` + `<name>.ply`
- Or define `assets/examples/manifest.json` with `{label, image, video, ply}` entries
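A hedged sketch of how `discover_examples()` could resolve those two conventions; the real implementation is in `app.py` and may differ in details:

```python
import json
from pathlib import Path

EXAMPLES_DIR = Path("assets/examples")
IMAGE_EXTS = (".jpg", ".png", ".webp")


def discover_examples() -> list[dict]:
    if not EXAMPLES_DIR.is_dir():
        return []
    manifest = EXAMPLES_DIR / "manifest.json"
    if manifest.exists():
        # An explicit manifest wins: entries with {label, image, video, ply}.
        return json.loads(manifest.read_text())
    examples = []
    for image in sorted(p for p in EXAMPLES_DIR.iterdir() if p.suffix in IMAGE_EXTS):
        # Filename convention: <name>.jpg/.png/.webp paired with <name>.mp4 and <name>.ply.
        video = image.with_suffix(".mp4")
        ply = image.with_suffix(".ply")
        examples.append(
            {
                "label": image.stem,
                "image": str(image),
                "video": str(video) if video.exists() else None,
                "ply": str(ply) if ply.exists() else None,
            }
        )
    return examples
```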
## Multi-Image Stacking Roadmap
SHARP predicts 3D Gaussians from a single image. To "stack" multiple images into a unified scene:
### Required Components

1. **Pose Estimation** (`multi_view.py`)
   - Estimate relative camera poses between images
   - Options: COLMAP, hloc, or PnP-based
   - Transform each prediction to a common world frame
2. **Gaussian Merging** (`gaussian_merge.py`)
   - Concatenate Gaussian parameters (means, covariances, colors, opacities)
   - Deduplicate overlapping regions via density-based filtering
   - Optional: fine-tune the merged scene with photometric loss
### UI Changes
- Multi-upload widget
- Alignment preview/validation
- Progress indicator for multi-image processing
### Data Structures

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

import torch

# Gaussians3D is SHARP's Gaussian container type from the upstream package.


@dataclass
class AlignedGaussians:
    gaussians: Gaussians3D
    world_transform: torch.Tensor  # 4x4 SE(3)
    source_image: Path


def merge_gaussians(aligned: list[AlignedGaussians]) -> Gaussians3D:
    # 1. Transform each Gaussian's means by world_transform
    # 2. Concatenate all parameters
    # 3. Density-based pruning in overlapping regions
    ...
```
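A sketch of step 1 of that merge, assuming `Gaussians3D` exposes an `(N, 3)` `means` tensor (an assumption about SHARP's data layout, not confirmed by the upstream code):

```python
import torch


def transform_means(means: torch.Tensor, world_transform: torch.Tensor) -> torch.Tensor:
    """Apply a 4x4 SE(3) world_transform to (N, 3) Gaussian means."""
    ones = torch.ones(means.shape[0], 1, dtype=means.dtype, device=means.device)
    homogeneous = torch.cat([means, ones], dim=1)   # (N, 4)
    transformed = homogeneous @ world_transform.T   # (N, 4)
    return transformed[:, :3]
```

Covariances (or rotations/scales) would also need the rotational part of the transform applied; colors and opacities simply concatenate with `torch.cat(..., dim=0)` in step 2.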
### Dependencies to Add

- `pycolmap` or `hloc` for pose estimation
- `open3d` for point cloud operations (optional)
### Implementation Phases

#### Phase 1: Basic Multi-Image Pipeline
- Add `multi_view.py` with `estimate_relative_pose(img1, img2)` using feature matching (see the sketch after this list)
- Add `gaussian_merge.py` with naive concatenation (no dedup)
- UI: Multi-file upload in new "Stack" tab
- Export merged PLY
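One possible feature-matching implementation of `estimate_relative_pose` for Phase 1, using OpenCV (ORB features plus an essential-matrix decomposition). The intrinsics `K` are an assumption the caller must supply, and `opencv-python` is not currently a listed dependency:

```python
import cv2
import numpy as np


def estimate_relative_pose(img1_path: str, img2_path: str, K: np.ndarray) -> np.ndarray:
    """Return a 4x4 transform from camera 1 to camera 2 (translation up to scale)."""
    img1 = cv2.imread(img1_path, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(img2_path, cv2.IMREAD_GRAYSCALE)

    orb = cv2.ORB_create(nfeatures=4000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t.ravel()  # two-view translation is only known up to scale
    return T
```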
#### Phase 2: Pose Estimation Options
- Integrate COLMAP sparse reconstruction for >2 images
- Add hloc (Hierarchical Localization) as lightweight alternative
- Fallback: manual pose input for known camera rigs
#### Phase 3: Gaussian Deduplication
- Implement KD-tree based nearest-neighbor pruning (see the sketch after this list)
- Merge overlapping Gaussians by averaging parameters
- Add confidence weighting based on view angle
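A sketch of the KD-tree pruning idea, assuming the merged means are available as an `(N, 3)` array; `scipy` is not currently in pyproject.toml, and the greedy keep/drop rule is just one option (averaging with confidence weights is the stated Phase 3 goal):

```python
import numpy as np
from scipy.spatial import cKDTree


def dedup_indices(means: np.ndarray, radius: float = 0.01) -> np.ndarray:
    """Return indices of Gaussians to keep, greedily dropping near-duplicates."""
    tree = cKDTree(means)
    keep = np.ones(len(means), dtype=bool)
    for i in range(len(means)):
        if not keep[i]:
            continue
        # Drop later Gaussians within `radius` of an already-kept one.
        for j in tree.query_ball_point(means[i], r=radius):
            if j > i:
                keep[j] = False
    return np.flatnonzero(keep)
```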
#### Phase 4: Refinement (Optional)
- Photometric loss optimization on merged scene
- Iterative alignment refinement
- Support for depth priors from stereo/MVS
### API Design

```python
from __future__ import annotations

from pathlib import Path
from typing import Literal

import numpy as np

# PredictionOutputs comes from model_utils.py; Gaussians3D from the SHARP package.


# multi_view.py
def estimate_poses(
    images: list[Path],
    method: Literal["colmap", "hloc", "pnp"] = "hloc",
) -> list[np.ndarray]:  # list of 4x4 world-to-camera transforms
    ...


# gaussian_merge.py
def merge_scenes(
    predictions: list[PredictionOutputs],
    poses: list[np.ndarray],
    deduplicate: bool = True,
    dedup_radius: float = 0.01,  # meters
) -> Gaussians3D:
    ...


# app.py (Stack tab)
def run_stack(
    images: list[str],  # Gradio multi-file upload
    pose_method: str,
    deduplicate: bool,
) -> tuple[str | None, str | None, str]:  # video, ply, status
    ...
```
### MCP Extension

```python
# mcp_server.py additions
@mcp.tool()
def sharp_stack(
    image_paths: list[str],
    pose_method: str = "hloc",
    deduplicate: bool = True,
    render_video: bool = True,
) -> dict:
    """Stack multiple images into unified 3D Gaussian scene."""
    ...
```
### Technical Considerations

**Coordinate Systems:**
- SHARP outputs Gaussians in camera-centric coordinates
- Need to transform to world frame using estimated poses
- Convention: Y-up, -Z forward (OpenGL style)
**Memory Management:**
- Each SHARP prediction uses roughly 50–200 MB of GPU memory
- Batch processing with model unload between predictions
- Consider a streaming merge for >10 images
**Quality Metrics:**
- Reprojection error for pose validation
- Gaussian density histogram for coverage analysis
- Visual comparison with ground truth (if available)