# WARP.md

This file provides guidance to WARP (warp.dev) when working with code in this repository.

## Project Overview

SHARP (single-image 3D Gaussian scene prediction) Gradio demo. Wraps Apple's SHARP model to predict 3D Gaussian scenes from single images, export `.ply` files, and optionally render camera trajectory videos.

Optimized for local CUDA (4090/3090/3070 Ti) or HuggingFace Spaces GPU. Includes an MCP server for programmatic access.

## Development Commands

```bash
# Install dependencies (uses uv package manager)
uv sync

# Run the Gradio app (port 49200 by default)
uv run python app.py

# Run MCP server (stdio transport)
uv run python mcp_server.py

# Lint with ruff
uv run ruff check .
uv run ruff format .
```
## Codebase Map

```
ml-sharp/
├── app.py                      # Gradio UI (tabs: Run, Examples, About, Settings)
│   ├── build_demo()            # Main UI builder
│   ├── run_sharp()             # Inference entrypoint called by UI
│   └── discover_examples()     # Load precompiled examples
├── model_utils.py              # Core inference + rendering
│   ├── ModelWrapper            # Checkpoint loading, predictor caching
│   │   ├── predict_to_ply()    # Image → Gaussians → PLY
│   │   └── render_video()      # Gaussians → MP4 trajectory
│   ├── PredictionOutputs       # Dataclass for inference results
│   ├── configure_gpu_mode()    # Switch between local/Spaces GPU
│   └── predict_and_maybe_render_gpu  # Module-level entrypoint
├── hardware_config.py          # GPU hardware selection & persistence
│   ├── HardwareConfig          # Dataclass with mode, hardware, duration
│   ├── get_hardware_choices()  # Dropdown options
│   └── SPACES_HARDWARE_SPECS   # HF Spaces GPU specs & pricing
├── mcp_server.py               # MCP server for programmatic access
│   ├── sharp_predict           # Tool: image → PLY + video
│   ├── list_outputs            # Tool: list generated files
│   └── sharp://info            # Resource: GPU status, config
├── assets/examples/            # Precompiled example outputs
├── outputs/                    # Runtime outputs (PLY, MP4)
├── .hardware_config.json       # Persisted hardware settings
├── pyproject.toml              # Dependencies (uv)
└── WARP.md                     # This file
```

### Data Flow

```
Image → load_rgb() → predict_image() → Gaussians3D → save_ply() → PLY
                                            │
                                            └─→ render_video() → MP4
```
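A hedged end-to-end sketch of this flow through `ModelWrapper` (the keyword arguments here are illustrative, not confirmed signatures; check `model_utils.py` for the real ones):

```python
from model_utils import ModelWrapper

wrapper = ModelWrapper()  # loads the checkpoint per the resolution order below

# Image → Gaussians → PLY
outputs = wrapper.predict_to_ply("photo.jpg", output_dir="outputs/")

# Gaussians → MP4 trajectory (CUDA only; returns None on CPU-only systems)
video_path = wrapper.render_video(outputs, trajectory_type="rotate_forward")
```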
## Architecture

### Core Files

- `app.py` – Gradio UI with tabs for Run/Examples/About/Settings. Handles example discovery from `assets/examples/` via `manifest.json` or filename conventions.
- `model_utils.py` – SHARP model wrapper with checkpoint loading (HF Hub → CDN fallback), inference via `predict_to_ply()`, and CUDA video rendering via `render_video()`.
- `hardware_config.py` – GPU hardware selection between local CUDA and HuggingFace Spaces. Persists to `.hardware_config.json`.
- `mcp_server.py` – MCP server exposing the `sharp_predict` tool and the `sharp://info` resource.

### Key Patterns

**Local CUDA mode**: Model kept on GPU by default (`SHARP_KEEP_MODEL_ON_DEVICE=1`) for better performance on dedicated GPUs.

**Spaces GPU mode**: Uses the `@spaces.GPU` decorator for dynamic GPU allocation on HuggingFace Spaces. Configurable via the Settings tab.

**Checkpoint resolution order** (sketched below):

1. `SHARP_CHECKPOINT_PATH` env var
2. HF Hub cache
3. HF Hub download
4. Upstream CDN via `torch.hub`
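A minimal sketch of that fallback chain, assuming illustrative names (`resolve_checkpoint` and `CDN_URL` are not the actual identifiers in `model_utils.py`):

```python
import os
from pathlib import Path

import torch
from huggingface_hub import hf_hub_download

REPO_ID = os.environ.get("SHARP_HF_REPO_ID", "apple/Sharp")
FILENAME = os.environ.get("SHARP_HF_FILENAME", "sharp_2572gikvuh.pt")
CDN_URL = "https://example.com/sharp_checkpoint.pt"  # placeholder; the real URL lives in model_utils.py

def resolve_checkpoint() -> Path:
    """Try each checkpoint source in order, falling through on failure."""
    # 1. Explicit override via env var
    override = os.environ.get("SHARP_CHECKPOINT_PATH")
    if override and Path(override).exists():
        return Path(override)
    # 2./3. HF Hub: hf_hub_download checks the local cache first, then downloads
    try:
        return Path(hf_hub_download(repo_id=REPO_ID, filename=FILENAME))
    except Exception:
        pass
    # 4. Upstream CDN via torch.hub (caches under <hub_dir>/checkpoints)
    torch.hub.load_state_dict_from_url(CDN_URL, map_location="cpu")
    return Path(torch.hub.get_dir()) / "checkpoints" / Path(CDN_URL).name
```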
**Video rendering**: Requires CUDA (gsplat). Falls back gracefully on CPU-only systems by returning `None` for the video path.

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `SHARP_PORT` | `49200` | Gradio server port |
| `SHARP_MCP_PORT` | `49201` | MCP server port |
| `SHARP_CHECKPOINT_PATH` | (unset) | Override local checkpoint path |
| `SHARP_HF_REPO_ID` | `apple/Sharp` | HuggingFace repo |
| `SHARP_HF_FILENAME` | `sharp_2572gikvuh.pt` | Checkpoint filename |
| `SHARP_KEEP_MODEL_ON_DEVICE` | `1` | Keep model on GPU (set `0` to free VRAM) |
| `CUDA_VISIBLE_DEVICES` | (unset) | GPU selection (e.g., `0` or `0,1`) |
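For example, several of these variables can be combined in one launch (values illustrative):

```bash
# Run on GPU 1, on a custom port, freeing VRAM between requests
CUDA_VISIBLE_DEVICES=1 SHARP_PORT=8080 SHARP_KEEP_MODEL_ON_DEVICE=0 \
  uv run python app.py
```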
## Gradio API

The API is enabled by default. Access it at `http://localhost:49200/?view=api`.

### Endpoint: `/api/run_sharp`

```python
import requests

response = requests.post(
    "http://localhost:49200/api/run_sharp",
    json={
        "data": [
            "/path/to/image.jpg",  # image_path
            "rotate_forward",      # trajectory_type
            0,                     # output_long_side (0 = match input)
            60,                    # num_frames
            30,                    # fps
            True,                  # render_video
        ]
    },
)
result = response.json()["data"]
video_path, ply_path, status = result
```
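Equivalently, a sketch using the official `gradio_client` package (assuming the endpoint is exposed under the api_name `/run_sharp`, with the same argument order as above):

```python
from gradio_client import Client

client = Client("http://localhost:49200/")
video_path, ply_path, status = client.predict(
    "/path/to/image.jpg",  # image_path
    "rotate_forward",      # trajectory_type
    0,                     # output_long_side (0 = match input)
    60,                    # num_frames
    30,                    # fps
    True,                  # render_video
    api_name="/run_sharp",
)
```

Depending on the input component type and `gradio_client` version, file inputs may need to be wrapped with `gradio_client.handle_file()`.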
## MCP Server

Run the MCP server for integration with AI agents:

```bash
uv run python mcp_server.py
```

### MCP Config (for clients like Warp)

```json
{
  "mcpServers": {
    "sharp": {
      "command": "uv",
      "args": ["run", "python", "mcp_server.py"],
      "cwd": "/home/robin/CascadeProjects/ml-sharp"
    }
  }
}
```

### Tools

- `sharp_predict(image_path, render_video=True, trajectory_type="rotate_forward", ...)` – Run inference
- `list_outputs()` – List generated PLY/MP4 files

### Resources

- `sharp://info` – GPU status, configuration
- `sharp://help` – Usage documentation
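As a sketch, calling the `sharp_predict` tool from Python over stdio with the MCP SDK (`mcp` package; paths and arguments illustrative):

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch mcp_server.py as a stdio subprocess, as a client like Warp would
    params = StdioServerParameters(command="uv", args=["run", "python", "mcp_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "sharp_predict",
                {"image_path": "/path/to/image.jpg", "render_video": True},
            )
            print(result)

asyncio.run(main())
```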
## Multi-GPU Configuration

Select a GPU via environment variable:

```bash
# Use GPU 0 (e.g., 4090)
CUDA_VISIBLE_DEVICES=0 uv run python app.py

# Use GPU 1 (e.g., 3090)
CUDA_VISIBLE_DEVICES=1 uv run python app.py
```
## HuggingFace Spaces GPU

The app supports HuggingFace Spaces paid GPUs for faster inference or larger models. Configure via the **Settings** tab.

### Available Hardware

| Hardware | VRAM | Price/hr | Best For |
|----------|------|----------|----------|
| ZeroGPU (H200) | 70GB | Free (PRO) | Demos, dynamic allocation |
| T4 small | 16GB | $0.40 | Light workloads |
| T4 medium | 16GB | $0.60 | Standard workloads |
| L4x1 | 24GB | $0.80 | Standard inference |
| L4x4 | 96GB | $3.80 | Multi-GPU |
| L40Sx1 | 48GB | $1.80 | Large models |
| L40Sx4 | 192GB | $8.30 | Very large models |
| A10G small | 24GB | $1.00 | Balanced |
| A10G large | 24GB | $1.50 | More CPU/RAM |
| A100 large | 80GB | $2.50 | Maximum VRAM |

### Deploying to Spaces

1. Push to a HuggingFace Space
2. Set hardware in the Space settings (or use `suggested_hardware` in README.md)
3. The app auto-detects the Spaces environment via the `SPACE_ID` env var (see the sketch below)
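A minimal sketch of that detection (the function name here is illustrative, not necessarily what the codebase uses):

```python
import os

def is_running_on_spaces() -> bool:
    """HuggingFace Spaces sets SPACE_ID in the container environment."""
    return os.environ.get("SPACE_ID") is not None
```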
### README.md Metadata for Spaces

```yaml
---
title: SHARP - 3D Gaussian Scene Prediction
emoji: 🪐
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 6.2.0
python_version: 3.13.11
app_file: app.py
suggested_hardware: l4x1  # or zero-gpu, a100-large, etc.
startup_duration_timeout: 1h
preload_from_hub:
  - apple/Sharp sharp_2572gikvuh.pt
---
```
## Examples System

Place precompiled outputs in `assets/examples/`:

- `<name>.{jpg,png,webp}` + `<name>.mp4` + `<name>.ply`
- Or define `assets/examples/manifest.json` with `{label, image, video, ply}` entries (example below)
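A hypothetical `manifest.json` with one entry (file names and top-level list structure are illustrative):

```json
[
  {
    "label": "Living room",
    "image": "living_room.jpg",
    "video": "living_room.mp4",
    "ply": "living_room.ply"
  }
]
```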
## Multi-Image Stacking Roadmap

SHARP predicts 3D Gaussians from a single image. To "stack" multiple images into a unified scene:

### Required Components

1. **Pose Estimation** (`multi_view.py`)
   - Estimate relative camera poses between images
   - Options: COLMAP, hloc, or PnP-based
   - Transform each prediction to a common world frame
2. **Gaussian Merging** (`gaussian_merge.py`)
   - Concatenate Gaussian parameters (means, covariances, colors, opacities)
   - Deduplicate overlapping regions via density-based filtering
   - Optional: fine-tune the merged scene with a photometric loss
3. **UI Changes**
   - Multi-upload widget
   - Alignment preview/validation
   - Progress indicator for multi-image processing
### Data Structures

```python
from dataclasses import dataclass
from pathlib import Path

import torch

@dataclass
class AlignedGaussians:
    gaussians: Gaussians3D          # per-image SHARP prediction
    world_transform: torch.Tensor   # 4x4 SE(3) camera-to-world transform
    source_image: Path

def merge_gaussians(aligned: list[AlignedGaussians]) -> Gaussians3D:
    # 1. Transform each Gaussian's means by world_transform
    # 2. Concatenate all parameters
    # 3. Density-based pruning in overlapping regions
    ...
```
### Dependencies to Add

- `pycolmap` or `hloc` for pose estimation
- `open3d` for point cloud operations (optional)
### Implementation Phases

#### Phase 1: Basic Multi-Image Pipeline

- [ ] Add `multi_view.py` with `estimate_relative_pose(img1, img2)` using feature matching
- [ ] Add `gaussian_merge.py` with naive concatenation (no dedup)
- [ ] UI: multi-file upload in a new "Stack" tab
- [ ] Export merged PLY

#### Phase 2: Pose Estimation Options

- [ ] Integrate COLMAP sparse reconstruction for >2 images
- [ ] Add hloc (Hierarchical Localization) as a lightweight alternative
- [ ] Fallback: manual pose input for known camera rigs

#### Phase 3: Gaussian Deduplication

- [ ] Implement KD-tree based nearest-neighbor pruning (sketched after this list)
- [ ] Merge overlapping Gaussians by averaging parameters
- [ ] Add confidence weighting based on view angle
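A minimal sketch of the Phase 3 pruning idea using `scipy.spatial.cKDTree` (the function and its behavior are illustrative, not part of the current codebase):

```python
import numpy as np
from scipy.spatial import cKDTree

def prune_duplicate_means(means: np.ndarray, dedup_radius: float = 0.01) -> np.ndarray:
    """Return a boolean keep-mask over Gaussian means of shape (N, 3).

    For every pair of means closer than dedup_radius (meters), drop the
    later one; a fuller version would average parameters instead.
    """
    tree = cKDTree(means)
    keep = np.ones(len(means), dtype=bool)
    for i, j in tree.query_pairs(dedup_radius):  # unordered index pairs, i < j
        if keep[i] and keep[j]:
            keep[j] = False
    return keep
```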
#### Phase 4: Refinement (Optional)

- [ ] Photometric loss optimization on the merged scene
- [ ] Iterative alignment refinement
- [ ] Support for depth priors from stereo/MVS
### API Design

```python
from pathlib import Path
from typing import Literal

import numpy as np

# multi_view.py
def estimate_poses(
    images: list[Path],
    method: Literal["colmap", "hloc", "pnp"] = "hloc",
) -> list[np.ndarray]:  # list of 4x4 world-to-camera transforms
    ...

# gaussian_merge.py
def merge_scenes(
    predictions: list[PredictionOutputs],
    poses: list[np.ndarray],
    deduplicate: bool = True,
    dedup_radius: float = 0.01,  # meters
) -> Gaussians3D:
    ...

# app.py (Stack tab)
def run_stack(
    images: list[str],  # Gradio multi-file upload
    pose_method: str,
    deduplicate: bool,
) -> tuple[str | None, str | None, str]:  # video, ply, status
    ...
```
### MCP Extension

```python
# mcp_server.py additions
@mcp.tool()
def sharp_stack(
    image_paths: list[str],
    pose_method: str = "hloc",
    deduplicate: bool = True,
    render_video: bool = True,
) -> dict:
    """Stack multiple images into a unified 3D Gaussian scene."""
    ...
```
### Technical Considerations

**Coordinate Systems**:

- SHARP outputs Gaussians in camera-centric coordinates
- They must be transformed to the world frame using the estimated poses (see the sketch below)
- Convention: Y-up, -Z forward (OpenGL style)
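A minimal sketch of that transform applied to Gaussian means, assuming a 4x4 camera-to-world matrix (names illustrative):

```python
import torch

def transform_means(means: torch.Tensor, camera_to_world: torch.Tensor) -> torch.Tensor:
    """Apply a 4x4 SE(3) transform to Gaussian means of shape (N, 3)."""
    R = camera_to_world[:3, :3]  # rotation
    t = camera_to_world[:3, 3]   # translation
    return means @ R.T + t       # x_world = R @ x_cam + t, batched over N
```

Note this only handles means; Gaussian rotations/covariances need the rotation applied as well.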
**Memory Management**:

- Each SHARP prediction uses roughly 50-200MB of GPU memory
- Batch processing with model unload between predictions
- Consider a streaming merge for >10 images

**Quality Metrics**:

- Reprojection error for pose validation (sketched below)
- Gaussian density histogram for coverage analysis
- Visual comparison with ground truth (if available)
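For instance, a standard reprojection-error check, assuming a pinhole intrinsics matrix `K`, a world-to-camera pose, and the +Z-forward camera convention usual in computer vision (all names illustrative):

```python
import numpy as np

def reprojection_error(
    points_world: np.ndarray,     # (N, 3) triangulated 3D points
    observed_px: np.ndarray,      # (N, 2) matched 2D detections
    world_to_camera: np.ndarray,  # 4x4 SE(3)
    K: np.ndarray,                # 3x3 pinhole intrinsics
) -> float:
    """Mean L2 distance in pixels between projected and observed points."""
    R, t = world_to_camera[:3, :3], world_to_camera[:3, 3]
    cam = points_world @ R.T + t   # world -> camera frame
    px = cam @ K.T                 # apply intrinsics (homogeneous pixels)
    px = px[:, :2] / px[:, 2:3]    # perspective divide
    return float(np.linalg.norm(px - observed_px, axis=1).mean())
```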