# WARP.md

This file provides guidance to WARP (warp.dev) when working with code in this repository.

## Project Overview

SHARP (single-image 3D Gaussian scene prediction) Gradio demo. Wraps Apple's SHARP model to predict 3D Gaussian scenes from single images, export `.ply` files, and optionally render camera trajectory videos.

Optimized for local CUDA (4090/3090/3070 Ti) or HuggingFace Spaces GPU. Includes an MCP server for programmatic access.

## Development Commands

```bash
# Install dependencies (uses uv package manager)
uv sync

# Run the Gradio app (port 49200 by default)
uv run python app.py

# Run MCP server (stdio transport)
uv run python mcp_server.py

# Lint with ruff
uv run ruff check .
uv run ruff format .
```
## Codebase Map

```
ml-sharp/
├── app.py                      # Gradio UI (tabs: Run, Examples, About, Settings)
│   ├── build_demo()            # Main UI builder
│   ├── run_sharp()             # Inference entrypoint called by UI
│   └── discover_examples()     # Load precompiled examples
├── model_utils.py              # Core inference + rendering
│   ├── ModelWrapper            # Checkpoint loading, predictor caching
│   │   ├── predict_to_ply()    # Image → Gaussians → PLY
│   │   └── render_video()      # Gaussians → MP4 trajectory
│   ├── PredictionOutputs       # Dataclass for inference results
│   ├── configure_gpu_mode()    # Switch between local/Spaces GPU
│   └── predict_and_maybe_render_gpu  # Module-level entrypoint
├── hardware_config.py          # GPU hardware selection & persistence
│   ├── HardwareConfig          # Dataclass with mode, hardware, duration
│   ├── get_hardware_choices()  # Dropdown options
│   └── SPACES_HARDWARE_SPECS   # HF Spaces GPU specs & pricing
├── mcp_server.py               # MCP server for programmatic access
│   ├── sharp_predict           # Tool: image → PLY + video
│   ├── list_outputs            # Tool: list generated files
│   └── sharp://info            # Resource: GPU status, config
├── assets/examples/            # Precompiled example outputs
├── outputs/                    # Runtime outputs (PLY, MP4)
├── .hardware_config.json       # Persisted hardware settings
├── pyproject.toml              # Dependencies (uv)
└── WARP.md                     # This file
```

### Data Flow

```
Image → load_rgb() → predict_image() → Gaussians3D → save_ply() → PLY
                                            │
                                            └─→ render_video() → MP4
```
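A hedged end-to-end sketch of this flow through `ModelWrapper` (the keyword arguments here are illustrative, not confirmed signatures; check `model_utils.py` for the real ones):

```python
from model_utils import ModelWrapper

wrapper = ModelWrapper()  # loads the checkpoint per the resolution order below

# Image → Gaussians → PLY
outputs = wrapper.predict_to_ply("photo.jpg", output_dir="outputs/")

# Gaussians → MP4 trajectory (CUDA only; returns None on CPU-only systems)
video_path = wrapper.render_video(outputs, trajectory_type="rotate_forward")
```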
## Architecture

### Core Files

- `app.py` – Gradio UI with tabs for Run/Examples/About/Settings. Handles example discovery from `assets/examples/` via `manifest.json` or filename conventions.
- `model_utils.py` – SHARP model wrapper with checkpoint loading (HF Hub → CDN fallback), inference via `predict_to_ply()`, and CUDA video rendering via `render_video()`.
- `hardware_config.py` – GPU hardware selection between local CUDA and HuggingFace Spaces. Persists to `.hardware_config.json`.
- `mcp_server.py` – MCP server exposing the `sharp_predict` tool and the `sharp://info` resource.

### Key Patterns

**Local CUDA mode**: Model kept on GPU by default (`SHARP_KEEP_MODEL_ON_DEVICE=1`) for better performance on dedicated GPUs.

**Spaces GPU mode**: Uses the `@spaces.GPU` decorator for dynamic GPU allocation on HuggingFace Spaces. Configurable via the Settings tab.

**Checkpoint resolution order** (sketched below):

1. `SHARP_CHECKPOINT_PATH` env var
2. HF Hub cache
3. HF Hub download
4. Upstream CDN via `torch.hub`
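A minimal sketch of that fallback chain, assuming illustrative names (`resolve_checkpoint` and `CDN_URL` are not the actual identifiers in `model_utils.py`):

```python
import os
from pathlib import Path

import torch
from huggingface_hub import hf_hub_download

REPO_ID = os.environ.get("SHARP_HF_REPO_ID", "apple/Sharp")
FILENAME = os.environ.get("SHARP_HF_FILENAME", "sharp_2572gikvuh.pt")
CDN_URL = "https://example.com/sharp_checkpoint.pt"  # placeholder; the real URL lives in model_utils.py

def resolve_checkpoint() -> Path:
    """Try each checkpoint source in order, falling through on failure."""
    # 1. Explicit override via env var
    override = os.environ.get("SHARP_CHECKPOINT_PATH")
    if override and Path(override).exists():
        return Path(override)
    # 2./3. HF Hub: hf_hub_download checks the local cache first, then downloads
    try:
        return Path(hf_hub_download(repo_id=REPO_ID, filename=FILENAME))
    except Exception:
        pass
    # 4. Upstream CDN via torch.hub (caches under <hub_dir>/checkpoints)
    torch.hub.load_state_dict_from_url(CDN_URL, map_location="cpu")
    return Path(torch.hub.get_dir()) / "checkpoints" / Path(CDN_URL).name
```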
**Video rendering**: Requires CUDA (gsplat). Falls back gracefully on CPU-only systems by returning `None` for the video path.

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `SHARP_PORT` | `49200` | Gradio server port |
| `SHARP_MCP_PORT` | `49201` | MCP server port |
| `SHARP_CHECKPOINT_PATH` | (unset) | Override local checkpoint path |
| `SHARP_HF_REPO_ID` | `apple/Sharp` | HuggingFace repo |
| `SHARP_HF_FILENAME` | `sharp_2572gikvuh.pt` | Checkpoint filename |
| `SHARP_KEEP_MODEL_ON_DEVICE` | `1` | Keep model on GPU (set `0` to free VRAM) |
| `CUDA_VISIBLE_DEVICES` | (unset) | GPU selection (e.g., `0` or `0,1`) |
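For example, several of these variables can be combined in one launch (values illustrative):

```bash
# Run on GPU 1, on a custom port, freeing VRAM between requests
CUDA_VISIBLE_DEVICES=1 SHARP_PORT=8080 SHARP_KEEP_MODEL_ON_DEVICE=0 \
  uv run python app.py
```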
## Gradio API

The API is enabled by default. Access it at `http://localhost:49200/?view=api`.

### Endpoint: `/api/run_sharp`

```python
import requests

response = requests.post(
    "http://localhost:49200/api/run_sharp",
    json={
        "data": [
            "/path/to/image.jpg",  # image_path
            "rotate_forward",      # trajectory_type
            0,                     # output_long_side (0 = match input)
            60,                    # num_frames
            30,                    # fps
            True,                  # render_video
        ]
    },
)
result = response.json()["data"]
video_path, ply_path, status = result
```
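Equivalently, a sketch using the official `gradio_client` package (assuming the endpoint is exposed under the api_name `/run_sharp`, with the same argument order as above):

```python
from gradio_client import Client

client = Client("http://localhost:49200/")
video_path, ply_path, status = client.predict(
    "/path/to/image.jpg",  # image_path
    "rotate_forward",      # trajectory_type
    0,                     # output_long_side (0 = match input)
    60,                    # num_frames
    30,                    # fps
    True,                  # render_video
    api_name="/run_sharp",
)
```

Depending on the input component type and `gradio_client` version, file inputs may need to be wrapped with `gradio_client.handle_file()`.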
## MCP Server

Run the MCP server for integration with AI agents:

```bash
uv run python mcp_server.py
```

### MCP Config (for clients like Warp)

```json
{
  "mcpServers": {
    "sharp": {
      "command": "uv",
      "args": ["run", "python", "mcp_server.py"],
      "cwd": "/home/robin/CascadeProjects/ml-sharp"
    }
  }
}
```

### Tools

- `sharp_predict(image_path, render_video=True, trajectory_type="rotate_forward", ...)` – Run inference
- `list_outputs()` – List generated PLY/MP4 files

### Resources

- `sharp://info` – GPU status, configuration
- `sharp://help` – Usage documentation
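As a sketch, calling the `sharp_predict` tool from Python over stdio with the MCP SDK (`mcp` package; paths and arguments illustrative):

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch mcp_server.py as a stdio subprocess, as a client like Warp would
    params = StdioServerParameters(command="uv", args=["run", "python", "mcp_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "sharp_predict",
                {"image_path": "/path/to/image.jpg", "render_video": True},
            )
            print(result)

asyncio.run(main())
```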
## Multi-GPU Configuration

Select a GPU via environment variable:

```bash
# Use GPU 0 (e.g., 4090)
CUDA_VISIBLE_DEVICES=0 uv run python app.py

# Use GPU 1 (e.g., 3090)
CUDA_VISIBLE_DEVICES=1 uv run python app.py
```
## HuggingFace Spaces GPU

The app supports HuggingFace Spaces paid GPUs for faster inference or larger models. Configure via the **Settings** tab.

### Available Hardware

| Hardware | VRAM | Price/hr | Best For |
|----------|------|----------|----------|
| ZeroGPU (H200) | 70GB | Free (PRO) | Demos, dynamic allocation |
| T4 small | 16GB | $0.40 | Light workloads |
| T4 medium | 16GB | $0.60 | Standard workloads |
| L4x1 | 24GB | $0.80 | Standard inference |
| L4x4 | 96GB | $3.80 | Multi-GPU |
| L40Sx1 | 48GB | $1.80 | Large models |
| L40Sx4 | 192GB | $8.30 | Very large models |
| A10G small | 24GB | $1.00 | Balanced |
| A10G large | 24GB | $1.50 | More CPU/RAM |
| A100 large | 80GB | $2.50 | Maximum VRAM |

### Deploying to Spaces

1. Push to a HuggingFace Space
2. Set hardware in the Space settings (or use `suggested_hardware` in README.md)
3. The app auto-detects the Spaces environment via the `SPACE_ID` env var (see the sketch below)
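A minimal sketch of that detection (the function name here is illustrative, not necessarily what the codebase uses):

```python
import os

def is_running_on_spaces() -> bool:
    """HuggingFace Spaces sets SPACE_ID in the container environment."""
    return os.environ.get("SPACE_ID") is not None
```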
### README.md Metadata for Spaces

```yaml
---
title: SHARP - 3D Gaussian Scene Prediction
emoji: 🪐
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 6.2.0
python_version: 3.13.11
app_file: app.py
suggested_hardware: l4x1  # or zero-gpu, a100-large, etc.
startup_duration_timeout: 1h
preload_from_hub:
  - apple/Sharp sharp_2572gikvuh.pt
---
```
## Examples System

Place precompiled outputs in `assets/examples/`:

- `<name>.{jpg,png,webp}` + `<name>.mp4` + `<name>.ply`
- Or define `assets/examples/manifest.json` with `{label, image, video, ply}` entries (example below)
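A hypothetical `manifest.json` with one entry (file names and top-level list structure are illustrative):

```json
[
  {
    "label": "Living room",
    "image": "living_room.jpg",
    "video": "living_room.mp4",
    "ply": "living_room.ply"
  }
]
```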
## Multi-Image Stacking Roadmap

SHARP predicts 3D Gaussians from a single image. To "stack" multiple images into a unified scene:

### Required Components

1. **Pose Estimation** (`multi_view.py`)
   - Estimate relative camera poses between images
   - Options: COLMAP, hloc, or PnP-based
   - Transform each prediction to a common world frame
2. **Gaussian Merging** (`gaussian_merge.py`)
   - Concatenate Gaussian parameters (means, covariances, colors, opacities)
   - Deduplicate overlapping regions via density-based filtering
   - Optional: fine-tune the merged scene with a photometric loss
3. **UI Changes**
   - Multi-upload widget
   - Alignment preview/validation
   - Progress indicator for multi-image processing
### Data Structures

```python
from dataclasses import dataclass
from pathlib import Path

import torch

@dataclass
class AlignedGaussians:
    gaussians: Gaussians3D          # per-image SHARP prediction
    world_transform: torch.Tensor   # 4x4 SE(3) camera-to-world transform
    source_image: Path

def merge_gaussians(aligned: list[AlignedGaussians]) -> Gaussians3D:
    # 1. Transform each Gaussian's means by world_transform
    # 2. Concatenate all parameters
    # 3. Density-based pruning in overlapping regions
    ...
```
### Dependencies to Add

- `pycolmap` or `hloc` for pose estimation
- `open3d` for point cloud operations (optional)
### Implementation Phases

#### Phase 1: Basic Multi-Image Pipeline

- [ ] Add `multi_view.py` with `estimate_relative_pose(img1, img2)` using feature matching
- [ ] Add `gaussian_merge.py` with naive concatenation (no dedup)
- [ ] UI: multi-file upload in a new "Stack" tab
- [ ] Export merged PLY

#### Phase 2: Pose Estimation Options

- [ ] Integrate COLMAP sparse reconstruction for >2 images
- [ ] Add hloc (Hierarchical Localization) as a lightweight alternative
- [ ] Fallback: manual pose input for known camera rigs

#### Phase 3: Gaussian Deduplication

- [ ] Implement KD-tree based nearest-neighbor pruning (sketched after this list)
- [ ] Merge overlapping Gaussians by averaging parameters
- [ ] Add confidence weighting based on view angle
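A minimal sketch of the Phase 3 pruning idea using `scipy.spatial.cKDTree` (the function and its behavior are illustrative, not part of the current codebase):

```python
import numpy as np
from scipy.spatial import cKDTree

def prune_duplicate_means(means: np.ndarray, dedup_radius: float = 0.01) -> np.ndarray:
    """Return a boolean keep-mask over Gaussian means of shape (N, 3).

    For every pair of means closer than dedup_radius (meters), drop the
    later one; a fuller version would average parameters instead.
    """
    tree = cKDTree(means)
    keep = np.ones(len(means), dtype=bool)
    for i, j in tree.query_pairs(dedup_radius):  # unordered index pairs, i < j
        if keep[i] and keep[j]:
            keep[j] = False
    return keep
```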
#### Phase 4: Refinement (Optional)

- [ ] Photometric loss optimization on the merged scene
- [ ] Iterative alignment refinement
- [ ] Support for depth priors from stereo/MVS
### API Design

```python
from pathlib import Path
from typing import Literal

import numpy as np

# multi_view.py
def estimate_poses(
    images: list[Path],
    method: Literal["colmap", "hloc", "pnp"] = "hloc",
) -> list[np.ndarray]:  # list of 4x4 world-to-camera transforms
    ...

# gaussian_merge.py
def merge_scenes(
    predictions: list[PredictionOutputs],
    poses: list[np.ndarray],
    deduplicate: bool = True,
    dedup_radius: float = 0.01,  # meters
) -> Gaussians3D:
    ...

# app.py (Stack tab)
def run_stack(
    images: list[str],  # Gradio multi-file upload
    pose_method: str,
    deduplicate: bool,
) -> tuple[str | None, str | None, str]:  # video, ply, status
    ...
```
### MCP Extension

```python
# mcp_server.py additions
@mcp.tool()
def sharp_stack(
    image_paths: list[str],
    pose_method: str = "hloc",
    deduplicate: bool = True,
    render_video: bool = True,
) -> dict:
    """Stack multiple images into a unified 3D Gaussian scene."""
    ...
```
### Technical Considerations

**Coordinate Systems**:

- SHARP outputs Gaussians in camera-centric coordinates
- They must be transformed to the world frame using the estimated poses (see the sketch below)
- Convention: Y-up, -Z forward (OpenGL style)
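A minimal sketch of that transform applied to Gaussian means, assuming a 4x4 camera-to-world matrix (names illustrative):

```python
import torch

def transform_means(means: torch.Tensor, camera_to_world: torch.Tensor) -> torch.Tensor:
    """Apply a 4x4 SE(3) transform to Gaussian means of shape (N, 3)."""
    R = camera_to_world[:3, :3]  # rotation
    t = camera_to_world[:3, 3]   # translation
    return means @ R.T + t       # x_world = R @ x_cam + t, batched over N
```

Note this only handles means; Gaussian rotations/covariances need the rotation applied as well.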
**Memory Management**:

- Each SHARP prediction uses roughly 50-200MB of GPU memory
- Batch processing with model unload between predictions
- Consider a streaming merge for >10 images

**Quality Metrics**:

- Reprojection error for pose validation (sketched below)
- Gaussian density histogram for coverage analysis
- Visual comparison with ground truth (if available)
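For instance, a standard reprojection-error check, assuming a pinhole intrinsics matrix `K`, a world-to-camera pose, and the +Z-forward camera convention usual in computer vision (all names illustrative):

```python
import numpy as np

def reprojection_error(
    points_world: np.ndarray,     # (N, 3) triangulated 3D points
    observed_px: np.ndarray,      # (N, 2) matched 2D detections
    world_to_camera: np.ndarray,  # 4x4 SE(3)
    K: np.ndarray,                # 3x3 pinhole intrinsics
) -> float:
    """Mean L2 distance in pixels between projected and observed points."""
    R, t = world_to_camera[:3, :3], world_to_camera[:3, 3]
    cam = points_world @ R.T + t   # world -> camera frame
    px = cam @ K.T                 # apply intrinsics (homogeneous pixels)
    px = px[:, :2] / px[:, 2:3]    # perspective divide
    return float(np.linalg.norm(px - observed_px, axis=1).mean())
```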