# WARP.md

This file provides guidance to WARP (warp.dev) when working with code in this repository.

## Project Overview

A Gradio demo for SHARP (single-image 3D Gaussian scene prediction). It wraps Apple's SHARP model to predict 3D Gaussian scenes from single images, export `.ply` files, and optionally render camera-trajectory videos.

Optimized for local CUDA GPUs (e.g., RTX 4090/3090/3070 Ti) or HuggingFace Spaces GPUs. Includes an MCP server for programmatic access.

## Development Commands

```bash
# Install dependencies (uses uv package manager)
uv sync

# Run the Gradio app (port 49200 by default)
uv run python app.py

# Run MCP server (stdio transport)
uv run python mcp_server.py

# Lint with ruff
uv run ruff check .
uv run ruff format .
```

## Codebase Map

```
ml-sharp/
β”œβ”€β”€ app.py              # Gradio UI (tabs: Run, Examples, About, Settings)
β”‚   β”œβ”€β”€ build_demo()    # Main UI builder
β”‚   β”œβ”€β”€ run_sharp()     # Inference entrypoint called by UI
β”‚   └── discover_examples()  # Load precompiled examples
β”œβ”€β”€ model_utils.py      # Core inference + rendering
β”‚   β”œβ”€β”€ ModelWrapper    # Checkpoint loading, predictor caching
β”‚   β”‚   β”œβ”€β”€ predict_to_ply()   # Image β†’ Gaussians β†’ PLY
β”‚   β”‚   └── render_video()     # Gaussians β†’ MP4 trajectory
β”‚   β”œβ”€β”€ PredictionOutputs      # Dataclass for inference results
β”‚   β”œβ”€β”€ configure_gpu_mode()   # Switch between local/Spaces GPU
β”‚   └── predict_and_maybe_render_gpu  # Module-level entrypoint
β”œβ”€β”€ hardware_config.py  # GPU hardware selection & persistence
β”‚   β”œβ”€β”€ HardwareConfig  # Dataclass with mode, hardware, duration
β”‚   β”œβ”€β”€ get_hardware_choices()  # Dropdown options
β”‚   └── SPACES_HARDWARE_SPECS   # HF Spaces GPU specs & pricing
β”œβ”€β”€ mcp_server.py       # MCP server for programmatic access
β”‚   β”œβ”€β”€ sharp_predict   # Tool: image β†’ PLY + video
β”‚   β”œβ”€β”€ list_outputs    # Tool: list generated files
β”‚   └── sharp://info    # Resource: GPU status, config
β”œβ”€β”€ assets/examples/    # Precompiled example outputs
β”œβ”€β”€ outputs/            # Runtime outputs (PLY, MP4)
β”œβ”€β”€ .hardware_config.json  # Persisted hardware settings
β”œβ”€β”€ pyproject.toml      # Dependencies (uv)
└── WARP.md             # This file
```

### Data Flow

```
Image β†’ load_rgb() β†’ predict_image() β†’ Gaussians3D β†’ save_ply() β†’ PLY
                                              ↓
                                      render_video() β†’ MP4
```

## Architecture

### Core Files

- `app.py` β€” Gradio UI with tabs for Run/Examples/About/Settings. Handles example discovery from `assets/examples/` via manifest.json or filename conventions.
- `model_utils.py` β€” SHARP model wrapper with checkpoint loading (HF Hub β†’ CDN fallback), inference via `predict_to_ply()`, and CUDA video rendering via `render_video()`.
- `hardware_config.py` β€” GPU hardware selection between local CUDA and HuggingFace Spaces. Persists to `.hardware_config.json`.
- `mcp_server.py` β€” MCP server exposing `sharp_predict` tool and `sharp://info` resource.

### Key Patterns

**Local CUDA mode**: Model kept on GPU by default (`SHARP_KEEP_MODEL_ON_DEVICE=1`) for better performance on dedicated GPUs.

**Spaces GPU mode**: Uses the `@spaces.GPU` decorator for dynamic GPU allocation on HuggingFace Spaces. Configurable via the Settings tab. The wrapping looks roughly like the sketch below.
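
A minimal sketch, assuming the module-level entrypoint from `model_utils.py` is the decorated function (the `duration` value is illustrative):

```python
import spaces  # the `spaces` package is available on HuggingFace Spaces

@spaces.GPU(duration=120)  # seconds of GPU time reserved per call (illustrative)
def predict_and_maybe_render_gpu(image_path: str, render_video: bool = True):
    ...
```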

**Checkpoint resolution order** (see the sketch after this list):
1. `SHARP_CHECKPOINT_PATH` env var
2. HF Hub cache
3. HF Hub download
4. Upstream CDN via `torch.hub`
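
A hedged sketch of that chain (the helper name and CDN URL are placeholders; the real logic lives in `model_utils.py`):

```python
import os
from pathlib import Path

import torch
from huggingface_hub import hf_hub_download

def resolve_checkpoint(repo_id: str, filename: str) -> Path:
    # 1. An explicit override always wins.
    override = os.environ.get("SHARP_CHECKPOINT_PATH")
    if override:
        return Path(override)
    # 2 + 3. hf_hub_download returns the cached file if present,
    # otherwise it downloads from the Hub.
    try:
        return Path(hf_hub_download(repo_id=repo_id, filename=filename))
    except Exception:
        # 4. Last resort: the upstream CDN via torch.hub utilities.
        dst = Path(torch.hub.get_dir()) / "checkpoints" / filename
        dst.parent.mkdir(parents=True, exist_ok=True)
        torch.hub.download_url_to_file("https://<upstream-cdn>/" + filename, str(dst))
        return dst
```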

**Video rendering**: Requires CUDA (gsplat). Falls back gracefully on CPU-only systems by returning `None` for the video path (guard sketched below).
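
A minimal guard, assuming a thin wrapper around `ModelWrapper.render_video()` (names and keyword pass-through are illustrative):

```python
import torch

def maybe_render_video(wrapper, gaussians, **kwargs) -> str | None:
    if not torch.cuda.is_available():
        return None  # gsplat needs CUDA; the UI then shows only the PLY
    return wrapper.render_video(gaussians, **kwargs)
```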

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `SHARP_PORT` | `49200` | Gradio server port |
| `SHARP_MCP_PORT` | `49201` | MCP server port |
| `SHARP_CHECKPOINT_PATH` | β€” | Override local checkpoint path |
| `SHARP_HF_REPO_ID` | `apple/Sharp` | HuggingFace repo |
| `SHARP_HF_FILENAME` | `sharp_2572gikvuh.pt` | Checkpoint filename |
| `SHARP_KEEP_MODEL_ON_DEVICE` | `1` | Keep model on GPU (set `0` to free VRAM) |
| `CUDA_VISIBLE_DEVICES` | β€” | GPU selection (e.g., `0` or `0,1`) |

## Gradio API

API is enabled by default. Access at `http://localhost:49200/?view=api`.

### Endpoint: `/run_sharp`

The example below uses the official `gradio_client` package, which handles the underlying HTTP protocol across Gradio versions (in Gradio 4+ the raw REST interface is a two-step call, so a plain `requests.post` to `/api/run_sharp` no longer works). Argument order matches `run_sharp()` in `app.py`.

```python
from gradio_client import Client, handle_file

client = Client("http://localhost:49200/")
video_path, ply_path, status = client.predict(
    handle_file("/path/to/image.jpg"),  # image_path (plain str if the input is a textbox)
    "rotate_forward",                   # trajectory_type
    0,                                  # output_long_side (0 = match input)
    60,                                 # num_frames
    30,                                 # fps
    True,                               # render_video
    api_name="/run_sharp",
)
```

## MCP Server

Run the MCP server for integration with AI agents:

```bash
uv run python mcp_server.py
```

### MCP Config (for clients like Warp)

```json
{
  "mcpServers": {
    "sharp": {
      "command": "uv",
      "args": ["run", "python", "mcp_server.py"],
      "cwd": "/home/robin/CascadeProjects/ml-sharp"
    }
  }
}
```

### Tools

- `sharp_predict(image_path, render_video=True, trajectory_type="rotate_forward", ...)` β€” Run inference
- `list_outputs()` β€” List generated PLY/MP4 files

### Resources

- `sharp://info` β€” GPU status, configuration
- `sharp://help` β€” Usage documentation

## Multi-GPU Configuration

Select GPU via environment variable:

```bash
# Use GPU 0 (e.g., 4090)
CUDA_VISIBLE_DEVICES=0 uv run python app.py

# Use GPU 1 (e.g., 3090)
CUDA_VISIBLE_DEVICES=1 uv run python app.py
```

## HuggingFace Spaces GPU

The app supports HuggingFace Spaces paid GPUs for faster inference or larger models. Configure via the **Settings** tab.

### Available Hardware

| Hardware | VRAM | Price/hr | Best For |
|----------|------|----------|----------|
| ZeroGPU (H200) | 70GB | Free (PRO) | Demos, dynamic allocation |
| T4 small | 16GB | $0.40 | Light workloads |
| T4 medium | 16GB | $0.60 | Standard workloads |
| L4x1 | 24GB | $0.80 | Standard inference |
| L4x4 | 96GB | $3.80 | Multi-GPU |
| L40Sx1 | 48GB | $1.80 | Large models |
| L40Sx4 | 192GB | $8.30 | Very large models |
| A10G small | 24GB | $1.00 | Balanced |
| A10G large | 24GB | $1.50 | More CPU/RAM |
| A100 large | 80GB | $2.50 | Maximum VRAM |

### Deploying to Spaces

1. Push to HuggingFace Space
2. Set hardware in Space settings (or use `suggested_hardware` in README.md)
3. The app auto-detects the Spaces environment via the `SPACE_ID` env var (sketch below)
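
Detection can be as simple as the following (the `ON_SPACES` name is illustrative):

```python
import os

ON_SPACES = os.environ.get("SPACE_ID") is not None  # Spaces sets SPACE_ID automatically
```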

### README.md Metadata for Spaces

```yaml
---
title: SHARP - 3D Gaussian Scene Prediction
emoji: πŸ”ͺ
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 6.2.0
python_version: 3.13.11
app_file: app.py
suggested_hardware: l4x1  # or zero-gpu, a100-large, etc.
startup_duration_timeout: 1h
preload_from_hub:
- apple/Sharp sharp_2572gikvuh.pt
---
```

## Examples System

Place precompiled outputs in `assets/examples/` (discovery sketch below):
- `<name>.{jpg,png,webp}` + `<name>.mp4` + `<name>.ply`
- Or define `assets/examples/manifest.json` with `{label, image, video, ply}` entries
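
The filename-convention path could be discovered roughly like this (a hedged sketch; the real `discover_examples()` in `app.py` also reads `manifest.json` and may differ in detail):

```python
from pathlib import Path

def discover_examples(root: Path = Path("assets/examples")) -> list[dict]:
    """Pair each image with its same-stem .mp4/.ply outputs."""
    examples = []
    for img in sorted(root.iterdir()):
        if img.suffix.lower() not in {".jpg", ".png", ".webp"}:
            continue
        video, ply = img.with_suffix(".mp4"), img.with_suffix(".ply")
        if video.exists() and ply.exists():
            examples.append(
                {"label": img.stem, "image": img, "video": video, "ply": ply}
            )
    return examples
```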

## Multi-Image Stacking Roadmap

SHARP predicts 3D Gaussians from a single image. To "stack" multiple images into a unified scene:

### Required Components

1. **Pose Estimation** (`multi_view.py`)
   - Estimate relative camera poses between images
   - Options: COLMAP, hloc, or PnP-based
   - Transform each prediction to common world frame

2. **Gaussian Merging** (`gaussian_merge.py`)
   - Concatenate Gaussian parameters (means, covariances, colors, opacities)
   - Deduplicate overlapping regions via density-based filtering
   - Optional: fine-tune merged scene with photometric loss

3. **UI Changes**
   - Multi-upload widget
   - Alignment preview/validation
   - Progress indicator for multi-image processing

### Data Structures

```python
from dataclasses import dataclass
from pathlib import Path

import torch

# Gaussians3D is the scene container from the upstream SHARP package.

@dataclass
class AlignedGaussians:
    gaussians: Gaussians3D
    world_transform: torch.Tensor  # 4x4 SE(3)
    source_image: Path

def merge_gaussians(aligned: list[AlignedGaussians]) -> Gaussians3D:
    # 1. Transform each Gaussian's means by world_transform
    # 2. Concatenate all parameters
    # 3. Density-based pruning in overlapping regions
    ...
```
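
A minimal sketch of steps 1 and 2 above, assuming `Gaussians3D` exposes a `means` tensor of shape (N, 3) (the real field names in the SHARP package may differ); composing rotations/covariances is noted but elided:

```python
import torch

def transform_means(means: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    """Apply a 4x4 SE(3) transform to (N, 3) Gaussian centers."""
    ones = torch.ones(means.shape[0], 1, dtype=means.dtype, device=means.device)
    homogeneous = torch.cat([means, ones], dim=1)  # (N, 4)
    return (homogeneous @ T.T)[:, :3]

def merge_means(aligned: list[AlignedGaussians]) -> torch.Tensor:
    # Step 1: move every prediction into the shared world frame.
    world_means = [
        transform_means(a.gaussians.means, a.world_transform) for a in aligned
    ]
    # Step 2: naive concatenation (Phase 1 -- no deduplication yet).
    # NOTE: quaternions/covariances must also be rotated by the pose's
    # rotation block; that step is omitted here.
    return torch.cat(world_means, dim=0)
```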

### Dependencies to Add

- `pycolmap` or `hloc` for pose estimation
- `open3d` for point cloud operations (optional)

### Implementation Phases

#### Phase 1: Basic Multi-Image Pipeline
- [ ] Add `multi_view.py` with `estimate_relative_pose(img1, img2)` using feature matching
- [ ] Add `gaussian_merge.py` with naive concatenation (no dedup)
- [ ] UI: Multi-file upload in new "Stack" tab
- [ ] Export merged PLY

#### Phase 2: Pose Estimation Options
- [ ] Integrate COLMAP sparse reconstruction for >2 images
- [ ] Add hloc (Hierarchical Localization) as lightweight alternative
- [ ] Fallback: manual pose input for known camera rigs

#### Phase 3: Gaussian Deduplication
- [ ] Implement KD-tree-based nearest-neighbor pruning (see the sketch after this list)
- [ ] Merge overlapping Gaussians by averaging parameters
- [ ] Add confidence weighting based on view angle
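
A possible starting point for the pruning item, using SciPy's `cKDTree` (a library choice assumed here, not mandated by the roadmap); the greedy keep-first policy is illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree

def prune_duplicates(means: np.ndarray, dedup_radius: float = 0.01) -> np.ndarray:
    """Return indices of Gaussians to keep, dropping near-coincident centers."""
    tree = cKDTree(means)
    keep = np.ones(len(means), dtype=bool)
    for i, neighbors in enumerate(tree.query_ball_point(means, r=dedup_radius)):
        if not keep[i]:
            continue  # already pruned by an earlier Gaussian
        for j in neighbors:
            if j != i:
                keep[j] = False  # greedy: the first-seen Gaussian wins
    return np.flatnonzero(keep)
```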

#### Phase 4: Refinement (Optional)
- [ ] Photometric loss optimization on merged scene
- [ ] Iterative alignment refinement
- [ ] Support for depth priors from stereo/MVS

### API Design

```python
from pathlib import Path
from typing import Literal

import numpy as np

# PredictionOutputs comes from model_utils; Gaussians3D from the SHARP package.
# multi_view.py
def estimate_poses(
    images: list[Path],
    method: Literal["colmap", "hloc", "pnp"] = "hloc",
) -> list[np.ndarray]:  # List of 4x4 world-to-camera transforms
    ...

# gaussian_merge.py
def merge_scenes(
    predictions: list[PredictionOutputs],
    poses: list[np.ndarray],
    deduplicate: bool = True,
    dedup_radius: float = 0.01,  # meters
) -> Gaussians3D:
    ...

# app.py (Stack tab)
def run_stack(
    images: list[str],  # Gradio multi-file upload
    pose_method: str,
    deduplicate: bool,
) -> tuple[str | None, str | None, str]:  # video, ply, status
    ...
```
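
How these pieces could compose end to end (hypothetical glue code; `predict_scene` is a stand-in for whatever produces a `PredictionOutputs` per image):

```python
from pathlib import Path

image_paths = [Path("a.jpg"), Path("b.jpg")]
predictions = [predict_scene(p) for p in image_paths]  # hypothetical per-image call
poses = estimate_poses(image_paths, method="hloc")     # world-to-camera, per the API above
merged = merge_scenes(predictions, poses, deduplicate=True, dedup_radius=0.01)
```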

### MCP Extension

```python
# mcp_server.py additions
@mcp.tool()
def sharp_stack(
    image_paths: list[str],
    pose_method: str = "hloc",
    deduplicate: bool = True,
    render_video: bool = True,
) -> dict:
    """Stack multiple images into unified 3D Gaussian scene."""
    ...
```

### Technical Considerations

**Coordinate Systems**:
- SHARP outputs Gaussians in camera-centric coordinates
- Need to transform to world frame using estimated poses
- Convention: Y-up, -Z forward (OpenGL style)

**Memory Management**:
- Each SHARP prediction uses roughly 50-200 MB of GPU memory
- Batch process with the model unloaded between predictions (loop sketch below)
- Consider a streaming merge for >10 images
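
One way the batching could look (hedged: `wrapper.model` as an attribute and moving it to CPU mirror `SHARP_KEEP_MODEL_ON_DEVICE=0`, but are assumptions about `ModelWrapper`):

```python
import gc

import torch

results = []
for path in image_paths:
    results.append(wrapper.predict_to_ply(path))  # existing entrypoint
    wrapper.model.to("cpu")    # assumed unload hook between predictions
    torch.cuda.empty_cache()   # release cached CUDA allocations
gc.collect()
```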

**Quality Metrics**:
- Reprojection error for pose validation
- Gaussian density histogram for coverage analysis
- Visual comparison with ground truth (if available)