Add comprehensive model card for InfCam
This PR adds a comprehensive model card for the InfCam model.
It includes:
- **Metadata**: `pipeline_tag: image-to-video`. The `library_name` and `license` tags are omitted because the available sources do not state their values.
- **Key Links**: Direct links to the paper ([https://huggingface.co/papers/2512.17040](https://huggingface.co/papers/2512.17040)), project page ([https://emjay73.github.io/InfCam/](https://emjay73.github.io/InfCam/)), and GitHub repository ([https://github.com/emjay73/InfCam](https://github.com/emjay73/InfCam)).
- **Introduction**: A brief overview of the model from the paper abstract and the "TL;DR" from the GitHub README.
- **Teaser Video**: Link to the teaser video as provided in the GitHub README.
- **Sample Usage**: A detailed "Inference" section covering environment setup, checkpoint downloads, and the specific `bash` and `python` inference commands, copied verbatim from the GitHub README so the commands stay accurate.
- **Additional Info**: Includes the camera trajectory table, acknowledgements, and BibTeX citation from the original GitHub repository.
These additions enhance discoverability on the Hugging Face Hub and provide users with comprehensive information and a clear guide to using the model.
@@ -0,0 +1,151 @@
---
pipeline_tag: image-to-video
---

# InfCam: Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation

This repository contains the InfCam model presented in the paper [Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation](https://huggingface.co/papers/2512.17040).

Project Page: [https://emjay73.github.io/InfCam/](https://emjay73.github.io/InfCam/)
Code: [https://github.com/emjay73/InfCam](https://github.com/emjay73/InfCam)

<div align="center">
[Min-Jung Kim<sup>*</sup>](https://emjay73.github.io/), [Jeongho Kim<sup>*</sup>](https://scholar.google.co.kr/citations?user=4SCCBFwAAAAJ&hl=ko), [Hoiyeong Jin](https://scholar.google.co.kr/citations?hl=ko&user=Jp-zhtUAAAAJ), [Junha Hyung](https://junhahyung.github.io/), [Jaegul Choo](https://sites.google.com/site/jaegulchoo)
<br>
*Equal Contribution
<p align="center">
<img src="https://huggingface.co/emjay73/InfCam/resolve/main/assets/GSAI_preview_image.png" width="20%" alt="GSAI Preview">
</p>
</div>

## Teaser Video
https://github.com/user-attachments/assets/1c52baf4-b5ff-417e-a6c6-c8570e667bd8

## Introduction

InfCam is a depth-free, camera-controlled video-to-video generation framework with high pose fidelity. It aims to provide creators with cinematic camera control capabilities in post-production.
**TL;DR:** Given a video and a target camera trajectory, InfCam generates a video that faithfully follows the specified camera path without requiring a depth prior.

## 🔥 Updates
- [ ] Release training code
- [x] Release inference code and model weights (2025.12.19)

## ⚙️ Code
### Inference
Step 1: Set up the environment

```
conda create -n infcam python=3.12
conda activate infcam

pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install cupy-cuda12x
pip install transformers==4.46.2
pip install sentencepiece
pip install controlnet-aux==0.0.7
pip install imageio
pip install imageio[ffmpeg]
pip install safetensors
pip install einops
pip install protobuf
pip install modelscope
pip install ftfy
pip install lpips
pip install lightning
pip install pandas
pip install matplotlib
pip install wandb
pip install ffmpeg-python
pip install numpy
pip install opencv-python
```
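Optionally, you can sanity-check the install before downloading any checkpoints. This short snippet is not part of the original setup instructions; it only confirms that the CUDA-enabled PyTorch build is active:

```python
# Optional sanity check (not from the original README):
# confirm that the CUDA-enabled PyTorch build installed above is usable.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```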

Step 2: Download the pretrained checkpoints

1. Download the pre-trained Wan2.1 model

```shell
python download_wan2.1.py
```

2. Download the pre-trained UniDepth model

Download the pre-trained weights from [huggingface](https://huggingface.co/lpiccinelli/unidepth-v2-vitl14) and place them in ```models/unidepth-v2-vitl14```.
```shell
cd models
git clone https://huggingface.co/lpiccinelli/unidepth-v2-vitl14
```

3. Download the pre-trained InfCam checkpoint

Download the pre-trained InfCam weights from [huggingface](https://huggingface.co/emjay73/InfCam/tree/main) and place them in ```models/InfCam```.

```shell
cd models
git clone https://huggingface.co/emjay73/InfCam
```
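If you prefer not to use `git clone`, the same checkpoints can be fetched programmatically. The sketch below is not from the original README and assumes `huggingface_hub` is available (it ships as a dependency of `transformers`, or can be installed with `pip install huggingface_hub`):

```python
# Alternative to the git clone commands above (illustrative sketch, not from
# the original README). Downloads both checkpoints into the models/ folder.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="lpiccinelli/unidepth-v2-vitl14",
                  local_dir="models/unidepth-v2-vitl14")
snapshot_download(repo_id="emjay73/InfCam", local_dir="models/InfCam")
```
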
Step 3: Test the example videos

```shell
bash run_inference.sh
```
or
```shell
SEED=0  # choose a random seed for --seed below; not set in the original snippet
for CAM in {1..10}; do
    CUDA_VISIBLE_DEVICES=0 python inference_infcam.py \
        --cam_type ${CAM} \
        --ckpt_path "models/InfCam/step35000.ckpt" \
        --camera_extrinsics_path "./sample_data/cameras/camera_extrinsics_10types.json" \
        --output_dir "./results/sample_data" \
        --dataset_path "./sample_data" \
        --metadata_file_name "metadata.csv" \
        --num_frames 81 --width 832 --height 480 \
        --num_inference_steps 20 \
        --zoom_factor 1.0 \
        --k_from_unidepth \
        --seed ${SEED}
done
```
This inference code requires 48 GB of memory for UniDepth and 28 GB for the InfCam pipeline.

Step 4: Test your own videos

If you want to test your own videos, you need to prepare your test data following the structure of the ```sample_data``` folder. This includes N mp4 videos, each with at least 81 frames, and a ```metadata.csv``` file that stores their paths and corresponding captions. You can refer to the [caption branch](https://github.com/emjay73/InfCam/tree/caption) for metadata.csv extraction.
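A minimal sketch of how such a ```metadata.csv``` could be assembled is shown below. It is illustrative only: the column names are assumptions, so match them to the schema used in ```sample_data/metadata.csv``` (or the caption branch) before running inference.

```python
# Sketch for preparing a metadata.csv from your own mp4 clips (illustrative).
# Column names ("video_path", "caption") are assumptions; align them with the
# schema in sample_data/metadata.csv before use.
import glob
import cv2
import pandas as pd

rows = []
for path in sorted(glob.glob("./my_data/*.mp4")):
    cap = cv2.VideoCapture(path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    if n_frames < 81:
        print(f"skip {path}: only {n_frames} frames (needs >= 81)")
        continue
    rows.append({"video_path": path, "caption": "a short description of the clip"})

pd.DataFrame(rows).to_csv("./my_data/metadata.csv", index=False)
```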

We provide several preset camera types, as shown in the table below.
These follow the ReCamMaster presets, but the starting point of each trajectory differs from that of the initial frame.

| cam_type | Trajectory |
|----------|--------------------------------|
| 1        | Pan Right                      |
| 2        | Pan Left                       |
| 3        | Tilt Up                        |
| 4        | Tilt Down                      |
| 5        | Zoom In                        |
| 6        | Zoom Out                       |
| 7        | Translate Up (with rotation)   |
| 8        | Translate Down (with rotation) |
| 9        | Arc Left (with rotation)       |
| 10       | Arc Right (with rotation)      |

## 🤗 Thank You Note
Our work is based on the following repositories.\
Thank you for your outstanding contributions!

[ReCamMaster](https://jianhongbai.github.io/ReCamMaster/): Re-captures in-the-wild videos with novel camera trajectories, and releases a multi-camera synchronized video dataset rendered with Unreal Engine 5.

[WAN2.1](https://github.com/Wan-Video/Wan2.1): A comprehensive and open suite of video foundation models.

[UniDepthV2](https://github.com/lpiccinelli-eth/UniDepth): Monocular metric depth estimation.

## 🌟 Citation

Please leave us a star 🌟 and cite our paper if you find our work helpful.
```bibtex
@article{kim2025infinite,
  title={Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation},
  author={Kim, Min-Jung and Kim, Jeongho and Jin, Hoiyeong and Hyung, Junha and Choo, Jaegul},
  journal={arXiv preprint arXiv:2512.17040},
  year={2025}
}
```