KevinNg99 committed
Commit 66b1147 · Parent(s): f11ffa6

update README

Files changed (2):
  1. README.md +43 -10
  2. README_CN.md +43 -9
README.md CHANGED
@@ -57,7 +57,9 @@ HunyuanVideo-1.5 is a video generation model that delivers top-tier quality with
 </p>
 
 ## 🔥🔥🔥 News
-* 🚀 Nov 24, 2025: We now support cache inference, achieving approximately 2x speedup! Pull the latest code to try it. 🔥🔥🔥🆕
+* 📚 Training code is coming soon. HunyuanVideo-1.5 is trained using the Muon optimizer, which we have open-sourced in the [Training](#-training) section. **If you would like to continue training our model or fine-tune it with LoRA, please use the Muon optimizer.**
+* 🚀 Nov 27, 2025: We now support cache inference (deepcache, teacache, taylorcache), achieving significant speedup! Pull the latest code to try it. 🔥🔥🔥🆕
+* 🚀 Nov 24, 2025: We now support deepcache inference.
 * 👋 Nov 20, 2025: We release the inference code and model weights of HunyuanVideo-1.5.
 
 
@@ -78,6 +80,8 @@ If you develop/use HunyuanVideo-1.5 in your projects, welcome to let us know.
 
 - **Wan2GP v9.62** - [Wan2GP](https://github.com/deepbeepmeep/Wan2GP): WanGP is a very low VRAM app (as low as 6 GB of VRAM for Hunyuan Video 1.5) that supports a LoRA accelerator for 8-step generation and offers tools to facilitate video generation.
 
+- **ComfyUI-MagCache** - [ComfyUI-MagCache](https://github.com/Zehong-Ma/ComfyUI-MagCache): MagCache is a training-free caching approach that accelerates video generation by estimating fluctuating differences among model outputs across timesteps. It achieves a 1.7x speedup for HunyuanVideo-1.5 with 20 inference steps.
+
 
 ## 📑 Open-source Plan
 - HunyuanVideo-1.5 (T2V/I2V)
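A note for readers comparing the cache options in this commit: MagCache, TeaCache, and the new `--cache_type` modes all share one idea, namely skipping the expensive transformer evaluation at timesteps where its output is predicted to barely change. The sketch below is a toy illustration of such a policy; the class name and threshold are hypothetical and do not reflect the actual API of this repository or of ComfyUI-MagCache.

```python
import torch

class ResidualCache:
    """Toy timestep-caching policy in the spirit of TeaCache/MagCache:
    reuse the last output when the block input has changed little since
    the previous full evaluation. All names here are hypothetical."""

    def __init__(self, rel_threshold: float = 0.1):
        self.rel_threshold = rel_threshold
        self.last_input = None
        self.last_output = None

    def step(self, x: torch.Tensor, forward_fn) -> torch.Tensor:
        if self.last_input is not None:
            # Relative L1 change of the block input across timesteps.
            rel_change = (x - self.last_input).abs().mean() / (
                self.last_input.abs().mean() + 1e-8
            )
            if rel_change < self.rel_threshold:
                return self.last_output  # cheap path: skip the transformer
        out = forward_fn(x)  # expensive path: full forward pass
        self.last_input, self.last_output = x.detach(), out.detach()
        return out
```

The real methods differ in what they measure (MagCache tracks magnitude ratios of successive outputs; deepcache skips interior blocks rather than whole steps, which is what `--no_cache_block_id` below controls), but the skip-or-recompute structure is the same.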
@@ -105,6 +109,7 @@ If you develop/use HunyuanVideo-1.5 in your projects, welcome to let us know.
 - [Command Line Arguments](#command-line-arguments)
 - [Optimal Inference Configurations](#optimal-inference-configurations)
 - [🧱 Models Cards](#-models-cards)
+- [🎓 Training](#-training)
 - [🎬 More Examples](#-more-examples)
 - [📊 Evaluation](#-evaluation)
 - [📚 Citation](#-citation)
@@ -226,20 +231,22 @@ export I2V_REWRITE_MODEL_NAME="<your_model_name>"
 
 PROMPT='A girl holding a paper with words "Hello, world!"'
 
-IMAGE_PATH=./data/reference_image.png # Optional, 'none' or <image path>
+IMAGE_PATH=none # Optional, none or <image path> to enable i2v mode
 SEED=1
 ASPECT_RATIO=16:9
 RESOLUTION=480p
 OUTPUT_PATH=./outputs/output.mp4
 
 # Configuration
+REWRITE=true # Enable prompt rewriting. Please ensure the rewrite vLLM server is deployed and configured.
 N_INFERENCE_GPU=8 # Parallel inference GPU count
 CFG_DISTILLED=true # Inference with CFG distilled model, 2x speedup
 SPARSE_ATTN=false # Inference with sparse attention (only 720p models are equipped with sparse attention). Please ensure flex-block-attn is installed
 SAGE_ATTN=true # Inference with SageAttention
-REWRITE=true # Enable prompt rewriting. Please ensure the rewrite vLLM server is deployed and configured.
 OVERLAP_GROUP_OFFLOADING=true # Only valid when group offloading is enabled; significantly increases CPU memory usage but speeds up inference
 ENABLE_CACHE=true # Enable feature cache during inference. Significantly speeds up inference.
+CACHE_TYPE=deepcache # Supported: deepcache, teacache, taylorcache
+ENABLE_SR=true # Enable super resolution
 MODEL_PATH=ckpts # Path to pretrained model
 
 torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
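As the new `IMAGE_PATH` comment notes, the same script covers both modes; a minimal sketch of the toggle (the reference path is the illustrative one from the previous revision):

```bash
# t2v mode (default in this revision): no reference image.
IMAGE_PATH=none
# i2v mode: point IMAGE_PATH at a reference image to condition generation.
IMAGE_PATH=./data/reference_image.png
```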
@@ -248,14 +255,13 @@ torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
     --resolution $RESOLUTION \
     --aspect_ratio $ASPECT_RATIO \
     --seed $SEED \
-    --cfg_distilled $CFG_DISTILLED \
-    --sparse_attn $SPARSE_ATTN \
-    --use_sageattn $SAGE_ATTN \
-    --enable_cache $ENABLE_CACHE \
     --rewrite $REWRITE \
-    --output_path $OUTPUT_PATH \
+    --cfg_distilled $CFG_DISTILLED \
+    --sparse_attn $SPARSE_ATTN --use_sageattn $SAGE_ATTN \
+    --enable_cache $ENABLE_CACHE --cache_type $CACHE_TYPE \
     --overlap_group_offloading $OVERLAP_GROUP_OFFLOADING \
-    --save_pre_sr_video \
+    --sr $ENABLE_SR --save_pre_sr_video \
+    --output_path $OUTPUT_PATH \
     --model_path $MODEL_PATH
 ```
 
@@ -295,8 +301,9 @@ torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
 | `--dtype` | str | No | `bf16` | Data type for transformer: `bf16` (faster, lower memory) or `fp32` (better quality, slower, higher memory) |
 | `--use_sageattn` | bool | No | `false` | Enable SageAttention (use `--use_sageattn` or `--use_sageattn true/1` to enable, `--use_sageattn false/0` to disable) |
 | `--sage_blocks_range` | str | No | `0-53` | SageAttention blocks range (e.g., `0-5` or `0,1,2,3,4,5`) |
-| `--enable_torch_compile` | bool | No | `false` | Enable torch compile for transformer (use `--enable_torch_compile` or `--enable_torch_compile true/1` to enable, `--enable_torch_compile false/0` to disable) |
 | `--enable_cache` | bool | No | `false` | Enable cache for transformer (use `--enable_cache` or `--enable_cache true/1` to enable, `--enable_cache false/0` to disable) |
+| `--cache_type` | str | No | `deepcache` | Cache type for transformer (one of `deepcache`, `teacache`, `taylorcache`) |
+| `--no_cache_block_id` | str | No | `53` | Blocks to exclude from deepcache (e.g., `0-5` or `0,1,2,3,4,5`) |
 | `--cache_start_step` | int | No | `11` | Start step to skip when using cache |
 | `--cache_end_step` | int | No | `45` | End step to skip when using cache |
 | `--total_steps` | int | No | `50` | Total inference steps |
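Putting the new cache arguments together: a cache-enabled invocation consistent with the table above might look like the sketch below. It reuses the flag spellings shown elsewhere in this diff; the `--prompt` flag name is an assumption, since the opening lines of the `torchrun` command fall outside these hunks.

```bash
# Hypothetical single-GPU run with TeaCache-style caching enabled.
# Cache flags and defaults are taken from the argument table above;
# --prompt is assumed (that part of the script is not shown in this diff).
torchrun --nproc_per_node=1 generate.py \
    --prompt 'A girl holding a paper with words "Hello, world!"' \
    --resolution 480p --aspect_ratio 16:9 --seed 1 \
    --enable_cache true --cache_type teacache \
    --cache_start_step 11 --cache_end_step 45 --total_steps 50 \
    --output_path ./outputs/output.mp4 \
    --model_path ckpts
```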
@@ -344,6 +351,32 @@ The following table provides the optimal inference configurations (CFG scale, em
 
 
 
+## 🎓 Training
+
+> 💡 Training code is coming soon. We will release the complete training pipeline in the future.
+
+HunyuanVideo-1.5 is trained using the **Muon optimizer**, which accelerates convergence and improves training stability. The Muon optimizer combines momentum-based updates with Newton-Schulz orthogonalization for efficient optimization of large-scale video generation models.
+
+### Creating a Muon Optimizer
+
+Here's how to create a Muon optimizer for your model:
+
+```python
+from hyvideo.optim.muon import get_muon_optimizer
+
+# Create Muon optimizer for your model
+optimizer = get_muon_optimizer(
+    model=your_model,
+    lr=lr,                        # Learning rate
+    weight_decay=weight_decay,    # Weight decay
+    momentum=momentum,            # Momentum coefficient
+    adamw_betas=adamw_betas,      # AdamW betas for 1D parameters
+    adamw_eps=adamw_eps           # AdamW epsilon
+)
+```
+
+> 📝 **To be continued**: More training details and the complete training pipeline will be released soon. Stay tuned!
+
 ## 🎬 More Examples
 |Features|Demo1|Demo2|
 |------|------|------|
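The Training hunk names momentum-based updates plus Newton-Schulz orthogonalization without showing the mechanics. Below is a minimal sketch of one Muon update on a single 2D weight, using the quintic Newton-Schulz coefficients from the publicly available reference implementation of Muon; the internals of `hyvideo.optim.muon` may differ, and `muon_step` is a hypothetical helper, not part of this repository.

```python
import torch

# Quintic Newton-Schulz coefficients from the public Muon reference code.
NS_COEFFS = (3.4445, -4.7750, 2.0315)

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately replace a 2D update matrix with the nearest
    semi-orthogonal matrix via an iterated quintic polynomial."""
    a, b, c = NS_COEFFS
    X = G / (G.norm() + 1e-7)  # spectral norm <= Frobenius norm <= 1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T  # iterate on the wide orientation so X @ X.T stays small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(p: torch.Tensor, buf: torch.Tensor, lr: float = 0.02,
              momentum: float = 0.95, weight_decay: float = 0.0) -> None:
    """One Muon update for a single 2D weight: momentum accumulation,
    orthogonalization, then a decoupled-weight-decay parameter step.
    (Reference implementations also rescale the update by a
    shape-dependent factor, omitted here for brevity.)"""
    buf.mul_(momentum).add_(p.grad)
    update = newton_schulz_orthogonalize(buf)
    p.data.mul_(1.0 - lr * weight_decay)
    p.data.add_(update, alpha=-lr)
```

This also explains the constructor signature in the hunk: 1D parameters (biases, norms) cannot be orthogonalized as matrices, so `get_muon_optimizer` routes them to AdamW, hence the `adamw_betas` and `adamw_eps` arguments.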
 
README_CN.md CHANGED
@@ -40,7 +40,9 @@ As a lightweight video generation model, HunyuanVideo-1.5 needs only 8.3B parameters to
 </p>
 
 ## 🔥🔥🔥 News
-* 🚀 Nov 24, 2025: We now support cache inference, achieving roughly 2x speedup! Pull the latest code to try it. 🔥🔥🔥🆕
+* 📚 Training code is coming soon. HunyuanVideo-1.5 is trained using the Muon optimizer, which we have open-sourced in the [Training](#-training) section. **If you would like to continue training our model or fine-tune it with LoRA, please use the Muon optimizer.**
+* 🚀 Nov 27, 2025: We now support cache inference (deepcache, teacache, taylorcache), which greatly accelerates inference! Pull the latest code to try it. 🔥🔥🔥🆕
+* 🚀 Nov 24, 2025: We now support deepcache inference.
 * 👋 Nov 20, 2025: We open-sourced the code and inference weights of HunyuanVideo-1.5
 
 ## 🎥 Demo Videos
@@ -60,6 +62,8 @@ As a lightweight video generation model, HunyuanVideo-1.5 needs only 8.3B parameters to
 
 - **Wan2GP v9.62** - [Wan2GP](https://github.com/deepbeepmeep/Wan2GP): Wan2GP is a very low-VRAM app (as little as 6 GB of VRAM for Hunyuan Video 1.5) that supports a LoRA accelerator for 8-step generation and provides a variety of video generation helper tools.
 
+- **ComfyUI-MagCache** - [ComfyUI-MagCache](https://github.com/Zehong-Ma/ComfyUI-MagCache): MagCache is a training-free caching approach that accelerates video generation by estimating fluctuating differences among model outputs across timesteps. With 20 inference steps, it achieves a 1.7x speedup for HunyuanVideo-1.5.
+
 
 ## 📑 Open-source Plan
 - HunyuanVideo-1.5 (T2V/I2V)
@@ -86,6 +90,7 @@ As a lightweight video generation model, HunyuanVideo-1.5 needs only 8.3B parameters to
 - [Command Line Arguments](#命令行参数)
 - [Optimal Inference Configurations](#最优推理配置)
 - [🧱 Model Cards](#-模型卡片)
+- [🎓 Training](#-训练)
 - [🎬 More Examples](#-更多示例)
 - [📊 Evaluation](#-性能评估)
 - [📚 Citation](#-引用)
@@ -212,20 +217,22 @@ export I2V_REWRITE_MODEL_NAME="<your_model_name>"
 
 PROMPT='A girl holding a paper with words "Hello, world!"'
 
-IMAGE_PATH=./data/reference_image.png # Optional, 'none' or <image path>
+IMAGE_PATH=none # Optional, none or <image path> to enable i2v mode
 SEED=1
 ASPECT_RATIO=16:9
 RESOLUTION=480p
 OUTPUT_PATH=./outputs/output.mp4
 
 # Configuration
+REWRITE=true # Enable prompt rewriting. Please ensure the rewrite vLLM server is deployed and configured.
 N_INFERENCE_GPU=8 # Parallel inference GPU count
 CFG_DISTILLED=true # Inference with the CFG-distilled model, 2x speedup
 SPARSE_ATTN=false # Inference with sparse attention (only 720p models are equipped with sparse attention). Please ensure flex-block-attn is installed
 SAGE_ATTN=true # Inference with SageAttention
-REWRITE=true # Enable prompt rewriting. Please ensure the rewrite vLLM server is deployed and configured.
 OVERLAP_GROUP_OFFLOADING=true # Only valid when group offloading is enabled; significantly increases CPU memory usage but speeds up inference
 ENABLE_CACHE=true # Enable feature cache during inference. Significantly speeds up inference
+CACHE_TYPE=deepcache # Supported: deepcache, teacache, taylorcache
+ENABLE_SR=true # Enable super resolution
 MODEL_PATH=ckpts # Path to pretrained model
 
 torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
@@ -234,14 +241,13 @@ torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
     --resolution $RESOLUTION \
     --aspect_ratio $ASPECT_RATIO \
     --seed $SEED \
-    --cfg_distilled $CFG_DISTILLED \
-    --sparse_attn $SPARSE_ATTN \
-    --use_sageattn $SAGE_ATTN \
-    --enable_cache $ENABLE_CACHE \
     --rewrite $REWRITE \
-    --output_path $OUTPUT_PATH \
+    --cfg_distilled $CFG_DISTILLED \
+    --sparse_attn $SPARSE_ATTN --use_sageattn $SAGE_ATTN \
+    --enable_cache $ENABLE_CACHE --cache_type $CACHE_TYPE \
     --overlap_group_offloading $OVERLAP_GROUP_OFFLOADING \
-    --save_pre_sr_video \
+    --sr $ENABLE_SR --save_pre_sr_video \
+    --output_path $OUTPUT_PATH \
     --model_path $MODEL_PATH
 ```
 
@@ -282,6 +288,8 @@ torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
 | `--sage_blocks_range` | str | No | `0-53` | SageAttention blocks range (e.g., `0-5` or `0,1,2,3,4,5`) |
 | `--enable_torch_compile` | bool | No | `false` | Enable torch compile for transformer (use `--enable_torch_compile` or `--enable_torch_compile true/1` to enable, `--enable_torch_compile false/0` to disable) |
 | `--enable_cache` | bool | No | `false` | Enable cache for transformer (use `--enable_cache` or `--enable_cache true/1` to enable, `--enable_cache false/0` to disable) |
+| `--cache_type` | str | No | `deepcache` | Cache type for transformer (one of `deepcache`, `teacache`, `taylorcache`) |
+| `--no_cache_block_id` | str | No | `53` | Blocks to exclude from deepcache (e.g., `0-5` or `0,1,2,3,4,5`) |
 | `--cache_start_step` | int | No | `11` | Start step to skip when using cache |
 | `--cache_end_step` | int | No | `45` | End step to skip when using cache |
 | `--total_steps` | int | No | `50` | Total inference steps |
@@ -329,6 +337,32 @@ torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
 
 
 
+## 🎓 Training
+
+> 💡 Training code is coming soon. We will release the complete training pipeline in the future.
+
+HunyuanVideo-1.5 is trained using the **Muon optimizer**, which accelerates convergence and improves training stability. The Muon optimizer combines momentum-based updates with Newton-Schulz orthogonalization for efficient optimization of large-scale video generation models.
+
+### Creating a Muon Optimizer
+
+Here's how to create a Muon optimizer for your model:
+
+```python
+from hyvideo.optim.muon import get_muon_optimizer
+
+# Create Muon optimizer for your model
+optimizer = get_muon_optimizer(
+    model=your_model,
+    lr=lr,                        # Learning rate
+    weight_decay=weight_decay,    # Weight decay
+    momentum=momentum,            # Momentum coefficient
+    adamw_betas=adamw_betas,      # AdamW betas for 1D parameters
+    adamw_eps=adamw_eps           # AdamW epsilon
+)
+```
+
+> 📝 **To be continued**: More training details and the complete training pipeline will be released soon. Stay tuned!
+
 ## 🎬 More Examples
 |Features|Demo1|Demo2|
 |------|------|------|
 