tttoaster committed on
Commit 15e4fb8 · verified · 1 Parent(s): c365613

Update README.md

Files changed (1): README.md (+31, -7)
README.md CHANGED

[...]

<!-- [![arXiv](https://img.shields.io/badge/arXiv-2404.14396-b31b1b.svg)](https://arxiv.org/abs/2404.14396)-->

[![Demo](https://img.shields.io/badge/ARC-Demo-blue)](https://arc.tencent.com/en/ai-demos/multimodal)
[![Code](https://img.shields.io/badge/Github-Code-orange)](https://github.com/TencentARC/ARC-Hunyuan-Video-7B)
[![Static Badge](https://img.shields.io/badge/Model-Huggingface-yellow)](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B)
[![Blog](https://img.shields.io/badge/ARC-Blog-green)](https://tencentarc.github.io/posts/arc-video-announcement/)

<span style="font-size:smaller;">
Please note that in our demo, ARC-Hunyuan-Video-7B is the model consistent with the released checkpoint and the one described in the paper, while ARC-Hunyuan-Video-7B-V0 supports only video description and summarization in Chinese.
Due to API file size limits, our demo compresses input video resolutions, which may cause slight performance differences from the results reported in the paper. To reproduce the original results, please run the model locally.
</span>

## Introduction

We introduce **ARC-Hunyuan-Video-7B**, a powerful multimodal model designed for _understanding real-world short videos_.
Understanding user-generated videos is actually challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery.
To address this challenge, ARC-Hunyuan-Video-7B processes visual, audio, and textual signals end-to-end for a deep, structured understanding of video by integrating and reasoning over multimodal cues.
Stress tests show an inference time of just 10 seconds for a one-minute video on an H20 GPU, yielding an average of 500 output tokens, with inference accelerated by the vLLM framework.

Compared to prior art, we introduce a new paradigm of **Structured Video Comprehension**, with capabilities including:

[...]

  ## News

- 2025.07.25: We release the [model checkpoint](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B) and inference code of ARC-Hunyuan-Video-7B, including a [vLLM](https://github.com/vllm-project/vllm) version.
- 2025.07.25: We release the [API service](https://arc.tencent.com/zh/document/ARC-Hunyuan-Video-7B) of ARC-Hunyuan-Video-7B, which is powered by [vLLM](https://github.com/vllm-project/vllm). We release two versions: V0, which supports only video description and summarization in Chinese, and the version consistent with the model checkpoint and the one described in the paper.

## Usage

### Dependencies

- Our inference can be performed on a single NVIDIA A100 40GB GPU.
- For the vLLM deployment version, we recommend using two NVIDIA A100 40GB GPUs.
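
If you want to confirm your hardware meets these requirements before running, a quick check with PyTorch (this snippet is our illustration, not part of the repository):

```python
import torch

# List each visible GPU and its memory; plain inference expects roughly
# one 40 GB device, and the vLLM deployment two of them.
for i in range(torch.cuda.device_count()):
    p = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {p.name}, {p.total_memory / 1024**3:.1f} GiB")
```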

### Installation

Clone the repo and install dependent packages:

```bash
git clone https://github.com/TencentARC/ARC-Hunyuan-Video-7B.git
cd ARC-Hunyuan-Video-7B

# Install torch 2.6.0
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
pip install git+https://github.com/liyz15/transformers.git@arc_hunyuan_video

# Install flash-attention; choose the wheel matching your Python version
# (the wheel below is built for Python 3.11, CUDA 12, and torch 2.6).
# If you are unable to install flash-attention, you can set attn_implementation
# to "sdpa" in video_inference.py (see the sketch after this block).
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl

# (Optional) For vLLM inference, follow the instructions below.
git submodule update --init --recursive
cd model_vllm/vllm/
export SETUPTOOLS_SCM_PRETEND_VERSION="0.8.5"
wget https://wheels.vllm.ai/ed2462030f2ccc84be13d8bb2c7476c84930fb71/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
# Point pip at the pre-built wheel just downloaded
export VLLM_PRECOMPILED_WHEEL_LOCATION=$(pwd)/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
pip install --editable .
# Install flash-attention (as above) if you haven't already
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
```
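
If flash-attention cannot be installed, the suggested fallback is to switch `attn_implementation` to `"sdpa"` in `video_inference.py`. A minimal sketch of what that change might look like; the actual loading code in the script may differ, and the `AutoModel` call, dtype, and `trust_remote_code` flag here are our assumptions:

```python
import torch
from transformers import AutoModel

# "sdpa" selects PyTorch's built-in scaled_dot_product_attention kernel,
# which needs no flash-attn wheel; "flash_attention_2" requires it.
model = AutoModel.from_pretrained(
    "TencentARC/ARC-Hunyuan-Video-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # was "flash_attention_2"
    trust_remote_code=True,      # assumption: model ships custom code
)
```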

[...]

- Download [ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B), including the ViT and LLM weights, as well as the original [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3).

### Inference

Note: our model currently excels at processing short videos of up to 5 minutes. If your video is longer, we recommend following the approach used in our demo and API: split the video into segments, run inference on each segment, and then use an LLM to integrate the results (see the sketch below).
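
A rough sketch of that segment-then-integrate flow, assuming `ffmpeg` is on your PATH; the segment naming, the inference command's flags, and the integration step are illustrative assumptions, not code from this repository:

```python
import subprocess

def split_video(path: str, segment_seconds: int = 300) -> None:
    """Split a long video into ~5-minute chunks without re-encoding."""
    subprocess.run(
        [
            "ffmpeg", "-i", path,
            "-c", "copy",                       # stream copy: fast, lossless
            "-f", "segment",                    # ffmpeg's segment muxer
            "-segment_time", str(segment_seconds),
            "-reset_timestamps", "1",
            "segment_%03d.mp4",                 # segments split at keyframes
        ],
        check=True,
    )

split_video("long_video.mp4")
# Run inference on each segment (flags are hypothetical):
#   python3 video_inference.py --video segment_000.mp4
# Then pass the per-segment outputs to an LLM of your choice
# to merge them into a single coherent summary.
```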

#### Inference without vllm

[...]

We also provide access to the model via an API, which is supported by [vLLM](https://github.com/vllm-project/vllm). For details, please refer to the [documentation](https://arc.tencent.com/zh/document/ARC-Hunyuan-Video-7B).

We release two versions: V0, which supports only video description and summarization in Chinese, and the version consistent with the model checkpoint and the one described in the paper, which is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning (it supports Chinese and English videos and particularly excels at Chinese).
For videos longer than 5 minutes, we only support structured descriptions: we process these videos in 5-minute segments and use an LLM to integrate the inference results.

If you only need to understand and summarize short Chinese videos, we recommend using the V0 version.

Due to video file size limitations imposed by the deployment API, we compressed input video resolutions for our online demo and API services. Consequently, model performance in these interfaces may deviate slightly from the results reported in the paper. To reproduce the original performance, we recommend local inference.

## Future Work

[...]