---
license: apache-2.0
---
<p align="center"><h1 align="center">
Bridging the Last Mile: Deploying Hummingbird-XT for Efficient Video Generation on AMD Consumer-Grade Platforms
</h1>
</p>

<p align="center">
<h3 align="center"><a href="https://rocm.blogs.amd.com/artificial-intelligence/hummingbirdxt/README.html">Blog</a> | <a href="https://github.com/AMD-AGI/HummingbirdXT">Code</a></h3>
</p>


In this work, we present **AMD Hummingbird-XT**, an efficient **DiT-based** video generation model with **5B parameters**, designed for high-quality video generation on client-grade GPUs.
 
Hummingbird-XT is trained from Wan2.2-5B-TI2V using **DMD step distillation** with carefully designed **data curation**, enabling **3-step generation** while preserving high visual fidelity and motion quality. To reduce the computational overhead of high-resolution video decoding in 3D convolution–based VAE decoders, we introduce a **lightweight and efficient VAE decoder** that replaces part of the 3D convolutions with depthwise separable convolutions. Additionally, to extend the length of generated videos, we introduce **Hummingbird-XTX**, an efficient **autoregressive model** for **long-video generation** based on Wan-2.1-1.3B.
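
To give a rough sense of why swapping 3D convolutions for depthwise separable ones lightens the decoder, the sketch below compares parameter counts for a single layer. The channel width (256) and the 3×3×3 kernel are illustrative assumptions, not the actual Hummingbird-XT decoder configuration, and the helper names are hypothetical.

```python
# Illustrative parameter-count comparison; layer sizes are assumed, not taken
# from the Hummingbird-XT VAE decoder.

def conv3d_params(c_in, c_out, kernel=(3, 3, 3), bias=True):
    """Parameters of a standard 3D convolution."""
    kt, kh, kw = kernel
    return c_in * c_out * kt * kh * kw + (c_out if bias else 0)


def depthwise_separable_conv3d_params(c_in, c_out, kernel=(3, 3, 3), bias=True):
    """Parameters of a depthwise 3D conv followed by a 1x1x1 pointwise conv."""
    kt, kh, kw = kernel
    # Depthwise: one kt*kh*kw filter per input channel.
    depthwise = c_in * kt * kh * kw + (c_in if bias else 0)
    # Pointwise: 1x1x1 convolution that mixes channels.
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


standard = conv3d_params(256, 256)
separable = depthwise_separable_conv3d_params(256, 256)
print(f"standard 3D conv:     {standard:,} parameters")
print(f"depthwise separable:  {separable:,} parameters")
print(f"reduction factor:     {standard / separable:.1f}x")
```

For this assumed layer, the separable variant needs roughly 24× fewer parameters. The end-to-end decoding speedup depends on which layers are replaced and on the hardware, so this ratio should not be read as a measured result.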
 
As a result, Hummingbird-XT achieves a **33×** speedup on Strix Halo iGPU and a **40×** speedup on AMD Instinct™ MI325, and supports generating **121-frame** videos at **720×1280** resolution across both server-grade (AMD Instinct™ MI300 and AMD Instinct™ MI325) and client-grade (Strix Halo and Navi48) devices. Quantitative results on the VBench-T2V and VBench-I2V benchmarks show that Hummingbird-XT achieves competitive performance compared to the original **Wan2.2-5B-TI2V** model.
 
The training and inference code is fully released at [Hummingbird-XT](https://github.com/AMD-AGI/HummingbirdXT), and the technical details are described in [Bridging the Last Mile: Deploying Hummingbird-XT for Efficient Video Generation on AMD Consumer-Grade Platforms](https://rocm.blogs.amd.com/artificial-intelligence/hummingbirdxt/README.html).
 
 
 
 
<p align="center"><strong>Hummingbird-XT Text-to-Video Showcases</strong></p>
<table style="width: 90%; max-width: 900px; margin: 20px auto; border-collapse: separate; border-spacing: 0; box-shadow: 0 2px 8px rgba(0,0,0,0.1); border-radius: 8px; overflow: hidden; font-family: Arial, sans-serif;">
  <thead style="background-color: #f5f5f5;">
    <tr>
      <th style="width: 30%; padding: 12px; text-align: left; font-weight: bold;">Caption</th>
      <th style="width: 70%; padding: 12px; text-align: center; font-weight: bold;">Video</th>
    </tr>
  </thead>
  <tbody>
    <tr style="border-top: 1px solid #ddd;">
      <td style="padding: 12px; max-height: 150px; overflow: hidden; text-overflow: ellipsis; display: -webkit-box; -webkit-line-clamp: 6; -webkit-box-orient: vertical; vertical-align: top;">
        Animated scene features a close-up of a short fluffy monster kneeling beside a melting red candle. The art style is 3D and realistic, with a focus on lighting and texture. The mood of the painting is one of wonder and curiosity, as the monster gazes at the flame with wide eyes and open mouth. Its pose and expression convey a sense of innocence and playfulness, as if it is exploring the world around it for the first time. The use of warm colors and dramatic lighting further enhances the cozy atmosphere of the image.
      </td>
      <td style="padding: 12px; text-align: center;">
        <video src="https://github.com/user-attachments/assets/97beef02-ed76-4635-8b36-a296c227cab1" width="100%" controls autoplay loop muted style="border-radius: 6px;"></video>
      </td>
    </tr>
    <tr style="border-top: 1px solid #ddd;">
      <td style="padding: 12px; max-height: 150px; overflow: hidden; text-overflow: ellipsis; display: -webkit-box; -webkit-line-clamp: 6; -webkit-box-orient: vertical; vertical-align: top;">
        A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
      </td>
      <td style="padding: 12px; text-align: center;">
        <video src="https://github.com/user-attachments/assets/6698d25f-e839-4acd-b5cd-af8f325d37fc" width="100%" controls autoplay loop muted style="border-radius: 6px;"></video>
      </td>
    </tr>
    <tr style="border-top: 1px solid #ddd;">
      <td style="padding: 12px; max-height: 150px; overflow: hidden; text-overflow: ellipsis; display: -webkit-box; -webkit-line-clamp: 6; -webkit-box-orient: vertical; vertical-align: top;">
        The young East Asian man with short black hair, fair skin, and monolid eyes looks ahead. A young East Asian woman with long black hair and fair skin turns to smile warmly at him. The background is blurred, focusing on their shared gaze. Realistic cinematic style.
      </td>
      <td style="padding: 12px; text-align: center;">
        <video src="https://github.com/user-attachments/assets/064c242a-4ee5-429e-9a9b-9b12df076c96" width="100%" controls autoplay loop muted style="border-radius: 6px;"></video>
      </td>
    </tr>
  </tbody>
</table>

<p align="center"><strong>Hummingbird-XT Image-to-Video Showcases</strong></p>
<table style="width: 90%; max-width: 900px; margin: 20px auto; border-collapse: separate; border-spacing: 0; box-shadow: 0 2px 8px rgba(0,0,0,0.1); border-radius: 8px; overflow: hidden; font-family: Arial, sans-serif;">
  <thead style="background-color: #f5f5f5;">
    <tr>
      <th style="width: 30%; padding: 12px; text-align: left; font-weight: bold;">Caption</th>
      <th style="width: 70%; padding: 12px; text-align: center; font-weight: bold;">Video</th>
    </tr>
  </thead>
  <tbody>
    <tr style="border-top: 1px solid #ddd;">
      <td style="padding: 12px; max-height: 150px; overflow: hidden; text-overflow: ellipsis; display: -webkit-box; -webkit-line-clamp: 6; -webkit-box-orient: vertical; vertical-align: top;">
        A back-view close-up focusing on the runner’s feet striking the track. Only subtle movement occurs: his steps land firmly, kicking up a small amount of dust or rubber granules. The camera stays low and straight-on behind him, following smoothly with minimal shake. The sunlight is bright, with long shadows stretching forward.
      </td>
      <td style="padding: 12px; text-align: center;">
        <video src="https://github.com/user-attachments/assets/d01d9fe7-bebe-4f0b-902a-3e913d93df1d" width="100%" controls autoplay loop muted style="border-radius: 6px;"></video>
      </td>
    </tr>
    <tr style="border-top: 1px solid #ddd;">
      <td style="padding: 12px; max-height: 150px; overflow: hidden; text-overflow: ellipsis; display: -webkit-box; -webkit-line-clamp: 6; -webkit-box-orient: vertical; vertical-align: top;">
        A graceful woman stands under a majestic sandstone arch, forming a small heart shape with her fingers close to the camera while smiling warmly and radiating joy. Behind her, a smooth and elegant fountain rises gracefully, its water reflecting the warm, inviting courtyard walls in a mirror-like fashion.
      </td>
      <td style="padding: 12px; text-align: center;">
        <video src="https://github.com/user-attachments/assets/d4197430-13e7-46d9-b2a9-80df0aee491d" width="100%" controls autoplay loop muted style="border-radius: 6px;"></video>
      </td>
    </tr>
    <tr style="border-top: 1px solid #ddd;">
      <td style="padding: 12px; max-height: 150px; overflow: hidden; text-overflow: ellipsis; display: -webkit-box; -webkit-line-clamp: 6; -webkit-box-orient: vertical; vertical-align: top;">
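        On stage, a man plays an electric guitar made of lightning. As the music swells, sparks crackle around him. Suddenly, the dazzling light turns dark red, his eyes give off an eerie glow, and black wings unfurl from his back. His skin darkens, lightning coils around his body, and he transforms into a demon, standing amid billowing smoke and thunder.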
      </td>
      <td style="padding: 12px; text-align: center;">
        <video src="https://github.com/user-attachments/assets/7fc77ead-cc5e-4a98-b678-21ed80b91e8c" width="100%" controls autoplay loop muted style="border-radius: 6px;"></video>
      </td>
    </tr>
  </tbody>
</table>

<p align="center"><strong>Hummingbird-XTX 20-Second Video Showcases</strong></p>
<table style="width: 90%; max-width: 900px; margin: 20px auto; border-collapse: separate; border-spacing: 0; box-shadow: 0 2px 8px rgba(0,0,0,0.1); border-radius: 8px; overflow: hidden; font-family: Arial, sans-serif;">
  <thead style="background-color: #f5f5f5;">
    <tr>
      <th style="width: 30%; padding: 12px; text-align: left; font-weight: bold;">Caption</th>
      <th style="width: 70%; padding: 12px; text-align: center; font-weight: bold;">Video</th>
    </tr>
  </thead>
  <tbody>
    <tr style="border-top: 1px solid #ddd;">
      <td style="padding: 12px; max-height: 150px; overflow: hidden; text-overflow: ellipsis; display: -webkit-box; -webkit-line-clamp: 6; -webkit-box-orient: vertical; vertical-align: top;">
        Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.
      </td>
      <td style="padding: 12px; text-align: center;">
        <video src="https://github.com/user-attachments/assets/8dcab976-6ac6-419e-82f9-71a0b8d8fe7e" width="100%" controls autoplay loop muted style="border-radius: 6px;"></video>
      </td>
    </tr>
    <tr style="border-top: 1px solid #ddd;">
      <td style="padding: 12px; max-height: 150px; overflow: hidden; text-overflow: ellipsis; display: -webkit-box; -webkit-line-clamp: 6; -webkit-box-orient: vertical; vertical-align: top;">
        A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.
      </td>
      <td style="padding: 12px; text-align: center;">
        <video src="https://github.com/user-attachments/assets/6c1e0e92-8521-4402-8652-35b87efac7ed" width="100%" controls autoplay loop muted style="border-radius: 6px;"></video>
      </td>
    </tr>
    <tr style="border-top: 1px solid #ddd;">
      <td style="padding: 12px; max-height: 150px; overflow: hidden; text-overflow: ellipsis; display: -webkit-box; -webkit-line-clamp: 6; -webkit-box-orient: vertical; vertical-align: top;">
        A cinematic wide portrait of a man with his face lit by the glow of a TV.
      </td>
      <td style="padding: 12px; text-align: center;">
        <video src="https://github.com/user-attachments/assets/49465535-34b5-49f4-925c-ca3379b92dc1" width="100%" controls autoplay loop muted style="border-radius: 6px;"></video>
      </td>
    </tr>
  </tbody>
</table>