# VGT: Visual Generation Tuning
**_Unleashing Visual Generation Capabilities from Any Pretrained VLM_**

This repository hosts models from the paper Visual Generation Tuning.
VGT (Visual Generation Tuning) is a paradigm that unlocks the latent visual generation capabilities of any pretrained Vision-Language Model (VLM). It substantially reduces alignment cost and accelerates the convergence of autoregressive modeling in continuous latent space, enabling efficient, high-quality image generation from text descriptions.
- Paper: [Visual Generation Tuning](https://arxiv.org/abs/2511.23469)
- Code: https://github.com/hustvl/VGT
## Highlights
- Novel Paradigm: Transform ANY pretrained Vision-Language Model into a powerful image generator through efficient visual generation tuning
- 20× Speedup: Achieve dramatically faster convergence compared to vanilla VAE-based autoregressive models
- SOTA Performance: GenEval 0.83 and DPG-Bench 81.28 with minimal training data
- Extreme Data Efficiency: Reach GenEval 0.55 in just 10K iterations and 0.60 in 30K iterations
- Parallel Inference: QueryAR mechanism enables 16× parallel decoding while maintaining high-quality generation
- Superior Reconstruction: 26.67 PSNR and 0.50 rFID at a 28× compression ratio, outperforming specialized VAEs (see the PSNR reference below)
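For reference, the PSNR number above is the standard peak signal-to-noise ratio between an input image and its reconstruction. A minimal sketch in PyTorch, assuming images scaled to [0, 1], is:

```python
import torch

def psnr(original: torch.Tensor, reconstruction: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images scaled to [0, max_val]."""
    mse = torch.mean((original - reconstruction) ** 2)
    return (10 * torch.log10(max_val ** 2 / mse)).item()

# Example with a random 448x448 RGB image and a slightly perturbed "reconstruction".
x = torch.rand(3, 448, 448)
x_hat = (x + 0.01 * torch.randn_like(x)).clamp(0, 1)
print(f"PSNR: {psnr(x, x_hat):.2f} dB")
```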
## What is VGT?
VGT answers a fundamental question:

> Can we directly leverage the well-aligned semantic representations in pretrained VLMs to enable visual generation capabilities?
VGT bridges this gap through two key innovations:
1. **VGT-AE (Visual Generation Tuning - AutoEncoder)**
   - Aligns the semantic encoder of a pretrained VLM with the latent representations of a pixel decoder
   - Achieves 26.67 PSNR and 0.50 rFID at 28× compression, outperforming specialized VAEs
2. **VGT-AR (Visual Generation Tuning - AutoRegressive)**
   - Position-query mechanism (QueryAR) for an autoregressive formulation with partial parallel decoding (see the sketch after this list)
   - Dramatically accelerates convergence (20× speedup) compared to vanilla VAE-based models
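To make the position-query idea concrete, the toy sketch below feeds learnable position queries for a whole group of latent positions in a single forward pass, so several continuous tokens are emitted at once while still conditioning on everything decoded so far. This is a rough illustration under our own naming assumptions (`ToyQueryARDecoder`, `group_size`, the transformer-decoder layout), not the repository's VGT-AR implementation.

```python
import torch
import torch.nn as nn

class ToyQueryARDecoder(nn.Module):
    """Toy position-query decoder: groups of latent positions are decoded in parallel.

    All names, shapes, and module choices here are illustrative assumptions,
    not the VGT-AR architecture.
    """

    def __init__(self, dim: int = 256, num_tokens: int = 256, group_size: int = 16):
        super().__init__()
        self.group_size = group_size  # latent tokens emitted per decoding step
        # One learnable query per spatial position of the latent grid.
        self.pos_queries = nn.Parameter(torch.randn(num_tokens, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.to_latent = nn.Linear(dim, dim)  # predicts a continuous latent per queried position

    @torch.no_grad()
    def generate(self, text_cond: torch.Tensor) -> torch.Tensor:
        """text_cond: (B, T, dim) conditioning states; returns (B, num_tokens, dim) latents."""
        bsz = text_cond.size(0)
        generated = []  # groups of latents decoded so far
        for start in range(0, self.pos_queries.size(0), self.group_size):
            # Queries for the next group of positions are processed together,
            # so one forward pass emits `group_size` tokens at once.
            queries = self.pos_queries[start:start + self.group_size].expand(bsz, -1, -1)
            context = torch.cat([text_cond] + generated, dim=1) if generated else text_cond
            hidden = self.decoder(tgt=queries, memory=context)
            generated.append(self.to_latent(hidden))
        return torch.cat(generated, dim=1)

decoder = ToyQueryARDecoder()
latents = decoder.generate(torch.randn(1, 32, 256))  # fake text conditioning
print(latents.shape)  # torch.Size([1, 256, 256])
```

With `group_size=16`, a 256-token latent grid is produced in 16 decoding steps rather than 256, which is the kind of partial parallelism the 16× figure above refers to; the actual VGT-AR conditioning, tokenizer, and decoder differ.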
## Getting Started
### Installation
```bash
# Clone the repository
git clone https://github.com/hustvl/VGT.git
cd VGT

# Install dependencies
conda create -n vgt python=3.10
conda activate vgt
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install mmengine xtuner tqdm timm
pip install diffusers transformers==4.57.1
pip install flash-attn --no-build-isolation
```
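Optionally, a quick sanity check (not part of the official instructions) that the main dependencies import and a GPU is visible:

```python
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)

try:
    import flash_attn  # built above with --no-build-isolation
    print("flash-attn imported OK")
except ImportError:
    print("flash-attn not importable; re-check the install step above")
```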
### Pretrained Models
We provide VGT-tuned models based on Qwen2.5-VL and InternVL3 (448px):
| Model | Base Model | GenEval | DPG-Bench | Download |
|---|---|---|---|---|
| VGT-InternVL3-1.6B-Pretrain | InternVL3-1.6B | 0.58 | 73.05 | HuggingFace |
| VGT-InternVL3-1.6B-SFT | InternVL3-1.6B | 0.83 | 76.33 | HuggingFace |
| VGT-Qwen2.5-VL-2B-Pretrain | Qwen2.5-VL-2B | 0.63 | 78.02 | HuggingFace |
| VGT-Qwen2.5-VL-2B-SFT | Qwen2.5-VL-2B | 0.83 | 81.28 | HuggingFace |
### Inference
Download the SFT model checkpoints:
```bash
cd VGT
mkdir ckpts
hf download hustvl/vgt_qwen25vl_2B_sft --repo-type model --local-dir ckpts/hustvl/vgt_qwen25vl_2B_sft
hf download hustvl/vgt_internvl3_1_6B_sft --repo-type model --local-dir ckpts/hustvl/vgt_internvl3_1_6B_sft
```
Generate images from text prompts:
```bash
export PYTHONPATH=./:$PYTHONPATH
# Generate with InternVL3-1.6B
python scripts/sample_text_list_vgt_intervl3_0.6B.py
```
Note: Under the same training recipe, we found that VGT-Qwen2.5-VL-2B performs better at face generation, while VGT-InternVL3-1.6B performs better at landscapes, light and shadow, and animals. We encourage you to explore both.
## Citation
If you find our work useful, please cite our paper:
```bibtex
@misc{guo2025vgt,
  title={Visual Generation Tuning},
  author={Jiahao Guo and Sinan Du and Jingfeng Yao and Wenyu Liu and Bo Li and Haoxiang Cao and Kun Gai and Chun Yuan and Kai Wu and Xinggang Wang},
  year={2025},
  eprint={2511.23469},
  archivePrefix={arXiv},
}
```
## License
This project is released under the MIT License. See LICENSE for details.