
🚀 VGT: Visual Generation Tuning

**_Unleashing Visual Generation Capabilities from Any Pretrained VLM_**

This repository hosts models from the paper Visual Generation Tuning.

VGT (Visual Generation Tuning) is a new paradigm that unlocks the visual generation capabilities latent in any pretrained Vision-Language Model (VLM). It significantly reduces alignment costs and accelerates the convergence of autoregressive modeling in continuous latent space, enabling efficient, high-quality image generation from text descriptions.

VGT Generated Images

✨ Highlights

  • 🎯 Novel Paradigm: Transform ANY pretrained Vision-Language Model into a powerful image generator through efficient visual generation tuning
  • ⚡ 20× Speedup: Achieve dramatically faster convergence compared to vanilla VAE-based autoregressive models
  • 📊 SOTA Performance: GenEval 0.83 and DPG-Bench 81.28 with minimal training data
  • 🚀 Extreme Data Efficiency: Reach GenEval 0.55 in just 10K iterations and 0.60 in 30K iterations
  • 🔄 Parallel Inference: QueryAR mechanism enables 16× parallel decoding while maintaining high-quality generation
  • 🎨 Superior Reconstruction: 26.67 PSNR and 0.50 rFID at a 28× compression ratio, outperforming specialized VAEs

💡 What is VGT?

VGT (Visual Generation Tuning) is a groundbreaking paradigm that answers a fundamental question:

Can we directly leverage the well-aligned semantic representations in pretrained VLMs to enable visual generation capabilities?

VGT bridges this gap through two key innovations:

1. VGT-AE (Visual Generation Tuning - AutoEncoder)

  • Aligns semantic encoders from pretrained VLMs with latent representations of pixel decoders
  • Achieves 26.67 PSNR and 0.50 rFID at 28× compression, outperforming specialized VAEs
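
To make the idea concrete, here is a minimal, hypothetical sketch of the alignment step (not the authors' code): `vlm_encoder`, `projector`, and `pixel_decoder` are stand-in names for the VLM's semantic encoder, a learned projection into the decoder's latent space, and the pixel decoder.

# Hypothetical sketch of the VGT-AE alignment objective; all module names are stand-ins.
import torch.nn.functional as F

def vgt_ae_step(images, vlm_encoder, projector, pixel_decoder):
    sem_tokens = vlm_encoder(images)   # semantic features from the pretrained VLM
    latents = projector(sem_tokens)    # map them into the pixel decoder's latent space
    recon = pixel_decoder(latents)     # reconstruct pixels from the aligned latents
    return F.mse_loss(recon, images)   # simple reconstruction loss, for illustration only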

2. VGT-AR (Visual Generation Tuning - AutoRegressive)

  • Position-query mechanism for autoregressive formulation with partial parallel decoding
  • Dramatically accelerates convergence (20× speedup) compared to vanilla VAE-based models
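
For intuition, the following is a hedged sketch of how position-query decoding could proceed at inference time; `model` and `pos_queries` are hypothetical stand-ins (a transformer that attends over the running context, and a set of learned positional queries), not the actual VGT API.

# Hypothetical sketch of QueryAR-style partial parallel decoding.
import torch

@torch.no_grad()
def generate_latents(model, cond, pos_queries, num_latents=256, group=16):
    context, outputs = [cond], []
    for start in range(0, num_latents, group):
        queries = pos_queries[:, start:start + group]      # queries for the next positions
        pred = model(torch.cat(context, dim=1), queries)   # one pass predicts `group` latents
        outputs.append(pred)
        context.append(pred)                               # predictions extend the context
    return torch.cat(outputs, dim=1)                       # continuous latents for the decoder

With group=16, each forward pass commits 16 latent positions instead of one, matching the 16× parallel decoding highlighted above.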

🚀 Getting Started

Installation

# Clone the repository
git clone https://github.com/hustvl/VGT.git
cd VGT

# Install dependencies
conda create -n vgt python=3.10
conda activate vgt
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install mmengine xtuner tqdm timm
pip install diffusers transformers==4.57.1
pip install flash-attn --no-build-isolation
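
As an optional sanity check (our suggestion, not part of the official setup), you can verify the pinned stack from inside the vgt environment; the flash-attn import assumes a CUDA-capable GPU:

# check_env.py -- optional verification that the pinned dependencies import cleanly
import torch, torchvision, transformers, flash_attn  # noqa: F401

print("torch:", torch.__version__)                   # expect 2.5.1
print("cuda available:", torch.cuda.is_available())  # should be True on a CUDA 12.1 setup
print("transformers:", transformers.__version__)     # expect 4.57.1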

Pretrained Models

We provide VGT-tuned models based on Qwen2.5-VL and InternVL3 (448px):

| Model | Base Model | GenEval | DPG-Bench | Download |
|-------|-----------|---------|-----------|----------|
| VGT-InternVL3-1.6B-Pretrain | InternVL3-1.6B | 0.58 | 73.05 | 🤗 HuggingFace |
| VGT-InternVL3-1.6B-SFT | InternVL3-1.6B | 0.83 | 76.33 | 🤗 HuggingFace |
| VGT-Qwen2.5-VL-2B-Pretrain | Qwen2.5-VL-2B | 0.63 | 78.02 | 🤗 HuggingFace |
| VGT-Qwen2.5-VL-2B-SFT | Qwen2.5-VL-2B | 0.83 | 81.28 | 🤗 HuggingFace |

Inference

Download the SFT model checkpoints:

cd VGT
mkdir ckpts
hf download hustvl/vgt_qwen25vl_2B_sft --repo-type model --local-dir ckpts/hustvl/vgt_qwen25vl_2B_sft
hf download hustvl/vgt_internvl3_1_6B_sft --repo-type model --local-dir ckpts/hustvl/vgt_internvl3_1_6B_sft

Generate images from text prompts:

export PYTHONPATH=./:$PYTHONPATH

# Generate with the VGT-InternVL3 model
python scripts/sample_text_list_vgt_intervl3_0.6B.py

Note: Under the same training recipe, we found that VGT-Qwen2.5-VL-2B performs better at face generation, while VGT-InternVL3-1.6B performs better at landscapes, lighting and shadow, and animals. Feel free to explore both models.


πŸ“ Citation

If you find our work useful, please cite our paper:

@misc{guo2025vgt,
      title={Visual Generation Tuning}, 
      author={Jiahao Guo and Sinan Du and Jingfeng Yao and Wenyu Liu and Bo Li and Haoxiang Cao and Kun Gai and Chun Yuan and Kai Wu and Xinggang Wang},
      year={2025},
      eprint={2511.23469},
      archivePrefix={arXiv},
}

📄 License

This project is released under the MIT License. See LICENSE for details.
