Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield Paper • 2511.22677 • Published 9 days ago • 18
RELIC: Interactive Video World Model with Long-Horizon Memory Paper • 2512.04040 • Published 3 days ago • 20
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer Paper • 2511.22699 • Published 9 days ago • 145
STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow Paper • 2511.20462 • Published 11 days ago • 29
MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation Paper • 2511.09611 • Published 24 days ago • 68
PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image Paper • 2511.13648 • Published 19 days ago • 52
Back to Basics: Let Denoising Generative Models Denoise Paper • 2511.13720 • Published 19 days ago • 63
Depth Anything 3: Recovering the Visual Space from Any Views Paper • 2511.10647 • Published 23 days ago • 92
Vision Language Models: 2025 Update Collection This collection includes all the models, datasets and Spaces mentioned in the blog Vision Language Models: 2025 Update • 67 items • Updated May 12 • 5
MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency Paper • 2510.25897 • Published Oct 29 • 16
TradingAgents: Multi-Agents LLM Financial Trading Framework Paper • 2412.20138 • Published Dec 28, 2024 • 14
OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation Paper • 2410.17799 • Published Oct 23, 2024 • 5
Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing Paper • 2510.19808 • Published Oct 22 • 28
RAP: Real-time Audio-driven Portrait Animation with Video Diffusion Transformer Paper • 2508.05115 • Published Aug 7 • 3
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time Paper • 2404.10667 • Published Apr 16, 2024 • 23
Dawn of the transformer era in speech emotion recognition: closing the valence gap Paper • 2203.07378 • Published Mar 14, 2022 • 2
Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation Paper • 2509.19296 • Published Sep 23 • 23
HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning Paper • 2509.08519 • Published Sep 10 • 128