CaptionQA: Is Your Caption as Useful as the Image Itself? • arXiv:2511.21025 • Published 11 days ago • 25 upvotes
DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning • arXiv:2511.22570 • Published 10 days ago • 63 upvotes
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer • arXiv:2511.22699 • Published 10 days ago • 146 upvotes
Black-Box On-Policy Distillation of Large Language Models • arXiv:2511.10643 • Published 24 days ago • 46 upvotes
PAN: A World Model for General, Interactable, and Long-Horizon World Simulation • arXiv:2511.09057 • Published 25 days ago • 75 upvotes
Revisiting Multimodal Positional Encoding in Vision-Language Models • arXiv:2510.23095 • Published Oct 27 • 20 upvotes
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation • arXiv:2511.01163 • Published Nov 3 • 31 upvotes
Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset • arXiv:2510.15742 • Published Oct 17 • 50 upvotes
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM • arXiv:2510.15870 • Published Oct 17 • 89 upvotes
QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs • arXiv:2510.11696 • Published Oct 13 • 176 upvotes
StreamingVLM: Real-Time Understanding for Infinite Video Streams • arXiv:2510.09608 • Published Oct 10 • 50 upvotes
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models • arXiv:2510.05034 • Published Oct 6 • 48 upvotes
VideoNSA: Native Sparse Attention Scales Video Understanding • arXiv:2510.02295 • Published Oct 2 • 9 upvotes
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training • arXiv:2509.26625 • Published Sep 30 • 43 upvotes
InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation • arXiv:2509.24663 • Published Sep 29 • 13 upvotes