---
license: apache-2.0
---

# Video-R2

**GitHub:** https://github.com/mbzuai-oryx/Video-R2

**Paper:** https://arxiv.org/abs/2511.23478

## Overview

Video-R2 is a video reasoning multimodal language model (MLLM) designed to produce consistent, temporally grounded, and visually faithful reasoning over dynamic video content.

It addresses two common failure modes of prior video reasoning models:
- Logical inconsistency between the reasoning and the final answer
- Over-reliance on linguistic priors instead of video evidence

In the paper, we propose two diagnostic metrics to quantify these issues:

- **Think-Answer Consistency (TAC):** alignment between the generated reasoning and the final answer.
- **Video Attention Score (VAS):** extent to which the model's reasoning relies on video evidence rather than linguistic priors or world knowledge.

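
TAC and VAS are computed as described in the paper. Purely as an illustration of the idea behind VAS, the sketch below estimates how much of the generated tokens' final-layer attention lands on video tokens, averaged over heads (the measurement visualized in the figure below). The tensor layout and the two masks are assumptions made for this example, not the released evaluation code.

```python
import torch

def video_attention_score(attn_last_layer: torch.Tensor,
                          video_token_mask: torch.Tensor,
                          generated_token_mask: torch.Tensor) -> float:
    """Toy VAS-style score: mean attention mass that generated tokens place on
    video tokens, averaged over heads of the final layer.

    attn_last_layer:      (num_heads, seq_len, seq_len) attention weights,
                          e.g. obtained with output_attentions=True
    video_token_mask:     (seq_len,) bool, True where the token is a video token
    generated_token_mask: (seq_len,) bool, True where the token was generated
    """
    # Attention from every generated (query) token to every video (key) token.
    attn_to_video = attn_last_layer[:, generated_token_mask][:, :, video_token_mask]
    # Sum over video keys -> fraction of each query's attention spent on video,
    # then average over generated tokens and heads.
    return attn_to_video.sum(dim=-1).mean().item()
```
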
![Front Page Figure](images/video_r2_figure_1.png)

**Inconsistent reasoning in prior video LLMs and improved visual reliance with Video-R2.**
Given the video and the question *“How many transactions does Visa have in one day?”*, both **Video-R1** and **VideoChat-R1** conclude option A during their reasoning but ultimately predict option E, showing that their reasoning and final answers do not match. This behavior occurs because these models rely heavily on textual context and prior knowledge while attending weakly to the video. In contrast, **Video-R2** correctly identifies the on-screen visual cue at `01:45` (*“23,666 transactions/sec”*), performs the temporal conversion, and arrives at the correct daily value. The box plot on the right shows the average attention from generated tokens to video tokens across all attention heads in the final transformer layer. Compared with the baselines, **Video-R2** assigns higher and more distributed attention to video tokens, indicating stronger and more adaptive visual reliance. While earlier models often produce plausible yet inconsistent reasoning, **Video-R2** reasons coherently and grounds its decisions in actual video evidence.

---

![Results Figure](images/video_r2_figure_2.png)

**Comparison of Video-R2 with recent video reasoning models, Video-R1, VideoChat-R1/1.5, and VideoRFT, across three metrics: TAC (Think–Answer Consistency), VAS (Video Attention Score), and Accuracy.**
The upper row reports average scores over the six reasoning benchmarks, `VideoMathQA, Video-MMMU, MMVU, VSIBench, MINERVA, and SciVideoBench`, while the lower row shows averages over all 11 benchmarks, including the five generic ones, `MVBench, VideoMME, TempCompass, MLVU, and LongVideoBench`. Video-R2 performs better across both reasoning and overall evaluations, achieving higher consistency (TAC) and more video-focused reasoning (VAS) while maintaining competitive accuracy.

---

## Key Ideas

Video-R2 combines two post-training stages:

1. Timestamp-aware supervised fine-tuning (SFT) to encourage explicit temporal grounding
2. Group Relative Policy Optimization (GRPO) to reinforce consistency and reliance on video evidence (a minimal sketch of the group-relative advantage follows below)
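
GRPO scores each sampled response relative to the other responses in its group rather than with a learned value function. The snippet below is a minimal sketch of that group-relative advantage computation only; the rewards used in Video-R2 (including the Temporal Alignment Reward described below) and the full policy update are defined in the paper and repository.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each sampled response's reward by the
    mean and std of its group (all samples drawn for the same prompt).

    rewards: (num_prompts, group_size) scalar rewards per sampled response
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: one prompt with four sampled responses and their scalar rewards.
print(group_relative_advantages(torch.tensor([[1.0, 0.0, 0.5, 1.0]])))
```
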
## Training Summary

- **Base model:** Qwen2.5-VL-Instruct (7B)
- **Stage 1:** Timestamp-aware SFT
- **Stage 2:** GRPO with Temporal Alignment Reward (TAR)
- **Training Dataset:** [MBZUAI/Video-R2-Dataset](https://huggingface.co/datasets/MBZUAI/Video-R2-Dataset)

For full training details, see the GitHub repository:
https://github.com/mbzuai-oryx/Video-R2
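
The exact form of the Temporal Alignment Reward is specified in the paper and training code. Purely as a hypothetical illustration of a temporal-alignment-style signal, one could reward the overlap between a timestamp span cited in the model's reasoning and a reference span:

```python
def temporal_iou(pred_span: tuple[float, float], ref_span: tuple[float, float]) -> float:
    """Hypothetical temporal-alignment-style reward: IoU (in seconds) between a
    predicted time span and a reference span. Not the paper's TAR definition."""
    inter = max(0.0, min(pred_span[1], ref_span[1]) - max(pred_span[0], ref_span[0]))
    union = (pred_span[1] - pred_span[0]) + (ref_span[1] - ref_span[0]) - inter
    return inter / union if union > 0 else 0.0

# e.g. the model cites 01:40-01:50 while the reference cue spans 01:45-01:50
print(temporal_iou((100.0, 110.0), (105.0, 110.0)))  # 0.5
```
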
## Evaluation

Video-R2 is evaluated on 11 benchmarks:
- 5 general benchmarks (MVBench, VideoMME, TempCompass, MLVU, LongVideoBench)
- 6 reasoning-focused benchmarks (VideoMathQA, Video-MMMU, MMVU, VSIBench, MINERVA, SciVideoBench)

## Usage

A Gradio demo is provided at: https://github.com/mbzuai-oryx/Video-R2/demo

```bash
cd demo
python gradio_demo.py --ckpt MBZUAI/Video-R2 --port 7860
```
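
Because the base model is Qwen2.5-VL-Instruct, the checkpoint should load with the standard Qwen2.5-VL classes in `transformers` (plus `qwen-vl-utils` for video preprocessing). The sketch below is a minimal example under that assumption; the video path, question, and generation settings are placeholders, and the repository's own scripts remain the reference for the intended prompting format.

```python
# pip install transformers qwen-vl-utils  (assumed dependencies)
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "MBZUAI/Video-R2", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("MBZUAI/Video-R2")

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/video.mp4"},  # placeholder path
        {"type": "text", "text": "How many transactions does Visa have in one day?"},
    ],
}]

# Build the chat prompt and extract the video frames expected by the processor.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```
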
## Citation ✏️

If you find Video-R2 helpful, please cite:

```bibtex
@article{maaz2025video-r2,
  title={Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models},
  author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Fahad Shahbaz and Khan, Salman},
  journal={arXiv preprint arXiv:2511.23478},
  year={2025}
}
```