---
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
---

<center><span style="font-size:2em;">TinyLLaVA-Video-R1</span></center>

[![arXiv](https://img.shields.io/badge/Arxiv-2504.09641-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2504.09641)[![Github](https://img.shields.io/badge/Github-Github-blue.svg)](https://github.com/ZhangXJ199/TinyLLaVA-Video-R1)

We introduce TinyLLaVA-Video-R1, a small-scale video reasoning model built on the traceably trained model [TinyLLaVA-Video](https://github.com/ZhangXJ199/TinyLLaVA-Video). After reinforcement learning on general Video-QA datasets, the model shows significantly improved reasoning and thinking abilities and exhibits the emergent characteristic of “aha moments”.

### Results
| Model (HF Path) | Video-MME (w/o subtitles) | MVBench | MLVU | MMVU (multiple-choice) |
| :----------------------------------------: | :-------------: | :-------: | :--------------: | :----------: |
| [Zhang199/TinyLLaVA-Video-R1](https://huggingface.co/Zhang199/TinyLLaVA-Video-R1) | 46.6 | 49.5 | 52.4 | 46.9 |
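
The HF path listed in the table can be used to fetch the checkpoint files locally. The snippet below is a minimal sketch, assuming the `huggingface_hub` package is installed; the helper name `download_checkpoint` and the default target directory are hypothetical and not part of this repository. It only downloads files and makes no assumption about how the model is loaded for inference, for which the GitHub repository linked above is the reference.

```python
from typing import Optional

# HF path taken from the results table.
REPO_ID = "Zhang199/TinyLLaVA-Video-R1"


def download_checkpoint(local_dir: Optional[str] = "TinyLLaVA-Video-R1") -> str:
    """Download every file of the model repo and return the local folder path.

    Hypothetical helper for illustration. `local_dir` is an assumed default;
    pass None to use the standard Hugging Face cache instead.
    """
    # Imported lazily so this module can be imported without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_checkpoint()` requires network access and returns the directory that contains the downloaded files.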