pipeline_tag: image-text-to-text
tags:
- multimodal
- image caption
- captioning
datasets:
- internlm/CapRL-2M
---

# CapRL-3B

🤗<a href="https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189">CapRL Collection</a> | 🤗<a href="https://huggingface.co/papers/2509.22647">Daily Paper</a> | 🤗<a href="https://huggingface.co/mradermacher/CapRL-3B-GGUF">CapRL-3B-GGUF</a> | 🤗<a href="https://huggingface.co/mradermacher/CapRL-3B-i1-GGUF">CapRL-3B-i1-GGUF</a>

When selecting between the available CapRL models, consider the trade-off between performance and computational cost. The table below will help you choose the most suitable model for your needs:

| Model | Parameters | Strength |
|---|---|---|
| 🤗[CapRL-3B](https://huggingface.co/internlm/CapRL-3B) | 3B | Speed, Efficiency |
| 🤗[CapRL-InternVL3.5-8B](https://huggingface.co/yuhangzang/CapRL-InternVL3.5-8B) | 8B | High Performance, Advanced Captioning Ability |

|
| 30 |
We are working on even stronger base models and upgrading our training recipe β stay tuned!
|
| 31 |
+
- π₯ [10/15/2025] The total downloads of the CapRL-related [models and dataset](https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189) reached 6,000 within just 20 days!
|
| 32 |
+
- π [10/15/2025] We are excited to announce the release of **[CapRL-InternVL3.5-8B](https://huggingface.co/internlm/CapRL-InternVL3.5-8B)**, whose image captioning capability outperforms Qwen2.5-VL-72B!
|
| 33 |
+
- π [10/15/2025] Thanks [mradermacher](https://huggingface.co/mradermacher) for the valuable contribution! [CapRL-3B-GGUF](https://huggingface.co/mradermacher/CapRL-3B-GGUF) is the static quants version, and [CapRL-3B-i1-GGUF](https://huggingface.co/mradermacher/CapRL-3B-i1-GGUF) is weighted/imatrix quants version.
|
| 34 |
+
- π [10/15/2025] We release [QA curation code](https://github.com/InternLM/CapRL).
|
| 35 |
+
- π [09/25/2025] We release **CapRL** repository, [CapRL-3B model](https://huggingface.co/internlm/CapRL-3B), [evaluation code](https://github.com/InternLM/CapRL) and [dataset](https://huggingface.co/datasets/internlm/CapRL-2M).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 36 |
|
| 37 |
## Introduction
We are excited to introduce [CapRL-3B](https://huggingface.co/internlm/CapRL-3B), a lightweight 3B image captioner that achieves perception capabilities comparable to Qwen2.5-VL-72B.

This is the first study of applying Reinforcement Learning with Verifiable Rewards to the open-ended and subjective image captioning task. Unlike traditional Supervised Fine-Tuning, which directly imitates reference captions, CapRL uses a decoupled two-stage pipeline: the first stage uses LVLMs to generate rich and accurate captions, and the second stage evaluates caption quality by using a vision-free LLM, which sees only the caption and not the image, to perform the QA task. We also created a specific QA curation pipeline to ensure the quality of the questions and answers used for the second stage.
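
To make the second stage concrete, the sketch below shows one way such a QA-based verifiable reward can be computed. It is only an illustration: the `QAPair` structure, the prompt wording, and the `judge_llm.generate` interface are assumptions, not the exact CapRL implementation.

```python
# Hypothetical sketch of a QA-based caption reward in the spirit of CapRL.
# The policy LVLM writes a caption; a judge LLM that never sees the image answers
# curated multiple-choice questions from the caption alone; its accuracy is the reward.
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str        # e.g. "What is the title of the chart?"
    choices: list[str]   # candidate answers, rendered as options A, B, C, ...
    answer: str          # ground-truth option letter, e.g. "B"


def caption_reward(caption: str, qa_pairs: list[QAPair], judge_llm) -> float:
    """Fraction of curated questions the judge answers correctly given only the caption."""
    correct = 0
    for qa in qa_pairs:
        options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(qa.choices))
        prompt = (
            f"Caption: {caption}\n\n"
            f"Question: {qa.question}\n{options}\n"
            "Answer with a single option letter."
        )
        reply = judge_llm.generate(prompt)  # hypothetical text-only LLM interface
        prediction = reply.strip()[:1].upper()
        correct += int(prediction == qa.answer)
    return correct / max(len(qa_pairs), 1)
```

Because the judge never sees the image, a caption only earns a high reward if it actually carries the visual details needed to answer the curated questions.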

By employing the CapRL training framework, initializing with the Qwen2.5-VL-3B model, and using a carefully filtered 75K QA dataset as the training set, we obtained a highly capable captioner, [CapRL-3B](https://huggingface.co/internlm/CapRL-3B).

<p align="center">
  <img src="./assets/teaser.png" width="750"/>
</p>

## Key Features
* **Remarkable visual understanding of charts, infographics, and documents**: [CapRL-3B](https://huggingface.co/internlm/CapRL-3B) achieves perception accuracy and visual information coverage comparable to Qwen2.5-VL-72B.
* **Well-organized output**: The outputs of CapRL-3B are relatively well-structured, making them clear and easy to understand.
* **Detailed descriptions of natural images**: The outputs of [CapRL-3B](https://huggingface.co/internlm/CapRL-3B) cover all valid visual information while containing fewer hallucinations.

## Usage
If you want to use **[CapRL-3B](https://huggingface.co/internlm/CapRL-3B)** for captioning, you can directly follow the same inference approach as the [Qwen2.5-VL series](https://github.com/QwenLM/Qwen3-VL/tree/d2240f11656bfe404b9ba56db4e51cd09f522ff1).
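
For example, a minimal Transformers snippet in the standard Qwen2.5-VL style might look like the following; the image path, prompt text, and generation settings are placeholders, and the `qwen-vl-utils` package is assumed to be installed.

```python
# Caption an image with CapRL-3B via the standard Qwen2.5-VL inference flow.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "internlm/CapRL-3B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("internlm/CapRL-3B")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image.jpg"},  # replace with your image
            {"type": "text", "text": "Please describe the image in detail."},
        ],
    }
]

# Build the chat prompt and collect the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate the caption and strip the prompt tokens from the output.
generated = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```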

We recommend using **vLLM** to speed up inference.
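
A corresponding offline-inference sketch with vLLM is shown below, assuming a vLLM build with Qwen2.5-VL support; the chat-template tokens follow Qwen2.5-VL's format, and the image path and sampling parameters are placeholders.

```python
# Offline captioning with CapRL-3B in vLLM (settings are illustrative).
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="internlm/CapRL-3B", limit_mm_per_prompt={"image": 1})
sampling = SamplingParams(temperature=0.0, max_tokens=512)

# Qwen2.5-VL-style chat template with a single image placeholder.
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>"
    "Please describe the image in detail.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
image = Image.open("path/to/your/image.jpg")  # replace with your image

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
    sampling_params=sampling,
)
print(outputs[0].outputs[0].text)
```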