yuhangzang committed
Commit 8a2e07e · verified · 1 Parent(s): ec60e9f

Update README.md

Files changed (1)
  1. README.md +21 -16
README.md CHANGED
@@ -7,6 +7,9 @@ pipeline_tag: image-text-to-text
  tags:
  - multimodal
  - image caption
+ - captioning
+ datasets:
+ - internlm/CapRL-2M
  ---

  # CapRL-3B
@@ -16,21 +19,23 @@ tags:

  🤗<a href="https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189">CapRL Collection</a> | 🤗<a href="https://huggingface.co/papers/2509.22647">Daily Paper</a> ｜🤗<a href="https://huggingface.co/mradermacher/CapRL-3B-GGUF">CapRL-3B-GGUF</a> ｜🤗<a href="https://huggingface.co/mradermacher/CapRL-3B-i1-GGUF">CapRL-3B-i1-GGUF</a>

+ When selecting between the available CapRL models, it's essential to consider the trade-off between performance and computational cost.
+ This guide will help you choose the most suitable model for your specific needs:
+ |Model|Parameters|Strength|
+ |-|-|-|
+ |🤗[CapRL-3B](https://huggingface.co/internlm/CapRL-3B)|3B|Speed, Efficiency|
+ |🤗[CapRL-InternVL3.5-8B](https://huggingface.co/yuhangzang/CapRL-InternVL3.5-8B)|8B|High Performance, Advanced Captioning Ability|
+
  ## 📢 News
  We are working on even stronger base models and upgrading our training recipe — stay tuned!
- - 🔥 [10/15/2025] The total downloads of the CapRL-related model and dataset reached 6,000 within just 20 days!
- - 🚀 [10/15/2025] We are excited to announce the release of **CapRL-InternVL3.5-8B**, whose image captioning capability outperforms Qwen2.5-VL-72B!
- - 🚀 [10/15/2025] We release QA curation code.
- - 🚀 [09/25/2025] We release **CapRL** repository, model, evaluation code and dataset.
-
- Based on the same recipe as CapRL-3B, we used InternVL3.5-8B as the policy model and obtained **CapRL-InternVL3.5-8B** through CapRL.
-
-
- CapRL-3B-GGUF is static quants version, and CapRL-3B-i1-GGUF is weighted/imatrix quants version. Thanks for their contribution!
-
+ - 🔥 [10/15/2025] The total downloads of the CapRL-related [models and dataset](https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189) reached 6,000 within just 20 days!
+ - 🚀 [10/15/2025] We are excited to announce the release of **[CapRL-InternVL3.5-8B](https://huggingface.co/internlm/CapRL-InternVL3.5-8B)**, whose image captioning capability outperforms Qwen2.5-VL-72B!
+ - 🚀 [10/15/2025] Thanks to [mradermacher](https://huggingface.co/mradermacher) for the valuable contribution! [CapRL-3B-GGUF](https://huggingface.co/mradermacher/CapRL-3B-GGUF) is the static quants version, and [CapRL-3B-i1-GGUF](https://huggingface.co/mradermacher/CapRL-3B-i1-GGUF) is the weighted/imatrix quants version.
+ - 🚀 [10/15/2025] We release the [QA curation code](https://github.com/InternLM/CapRL).
+ - 🚀 [09/25/2025] We release the **CapRL** repository, the [CapRL-3B model](https://huggingface.co/internlm/CapRL-3B), [evaluation code](https://github.com/InternLM/CapRL) and [dataset](https://huggingface.co/datasets/internlm/CapRL-2M).

  ## Introduction
- We are excited to introduce CapRL-3B, a lightweight 3B image captioner that achieves perception capabilities comparable to Qwen2.5-VL-72B.
+ We are excited to introduce [CapRL-3B](https://huggingface.co/internlm/CapRL-3B), a lightweight 3B image captioner that achieves perception capabilities comparable to Qwen2.5-VL-72B.

  This is the first study of applying Reinforcement Learning with Verifiable Rewards for the
  open-ended and subjective image captioning task. Unlike traditional Supervised Fine-Tuning, which
@@ -41,8 +46,8 @@ stage uses LVLMs to generate rich and accurate captions. Subsequently, the secon
  caption quality by using a vision-only LLM to perform the QA task. We also created a specific QA
  curation pipeline to ensure the quality of the questions and answers used for the second stage.

- By employing CapRL training framework, initializing with the Qwen2.5-VL-3B model, and using a carefully
- filtered 75K QA dataset as the training set, we obtained a highly capable captioner, CapRL-3B.
+ By employing the CapRL training framework, initializing with the Qwen2.5-VL-3B model, and using a carefully
+ filtered 75K QA dataset as the training set, we obtained a highly capable captioner, [CapRL-3B](https://huggingface.co/internlm/CapRL-3B).

  <p align="center">
    <img src="./assets/teaser.png" width="750"/>
@@ -52,12 +57,12 @@ filtered 75K QA dataset as the training set, we obtained a highly capable captio
  </p>

  ## Key Features
- * **Remarkable visual understanding for Chart, Infographics and Document**: CapRL-3B achieves perception accuracy and visual information coverage comparable to Qwen2.5-VL-72B.
+ * **Remarkable visual understanding for Chart, Infographics and Document**: [CapRL-3B](https://huggingface.co/internlm/CapRL-3B) achieves perception accuracy and visual information coverage comparable to Qwen2.5-VL-72B.
  * **Well-organized output**: The outputs of CapRL-3B are relatively well-structured, making them clear and easy to understand.
- * **Detailed description for natural images**: The outputs of CapRL-3B can perfectly cover all valid visual information while containing fewer hallucinations.
+ * **Detailed description for natural images**: The outputs of [CapRL-3B](https://huggingface.co/internlm/CapRL-3B) can perfectly cover all valid visual information while containing fewer hallucinations.

  ## Usage
- If you want to use **CapRL-3B** for captioning, you can directly follow the exact same inference approach as in [Qwen2.5-VL-series](https://github.com/QwenLM/Qwen3-VL/tree/d2240f11656bfe404b9ba56db4e51cd09f522ff1).
+ If you want to use **[CapRL-3B](https://huggingface.co/internlm/CapRL-3B)** for captioning, you can directly follow the exact same inference approach as in [Qwen2.5-VL-series](https://github.com/QwenLM/Qwen3-VL/tree/d2240f11656bfe404b9ba56db4e51cd09f522ff1).

  We recommend using **vLLM** to speed up inference.
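
For reference, here is a minimal sketch of the captioning flow described in the Usage section of the updated README. It is not taken from the model card itself: it assumes the `internlm/CapRL-3B` checkpoint loads in vLLM as a Qwen2.5-VL-style model, and the prompt wording, image path, and generation settings are illustrative.

```python
# Hypothetical example: offline captioning with vLLM, assuming CapRL-3B follows
# the standard Qwen2.5-VL chat template. Adjust paths and settings as needed.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="internlm/CapRL-3B",        # checkpoint referenced in the README
    max_model_len=8192,               # illustrative context length
    limit_mm_per_prompt={"image": 1}, # one image per prompt
)
sampling = SamplingParams(temperature=0.0, max_tokens=1024)

# Qwen2.5-VL-style prompt with the image placeholder tokens.
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Describe this image in detail.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

image = Image.open("example.png").convert("RGB")  # any local image
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling,
)
print(outputs[0].outputs[0].text)  # the generated caption
```

The same request shape also works for batching: pass a list of prompt dictionaries to `llm.generate` to caption many images in one call.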
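The Introduction describes the CapRL reward only in prose, so the following is an illustrative reading of its QA-based verifiable reward, not the authors' implementation. All names here (`QAPair`, `caption_reward`, `answer_with_llm`) are made up for the sketch, and it assumes the judge LLM answers each curated question from the generated caption alone, with accuracy serving as the reward.

```python
# Illustrative sketch only (not the CapRL authors' code): one plausible form of a
# QA-based verifiable reward for captioning. A judge LLM answers each curated
# question using only the generated caption; the reward is its accuracy.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:          # hypothetical container for one curated question
    question: str
    answer: str        # short verifiable answer, e.g. an option letter or a number


def caption_reward(
    caption: str,
    qa_pairs: List[QAPair],
    answer_with_llm: Callable[[str, str], str],  # (caption, question) -> predicted answer
) -> float:
    """Fraction of curated questions the judge answers correctly from the caption alone."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        answer_with_llm(caption, qa.question).strip().lower() == qa.answer.strip().lower()
        for qa in qa_pairs
    )
    return correct / len(qa_pairs)


# Tiny usage example with a stand-in judge:
if __name__ == "__main__":
    qas = [QAPair("What color is the car?", "red"),
           QAPair("How many people are visible?", "2")]
    dummy_judge = lambda cap, q: "red" if "color" in q else "2"
    print(caption_reward("A red car parked next to two people.", qas, dummy_judge))  # 1.0
```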