---
datasets:
- lmms-lab/ai2d
- lmms-lab/POPE
- lmms-lab/VizWiz-VQA
- echo840/OCRBench
- MathLLMs/MathVision
metrics:
- accuracy
license: mit
language:
- zh
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
---

## Introduction

We introduce TianJiangZhuGe_3B, an advanced multimodal large language model (MLLM) that demonstrates superior overall performance. We compare TianJiangZhuGe_3B with the Qwen2.5-VL-3B-Instruct model, whose pre-trained base model is used to initialize the language component of TianJiangZhuGe. Benefiting from native multimodal pre-training, TianJiangZhuGe_3B achieves even better overall text performance than Qwen2.5-VL-3B-Instruct.

## Key Enhancements

### 1. Meticulous Construction of High-Quality Chain-of-Thought (CoT) Datasets

- **Scale and coverage:** We systematically built thousands of high-quality Chinese and English reasoning samples across domains such as mathematical applications, logical reasoning, and symbolic operations, ensuring the model's generalization ability in diverse scenarios.
- **Data generation method:** Starting from selected image-text question-answer pairs, we use a "super chain-of-thought model" to automatically generate CoT-annotated data containing detailed reasoning paths (a sketch of such a pipeline is given at the end of this section). This method effectively enhances the model's step-by-step reasoning and logical coherence.

### 2. Multi-Stage GRPO Training Algorithm

- **Progressive learning mechanism:** We propose a multi-stage GRPO (Group Relative Policy Optimization) training process. Through task design that progresses from shallow to deep and from simple to complex, the model is guided through a stepwise capability evolution (see the curriculum sketch at the end of this section):
  - **Primary stage:** Focus on judgment and classification tasks to strengthen the model's understanding of problem structures and basic logic.
  - **Intermediate stage:** Introduce multiple-choice and matching questions to improve the model's ability to identify key information among distractors.
  - **Advanced stage:** Expand to open-ended generation tasks to encourage free deduction and complete logical expression.
- **Algorithm advantages:** This training strategy reduces the model's learning difficulty on complex tasks and improves training stability and policy convergence efficiency, while significantly enhancing the model's adaptability across different task types.
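As an illustration of the data-generation method above, here is a minimal sketch of how image-text QA pairs could be turned into CoT-annotated samples with a strong teacher model. The teacher checkpoint path, the prompt wording, and the `annotate_with_cot` helper are hypothetical placeholders and are not part of the released pipeline.

```python
import json
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# Hypothetical teacher standing in for the "super chain-of-thought model";
# path, prompt, and output format are illustrative assumptions only.
teacher_path = "/path/to/super-cot-teacher"
processor = AutoProcessor.from_pretrained(teacher_path)
teacher = Qwen2VLForConditionalGeneration.from_pretrained(
    teacher_path, torch_dtype=torch.float16
).to("cuda" if torch.cuda.is_available() else "cpu")

def annotate_with_cot(image_path, question, answer, max_new_tokens=512):
    """Ask the teacher for a step-by-step reasoning path that ends in the known answer."""
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image_path},
        {"type": "text", "text": (
            f"Question: {question}\nReference answer: {answer}\n"
            "Explain step by step how to arrive at this answer."
        )},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(teacher.device)
    out = teacher.generate(**inputs, max_new_tokens=max_new_tokens)
    cot = processor.batch_decode(
        [o[len(i):] for i, o in zip(inputs.input_ids, out)], skip_special_tokens=True
    )[0]
    return {"image": image_path, "question": question, "answer": answer, "cot": cot}

# Example: write CoT-annotated samples from (image, question, answer) triples to JSONL.
# with open("cot_train.jsonl", "w") as f:
#     for image_path, q, a in qa_pairs:
#         f.write(json.dumps(annotate_with_cot(image_path, q, a), ensure_ascii=False) + "\n")
```

Conditioning the teacher on the known reference answer keeps the generated reasoning paths anchored to a verified conclusion, which helps when the annotations are later filtered for correctness.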
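The multi-stage schedule can be pictured as a simple curriculum wrapper around GRPO updates. The sketch below only illustrates the idea: `policy`, `load_tasks`, the reward definitions, and the per-stage step counts are placeholder assumptions, not the released training code.

```python
import statistics
from dataclasses import dataclass
from typing import Callable, List

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Core GRPO idea: score each sampled completion relative to its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

def exact_match_reward(sample: dict, completion: str) -> float:
    """Binary reward for closed-form stages (judgment, classification, multiple choice)."""
    return 1.0 if completion.strip() == sample["answer"].strip() else 0.0

def open_ended_reward(sample: dict, completion: str) -> float:
    """Looser reward for free-form generation: credit when the reference answer appears."""
    return 1.0 if sample["answer"].lower() in completion.lower() else 0.0

@dataclass
class Stage:
    task_pool: str       # which task type this stage samples from
    reward_fn: Callable  # how completions are scored
    steps: int           # optimization steps before advancing to the next stage

STAGES = [
    Stage("judgment_and_classification",  exact_match_reward, 2000),  # primary stage
    Stage("multiple_choice_and_matching", exact_match_reward, 2000),  # intermediate stage
    Stage("open_ended_generation",        open_ended_reward,  4000),  # advanced stage
]

def train_multistage_grpo(policy, load_tasks, group_size: int = 8):
    """Run the stages in order, applying GRPO updates within each stage."""
    for stage in STAGES:
        tasks = load_tasks(stage.task_pool)
        for _ in range(stage.steps):
            sample = tasks.sample()
            completions = policy.sample(sample["prompt"], n=group_size)
            rewards = [stage.reward_fn(sample, c) for c in completions]
            advantages = group_relative_advantages(rewards)
            policy.update(sample["prompt"], completions, advantages)  # clipped policy-gradient step
```

The stage boundary is where both the task pool and the reward function change; everything inside the inner loop is an ordinary GRPO step driven by group-relative advantages.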
## Evaluation

![image](https://cdn-uploads.huggingface.co/production/uploads/665586fe6e0ff091acbf0af1/Pwj5PrL7VMssvyw_EkQvV.png)

| Benchmark | Qwen2.5-VL-3B | TianJiangZhuGe-3B |
|-----------|---------------|-------------------|
| POPE | 0.7676 | 0.8 |
| ai2d | 0.6343 | 0.6833 |
| vizwiz_val | 0.6099 | 0.6062 |
| MathVision | 22.86 | 22.14 |
| OCRBench | 68.1 | 71.4 |
| MathVista | 40.8 | 44.4 |

## Using Transformers to Chat

```python
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = '/nfs4/models/Tianjiangzhuge'

# Load the processor and model (use the Qwen2.5-VL model class if the checkpoint follows that architecture).
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.float16
).to(device)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "file:///path/to/image1.jpg"},
    {"type": "text", "text": "Describe this image."},
]}]

# Build the chat prompt and gather the vision inputs referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True))
```

Multi-image inference:

```python
# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Describe the difference between these images."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
```
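From here, generation proceeds exactly as in the single-image example; the continuation below simply repeats those calls for the multi-image messages (`max_new_tokens=256` is an illustrative choice).

```python
# Gather the vision inputs for the multi-image conversation and generate as before.
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True))
```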