---
datasets:
- lmms-lab/ai2d
- lmms-lab/POPE
- lmms-lab/VizWiz-VQA
- echo840/OCRBench
- MathLLMs/MathVision
metrics:
- accuracy
license: mit
language:
- zh
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
---

## Introduction

We introduce TianJiangZhuGe_3B, an advanced multimodal large language model (MLLM) that demonstrates superior overall performance. We compare TianJiangZhuGe_3B with the Qwen2.5-VL-3B-Instruct model, whose pre-trained base model is used to initialize the language component of TianJiangZhuGe. Benefiting from native multimodal pre-training, TianJiangZhuGe_3B achieves even better overall text performance than Qwen2.5-VL-3B-Instruct.

## Key Enhancements

### 1. Meticulous Construction of High-Quality Chain-of-Thought (CoT) Datasets

- **Scale and coverage:** We systematically built thousands of high-quality Chinese and English reasoning samples across domains such as mathematical applications, logical reasoning, and symbolic operations, ensuring the model's generalization ability in diverse scenarios.
- **Data generation method:** Starting from selected image-text question-answer pairs, we use a "super chain-of-thought model" to automatically generate CoT-annotated data containing detailed reasoning paths (a sketch of such a pipeline is given at the end of this section). This method effectively enhances the model's step-by-step reasoning and logical coherence.

### 2. Multi-Stage GRPO Training Algorithm

- **Progressive learning mechanism:** We propose a multi-stage GRPO (Group Relative Policy Optimization) training process. Through task design that progresses from shallow to deep and from simple to complex, the model is guided through a stepwise capability evolution (see the curriculum sketch at the end of this section):
  - **Primary stage:** Focus on judgment and classification tasks to strengthen the model's understanding of problem structures and basic logic.
  - **Intermediate stage:** Introduce multiple-choice and matching questions to improve the model's ability to identify key information among distractors.
  - **Advanced stage:** Expand to open-ended generation tasks to encourage free deduction and complete logical expression.
- **Algorithm advantages:** This training strategy reduces the model's learning difficulty on complex tasks and improves training stability and policy convergence efficiency, while significantly enhancing the model's adaptability across different task types.
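As an illustration of the data-generation method above, here is a minimal sketch of how image-text QA pairs could be turned into CoT-annotated samples with a strong teacher model. The teacher checkpoint path, the prompt wording, and the `annotate_with_cot` helper are hypothetical placeholders and are not part of the released pipeline.

```python
import json
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# Hypothetical teacher standing in for the "super chain-of-thought model";
# path, prompt, and output format are illustrative assumptions only.
teacher_path = "/path/to/super-cot-teacher"
processor = AutoProcessor.from_pretrained(teacher_path)
teacher = Qwen2VLForConditionalGeneration.from_pretrained(
    teacher_path, torch_dtype=torch.float16
).to("cuda" if torch.cuda.is_available() else "cpu")

def annotate_with_cot(image_path, question, answer, max_new_tokens=512):
    """Ask the teacher for a step-by-step reasoning path that ends in the known answer."""
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image_path},
        {"type": "text", "text": (
            f"Question: {question}\nReference answer: {answer}\n"
            "Explain step by step how to arrive at this answer."
        )},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(teacher.device)
    out = teacher.generate(**inputs, max_new_tokens=max_new_tokens)
    cot = processor.batch_decode(
        [o[len(i):] for i, o in zip(inputs.input_ids, out)], skip_special_tokens=True
    )[0]
    return {"image": image_path, "question": question, "answer": answer, "cot": cot}

# Example: write CoT-annotated samples from (image, question, answer) triples to JSONL.
# with open("cot_train.jsonl", "w") as f:
#     for image_path, q, a in qa_pairs:
#         f.write(json.dumps(annotate_with_cot(image_path, q, a), ensure_ascii=False) + "\n")
```

Conditioning the teacher on the known reference answer keeps the generated reasoning paths anchored to a verified conclusion, which helps when the annotations are later filtered for correctness.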
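The multi-stage schedule can be pictured as a simple curriculum wrapper around GRPO updates. The sketch below only illustrates the idea: `policy`, `load_tasks`, the reward definitions, and the per-stage step counts are placeholder assumptions, not the released training code.

```python
import statistics
from dataclasses import dataclass
from typing import Callable, List

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Core GRPO idea: score each sampled completion relative to its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

def exact_match_reward(sample: dict, completion: str) -> float:
    """Binary reward for closed-form stages (judgment, classification, multiple choice)."""
    return 1.0 if completion.strip() == sample["answer"].strip() else 0.0

def open_ended_reward(sample: dict, completion: str) -> float:
    """Looser reward for free-form generation: credit when the reference answer appears."""
    return 1.0 if sample["answer"].lower() in completion.lower() else 0.0

@dataclass
class Stage:
    task_pool: str       # which task type this stage samples from
    reward_fn: Callable  # how completions are scored
    steps: int           # optimization steps before advancing to the next stage

STAGES = [
    Stage("judgment_and_classification",  exact_match_reward, 2000),  # primary stage
    Stage("multiple_choice_and_matching", exact_match_reward, 2000),  # intermediate stage
    Stage("open_ended_generation",        open_ended_reward,  4000),  # advanced stage
]

def train_multistage_grpo(policy, load_tasks, group_size: int = 8):
    """Run the stages in order, applying GRPO updates within each stage."""
    for stage in STAGES:
        tasks = load_tasks(stage.task_pool)
        for _ in range(stage.steps):
            sample = tasks.sample()
            completions = policy.sample(sample["prompt"], n=group_size)
            rewards = [stage.reward_fn(sample, c) for c in completions]
            advantages = group_relative_advantages(rewards)
            policy.update(sample["prompt"], completions, advantages)  # clipped policy-gradient step
```

The stage boundary is where both the task pool and the reward function change; everything inside the inner loop is an ordinary GRPO step driven by group-relative advantages.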
## Evaluation

![image](https://cdn-uploads.huggingface.co/production/uploads/665586fe6e0ff091acbf0af1/Pwj5PrL7VMssvyw_EkQvV.png)

| Benchmark | Qwen2.5-VL-3B | TianJiangZhuGe-3B |
|-----------|---------------|-------------------|
| POPE | 0.7676 | 0.8 |
| ai2d | 0.6343 | 0.6833 |
| vizwiz_val | 0.6099 | 0.6062 |
| MathVision | 22.86 | 22.14 |
| OCRBench | 68.1 | 71.4 |
| MathVista | 40.8 | 44.4 |

## Using Transformers to Chat

```python
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = '/nfs4/models/Tianjiangzhuge'

# Load the processor and model (use the Qwen2.5-VL model class if the checkpoint follows that architecture).
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.float16
).to(device)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "file:///path/to/image1.jpg"},
    {"type": "text", "text": "Describe this image."},
]}]

# Build the chat prompt and gather the vision inputs referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True))
```

Multi-image inference:

```python
# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Describe the difference between these images."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
```
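From here, generation proceeds exactly as in the single-image example; the continuation below simply repeats those calls for the multi-image messages (`max_new_tokens=256` is an illustrative choice).

```python
# Gather the vision inputs for the multi-image conversation and generate as before.
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True))
```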