Update README.md
README.md (changed)

@@ -21,6 +21,24 @@ We introduce TianJiangZhuG_3B, an advanced multimodal large language model (MLLM)

Key Enhancements:

1. Meticulous Construction of High-Quality Chain-of-Thought (CoT) Datasets

   Scale and Coverage: We have systematically built thousands of high-quality Chinese and English reasoning examples across domains such as mathematical applications, logical reasoning, and symbolic operations, ensuring the model's generalization ability in diverse scenarios.

   Data Generation Method: Starting from selected image-text question-answer pairs and the "Super Chain-of-Thought Model", we automatically generate Chain-of-Thought annotated data containing detailed reasoning paths. This effectively enhances the model's step-by-step reasoning and logical coherence (a hedged sketch of such a pipeline appears after this list).

3. Multi-Stage GRPO Training Algorithm

   Progressive Learning Mechanism: We propose a multi-stage GRPO (Gradient-based Reward Policy Optimization) training process. Through task design that progresses from shallow to deep and from simple to complex, it guides the model through a stepwise evolution of capability (see the curriculum and training-loop sketches after this list):

   Primary Stage: Focus on judgment and classification tasks to strengthen the model's understanding of problem structures and basic logic.

   Intermediate Stage: Introduce multiple-choice and matching questions to improve the model's ability to identify key information among distractors.

   Advanced Stage: Expand to open-ended generation tasks to encourage the model to carry out free-form deduction and complete logical expression.

   Algorithm Advantages: This training strategy reduces the difficulty of learning complex tasks, improves training stability and policy convergence efficiency, and significantly enhances the model's adaptability across different task types.
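
To make the data-generation step above more concrete, here is a minimal sketch of how CoT annotations could be produced from image-text QA pairs. It assumes a hypothetical `query_cot_teacher` call standing in for the "Super Chain-of-Thought Model"; the field names and the answer-consistency filter are illustrative assumptions, not this repository's actual pipeline.

```python
# Minimal sketch of CoT data generation from image-text QA pairs.
# `query_cot_teacher` stands in for the "Super Chain-of-Thought Model";
# field names and the filtering rule are assumptions, not this repo's API.
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QAPair:
    image_path: str
    question: str
    answer: str


def query_cot_teacher(image_path: str, question: str) -> dict:
    """Placeholder: call a strong reasoning model and return
    {"reasoning": <step-by-step text>, "answer": <final answer>}."""
    raise NotImplementedError


def build_cot_record(pair: QAPair) -> Optional[dict]:
    out = query_cot_teacher(pair.image_path, pair.question)
    # Keep only traces whose final answer agrees with the reference answer,
    # so the detailed reasoning path stays consistent with the ground truth.
    if out["answer"].strip() != pair.answer.strip():
        return None
    return {
        "image": pair.image_path,
        "question": pair.question,
        "chain_of_thought": out["reasoning"],
        "answer": pair.answer,
    }


def build_dataset(pairs: List[QAPair], out_file: str) -> None:
    """Write one JSON line per accepted CoT-annotated example."""
    with open(out_file, "w", encoding="utf-8") as f:
        for pair in pairs:
            record = build_cot_record(pair)
            if record is not None:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
```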
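
The shallow-to-deep progression can be written down as a small curriculum that visits the three stages in order. The stage names and task types follow the list above; the epoch counts and identifiers are illustrative assumptions, not the project's actual configuration.

```python
# Illustrative three-stage curriculum mirroring the progression above.
# Task types follow the stage descriptions; epoch counts are assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class CurriculumStage:
    name: str
    task_types: tuple
    epochs: int


CURRICULUM = (
    CurriculumStage("primary", ("judgment", "classification"), epochs=1),
    CurriculumStage("intermediate", ("multiple_choice", "matching"), epochs=1),
    CurriculumStage("advanced", ("open_ended_generation",), epochs=2),
)


def iterate_curriculum():
    """Yield (stage, epoch_index) pairs in the order training would visit them."""
    for stage in CURRICULUM:
        for epoch in range(stage.epochs):
            yield stage, epoch
```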
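
Finally, a schematic view of how each stage's GRPO updates could be organized: sample a group of responses per prompt, score them with a stage-specific reward, and convert the rewards into group-relative advantages for the policy update. This follows the group-relative reading of GRPO common in the literature, while the README expands the acronym differently, so treat the structure as an assumption; every callable here (`sample_responses`, `reward_fn`, `update_policy`) is a hypothetical placeholder rather than the repository's training code.

```python
# Schematic per-stage GRPO loop; every callable is a placeholder that
# illustrates the structure, not this repository's training implementation.
import statistics
from typing import Callable, List, Sequence


def group_relative_advantages(rewards: Sequence[float]) -> List[float]:
    """Normalize rewards within one prompt's group of sampled responses."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero spread
    return [(r - mean) / std for r in rewards]


def run_stage(
    prompts: Sequence[dict],
    sample_responses: Callable[[dict, int], List[str]],  # policy sampler (placeholder)
    reward_fn: Callable[[dict, str], float],             # stage-specific reward (placeholder)
    update_policy: Callable[[dict, List[str], List[float]], None],  # optimizer step (placeholder)
    group_size: int = 8,
) -> None:
    """One pass over a stage's prompts with group-sampled responses."""
    for prompt in prompts:
        responses = sample_responses(prompt, group_size)
        rewards = [reward_fn(prompt, resp) for resp in responses]
        advantages = group_relative_advantages(rewards)
        update_policy(prompt, responses, advantages)
```

In this picture, the stages from the curriculum sketch above would be visited in order, each calling `run_stage` with its own task mix and reward, so the harder stages start from the policy produced by the simpler ones.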

Evaluation: