Innovator-VL-8B-Thinking
Introduction
Innovator-VL-8B-Thinking is a reasoning-oriented multimodal large language model designed for complex scientific problem solving. Built upon Innovator-VL-8B-Instruct, this model is further optimized for explicit multi-step reasoning, long-horizon chain-of-thought generation, and token-efficient scientific analysis.
The model is particularly suitable for scientific tasks that require structured reasoning over visual and textual evidence, such as mathematics, chemistry, materials science, and multimodal scientific benchmarks.
Model Overview
- Model Type: Vision-Language Reasoning Model
- Parameter Size: 8B
- Base Language Model: Qwen3-8B-Base
- Vision Encoder: RICE-ViT
- Projector: PatchMerger
The model supports native-resolution multi-image inputs and is optimized for reasoning-intensive multimodal scenarios.
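As a quick illustration, the following is a minimal inference sketch using the standard Hugging Face multimodal chat interface. The repository ID, the auto classes, and the chat-template details below are assumptions for illustration and may differ from the released checkpoint.

```python
# Minimal inference sketch. The repo ID and processor behaviour are
# assumptions, not confirmed by this card.
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

model_id = "Innovator-VL/Innovator-VL-8B-Thinking"  # hypothetical repo ID
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

# Native-resolution multi-image input: both images go into one user turn.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("figure_a.png")},
        {"type": "image", "image": Image.open("figure_b.png")},
        {"type": "text", "text": "Compare the two spectra and explain the shift."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=2048)
print(processor.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```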
Key Characteristics
Explicit Multimodal Reasoning
Innovator-VL-8B-Thinking is trained to explicitly generate structured reasoning traces, enabling the model to:
- Perform multi-step logical deduction grounded in visual evidence
- Solve complex mathematical and scientific problems
- Maintain reasoning consistency across long contexts
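In practice, downstream code often needs the final answer separated from the trace. Here is a minimal parsing sketch, assuming the model delimits its trace with `<think>...</think>` tags as Qwen3-style thinking models do; the exact delimiter is an assumption, not confirmed by this card.

```python
# Sketch: separating the reasoning trace from the final answer.
# The <think>...</think> delimiter is an assumption.
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer) from a decoded generation."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()  # no trace emitted
    return match.group(1).strip(), text[match.end():].strip()

trace, answer = split_reasoning("<think>Area = 3 * 4 = 12.</think>The area is 12 cm².")
print(trace)   # "Area = 3 * 4 = 12."
print(answer)  # "The area is 12 cm²."
```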
Reinforcement Learning for Long-Horizon Reasoning
The model is further optimized using reinforcement learning to improve:
- Reasoning correctness
- Output consistency
- Token efficiency in long chain-of-thought generation
Sequence-level optimization maintains strong accuracy while significantly reducing unnecessary reasoning tokens.
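To make the token-efficiency idea concrete, here is a toy sequence-level reward that pays a bonus for correct answers and scales that bonus by how few tokens the trace used. The functional form and coefficients are illustrative assumptions, not the reward actually used in training.

```python
# Illustrative sequence-level reward in the spirit described above:
# correct answers are rewarded, and shorter correct traces score higher.
# Functional form and coefficients are assumptions for illustration.
def sequence_reward(is_correct: bool, num_tokens: int,
                    budget: int = 4096, length_weight: float = 0.1) -> float:
    if not is_correct:
        return 0.0  # no partial credit for wrong answers
    # Bonus shrinks linearly as the trace approaches the token budget.
    efficiency = max(0.0, 1.0 - num_tokens / budget)
    return 1.0 + length_weight * efficiency

print(sequence_reward(True, 800))   # concise correct trace -> ~1.08
print(sequence_reward(True, 4000))  # verbose correct trace -> ~1.00
print(sequence_reward(False, 200))  # wrong answer -> 0.0
```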
Scientific Reasoning Performance
Compared to instruction-only models, Innovator-VL-8B-Thinking demonstrates substantial gains on:
- Multimodal mathematical reasoning benchmarks
- Scientific reasoning and domain-specific QA
- Tasks requiring precise step-by-step analysis
Model Architecture
- Vision Encoder: RICE-ViT (region-aware visual representation)
- Projector: PatchMerger for visual token compression
- Language Model: Qwen3-8B-Base
- Model Size: 8B parameters
The architecture is shared with the Instruct variant, while the optimization objective and training strategy differ at the post-training stage.
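For intuition, the sketch below shows a PatchMerger-style projector in PyTorch: it concatenates each 2×2 neighbourhood of visual tokens (a 4× sequence compression) and projects the result into the language model's hidden size. All dimensions here are illustrative, not the model's actual configuration.

```python
# Minimal PatchMerger-style projector sketch (PyTorch).
# All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class PatchMerger(nn.Module):
    def __init__(self, vision_dim: int = 1024, lm_dim: int = 4096, merge: int = 2):
        super().__init__()
        self.merge = merge
        self.proj = nn.Sequential(
            nn.LayerNorm(vision_dim * merge * merge),
            nn.Linear(vision_dim * merge * merge, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, vision_dim) grid of visual tokens
        b, h, w, d = x.shape
        m = self.merge
        # Group each 2x2 neighbourhood, then flatten it into one token.
        x = x.reshape(b, h // m, m, w // m, m, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // m) * (w // m), m * m * d)
        return self.proj(x)  # (batch, tokens / 4, lm_dim)

tokens = PatchMerger()(torch.randn(1, 32, 32, 1024))
print(tokens.shape)  # torch.Size([1, 256, 4096])
```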
Training Pipeline
Multimodal Pre-training
- Vision-language alignment with LLaVA-1.5 (558K)
- Full-parameter mid-training using LLaVA-OneVision-1.5 (85M)
Instruction Initialization
- Initialized from Innovator-VL-8B-Instruct
- Supervised fine-tuning with multimodal instruction and reasoning data
Reinforcement Learning
- Trained with Innovator-VL-RL-172K
- Optimized using Group Sequence Policy Optimization (GSPO)
- Reward design jointly considers reasoning structure and answer correctness
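For reference, GSPO replaces per-token importance weights with a single length-normalized sequence-level ratio, clipped against group-normalized advantages. The sketch below is a simplified single-group version; the clipping range, shapes, and reward values are illustrative, with the rewards standing in for the structure-plus-correctness signal described above.

```python
# Sketch of the GSPO sequence-level objective for one prompt's group of
# G sampled responses. Epsilon and all inputs are illustrative; see the
# GSPO paper for the exact formulation used in training.
import numpy as np

def gspo_loss(logp_new, logp_old, lengths, rewards, eps=0.04):
    """logp_new/logp_old: summed sequence log-probs under the current and
    behaviour policies; lengths: token counts; rewards: scalar sequence
    rewards (e.g. reasoning structure + answer correctness)."""
    # Group-normalised advantage, one scalar per sequence.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Length-normalised sequence-level importance ratio s_i(theta).
    ratio = np.exp((logp_new - logp_old) / lengths)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Clipped surrogate averaged over the group; negate to get a loss.
    return -np.mean(np.minimum(ratio * adv, clipped * adv))

loss = gspo_loss(
    logp_new=np.array([-120.0, -340.0, -95.0, -410.0]),
    logp_old=np.array([-118.0, -335.0, -97.0, -405.0]),
    lengths=np.array([100, 300, 80, 350]),
    rewards=np.array([1.0, 0.0, 1.0, 0.0]),
)
print(round(loss, 4))
```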
Usage Recommendations
This model is recommended for:
- Multimodal mathematical reasoning
- Scientific problem solving requiring explicit reasoning
- Evaluation settings emphasizing chain-of-thought quality
For general instruction-following or latency-sensitive applications, the Instruct version is recommended.
Citation
@article{innovator-vl,
  title={Innovator-VL: A Multimodal Large Language Model for Scientific Discovery},
  year={2025}
}