Innovator-VL-8B-Thinking

Introduction

Innovator-VL-8B-Thinking is a reasoning-oriented multimodal large language model designed for complex scientific problem solving. Built upon Innovator-VL-8B-Instruct, this model is further optimized for explicit multi-step reasoning, long-horizon chain-of-thought generation, and token-efficient scientific analysis.

The model is particularly suitable for scientific tasks that require structured reasoning over visual and textual evidence, such as mathematics, chemistry, materials science, and multimodal scientific benchmarks.


Model Overview

  • Model Type: Vision-Language Reasoning Model
  • Parameter Size: 8B
  • Base Language Model: Qwen3-8B-Base
  • Vision Encoder: RICE-ViT
  • Projector: PatchMerger

The model supports native-resolution multi-image inputs and is optimized for reasoning-intensive multimodal scenarios.
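A minimal inference sketch is shown below. It assumes the repository ships a Hugging Face processor and a chat-template interface that accepts interleaved image/text content, similar to other Qwen-based vision-language models; the class names, message format, and generation settings are assumptions, so consult the files in the repository for the exact usage.

```python
# Hedged multi-image inference sketch. Assumptions: the repo provides a
# Hugging Face processor and a chat template with interleaved image/text
# content; verify class names and message format against the repository.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "InnovatorLab/Innovator-VL-8B-Thinking"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

images = [Image.open("plot_a.png"), Image.open("plot_b.png")]
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "Compare the two plots and explain, step by step, which reaction is faster."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=2048)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```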


Key Characteristics

Explicit Multimodal Reasoning

Innovator-VL-8B-Thinking is trained to explicitly generate structured reasoning traces, enabling the model to:

  • Perform multi-step logical deduction grounded in visual evidence
  • Solve complex mathematical and scientific problems
  • Maintain reasoning consistency across long contexts
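Because the base model is Qwen3-8B-Base, the reasoning trace is likely delimited from the final answer with Qwen3-style `<think> ... </think>` tags. This is an assumption not confirmed by the card, so check the tokenizer's chat template; a hedged helper for separating the two parts is sketched below.

```python
# Hedged helper: split an assumed Qwen3-style "<think>...</think>" reasoning
# trace from the final answer. The delimiter is an assumption; adjust it to
# whatever the model's chat template actually emits.
def split_reasoning(output_text: str) -> tuple[str, str]:
    if "</think>" in output_text:
        trace, answer = output_text.split("</think>", 1)
        return trace.replace("<think>", "").strip(), answer.strip()
    return "", output_text.strip()
```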

Reinforcement Learning for Long-Horizon Reasoning

The model is further optimized using reinforcement learning to improve:

  • Reasoning correctness
  • Output consistency
  • Token efficiency in long chain-of-thought generation

Sequence-level optimization enables strong accuracy while significantly reducing unnecessary reasoning tokens.
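"Sequence-level" here refers to Group Sequence Policy Optimization (GSPO), named in the Training Pipeline section below. For reference, the published GSPO objective clips a length-normalized, sequence-level importance ratio with group-relative advantages; the formulation below follows the GSPO paper and is not a description of this model's internal training details.

$$
s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)} \right)^{1/|y_i|}, \qquad
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})}
$$

$$
\mathcal{J}_{\text{GSPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\Big( s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i \Big) \right]
$$

Because clipping and weighting act on whole sequences rather than individual tokens, long continuations that do not improve the reward are penalized as a unit, which is one reason sequence-level optimization can reduce unnecessary reasoning tokens.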

Scientific Reasoning Performance

Compared to instruction-only models, Innovator-VL-8B-Thinking demonstrates substantial gains on:

  • Multimodal mathematical reasoning benchmarks
  • Scientific reasoning and domain-specific QA
  • Tasks requiring precise step-by-step analysis


Model Architecture

  • Vision Encoder: RICE-ViT (region-aware visual representation)
  • Projector: PatchMerger for visual token compression
  • Language Model: Qwen3-8B-Base
  • Model Size: 8B parameters

The architecture is shared with the Instruct variant, while the optimization objective and training strategy differ at the post-training stage.
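A schematic sketch of the visual path is given below. It is illustrative only: the module and parameter names are placeholders, and the actual merge ratio and projection layers of RICE-ViT + PatchMerger may differ, but it shows how visual token compression ahead of the language model works in this family of architectures.

```python
# Illustrative sketch of PatchMerger-style visual token compression
# (placeholder names and sizes; not the actual Innovator-VL implementation).
import torch
import torch.nn as nn

class PatchMergerSketch(nn.Module):
    """Merge each group of `merge`**2 consecutive visual tokens into one
    token and project it into the language model's embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int, merge: int = 2):
        super().__init__()
        self.merge = merge
        self.proj = nn.Sequential(
            nn.Linear(vision_dim * merge * merge, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, vision_dim); num_patches must be
        # divisible by merge**2. Output length shrinks by a factor of merge**2.
        b, n, d = patch_tokens.shape
        grouped = patch_tokens.reshape(b, n // self.merge**2, d * self.merge**2)
        return self.proj(grouped)

# Example: 1024 vision patch tokens compressed to 256 LLM-space tokens
# (dimensions here are placeholders, not the real encoder/LLM sizes).
merger = PatchMergerSketch(vision_dim=1152, llm_dim=4096)
visual_tokens = merger(torch.randn(1, 1024, 1152))  # -> (1, 256, 4096)
```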


Training Pipeline

Multimodal Pre-training

  • Vision-language alignment with LLaVA-1.5 (558K)
  • Full-parameter mid-training using LLaVA-OneVision-1.5 (85M)

Instruction Initialization

  • Initialized from Innovator-VL-8B-Instruct
  • Supervised fine-tuning with multimodal instruction and reasoning data

Reinforcement Learning

  • Trained with Innovator-VL-RL-172K
  • Optimized using Group Sequence Policy Optimization (GSPO)
  • Reward design jointly considers reasoning structure and answer correctness
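The card does not specify how the two reward terms are combined. The snippet below is a purely hypothetical illustration of a reward that jointly scores reasoning structure (a well-formed trace) and answer correctness; the weights, helpers, and trace format are made up for illustration.

```python
# Hypothetical reward sketch (weights, helpers, and trace format are
# illustrative assumptions, not the actual Innovator-VL-RL-172K reward).
def reward(output_text: str, reference_answer: str) -> float:
    has_trace = "<think>" in output_text and "</think>" in output_text
    structure_score = 1.0 if has_trace else 0.0

    final_answer = output_text.split("</think>")[-1].strip()
    correctness_score = 1.0 if final_answer == reference_answer.strip() else 0.0

    # Jointly weight structure and correctness (weights are made up).
    return 0.2 * structure_score + 0.8 * correctness_score
```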

Usage Recommendations

This model is recommended for:

  • Multimodal mathematical reasoning
  • Scientific problem solving requiring explicit reasoning
  • Evaluation settings emphasizing chain-of-thought quality

For general instruction-following or latency-sensitive applications, the Instruct version is recommended.


Citation

@article{innovator-vl,
  title={Innovator-VL: A Multimodal Large Language Model for Scientific Discovery},
  year={2025}
}