We-Math committed on
Commit f1bf58d · verified · 1 Parent(s): 97456a3

Update README.md

Files changed (1)
  1. README.md +4 -31
README.md CHANGED
@@ -1,37 +1,10 @@
  ---
  license: mit
  ---
- ## 💡 Overview
-
- > *"The soul never thinks without an image." Aristotle*
-
- **V-Thinker** is a general-purpose multimodal reasoning assistant that enables **Interactive Thinking with Images** through end-to-end reinforcement learning. Unlike traditional vision-language models, V-Thinker actively **interacts** with visual content—editing, annotating, and transforming images to simplify complex problems.
- ```python
- import torch
- import os
- import json
- import argparse
- from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, AutoConfig, Qwen3VLForConditionalGeneration
- from tqdm import tqdm
- from utils import run_evaluation  # Assuming you have this utility function
- MODEL_PATH = ""  # set to the local path of the downloaded checkpoint
-
- config = AutoConfig.from_pretrained(MODEL_PATH)
- model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-     MODEL_PATH,
-     device_map="auto",  # "auto" works perfectly with CUDA_VISIBLE_DEVICES
-     config=config,
- )
- processor = AutoProcessor.from_pretrained(MODEL_PATH)
-
- question_text = "Question: Hint: Please answer the question and provide the final answer at the end.\nQuestion: How many lines of symmetry does this figure have?\n\n\nPlease provide the final answer in the format <answer>X</answer>"
- image_path = "./224.png"
-
- # Construct the full, normalized image path and run the evaluation
- final_assistant_response, final_answer, aux_path = run_evaluation(question_text, image_path, "./", model, processor)
- print("Model Response")
- print(final_answer)
- print("auxiliary path")
- print(aux_path)
-
- ```
 
+ # V-Thinker: Interactive Thinking with Images
+
+ This is the ARPO model checkpoint released for the paper *V-Thinker: Interactive Thinking with Images*.
+
+ ## Abstract
+
+ Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising “Thinking with Images” paradigm for LMMs, profoundly shifting from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by narrow visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions — diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
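
As a rough usage sketch only: the checkpoint path, image file, and prompt below are placeholders, and loading via the standard `transformers` Qwen2.5-VL classes is an assumption based on the evaluation example removed above. Single-image inference could then look like this:

```python
import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

MODEL_PATH = "path/to/V-Thinker-checkpoint"  # placeholder: local path of the downloaded checkpoint

# Load the checkpoint with the standard Qwen2.5-VL classes from transformers.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)

# A single image-grounded question, ending with the <answer>X</answer> format used above.
image = Image.open("./example.png")  # placeholder image
question = (
    "How many lines of symmetry does this figure have?\n"
    "Please provide the final answer in the format <answer>X</answer>"
)
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]}]

# Build the prompt from the chat template, then tokenize text and image together.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=512)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```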