Update README.md
README.md CHANGED
@@ -1,37 +1,10 @@
 ---
 license: mit
 ---
-
-
-
-
-
-```python
-import torch
-import os
-import json
-import argparse
-from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, AutoConfig, Qwen3VLForConditionalGeneration
-from tqdm import tqdm
-from utils import run_evaluation  # Assuming you have this utility function
-MODEL_PATH = ""
-
-config = AutoConfig.from_pretrained(MODEL_PATH)
-model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-    MODEL_PATH,
-    device_map="auto",  # "auto" works with CUDA_VISIBLE_DEVICES
-    config=config,
-)
-processor = AutoProcessor.from_pretrained(MODEL_PATH)
-
-question_text = "Question: Hint: Please answer the question and provide the final answer at the end.\nQuestion: How many lines of symmetry does this figure have?\n\n\nPlease provide the final answer in the format <answer>X</answer>"
-image_path = "./224.png"
-
-# Construct the full, normalized image path
-final_assistant_response, final_answer, aux_path = run_evaluation(question_text, image_path, "./", model, processor)
-print("Model Response")
-print(final_answer)
-print("auxiliary path")
-print(aux_path)
-
-```
+# V-Thinker: Interactive Thinking with Images
+
+The ARPO model checkpoint is released for the paper V-Thinker: Interactive Thinking with Images.
+
+## Abstract
+
+Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising “Thinking with Images” paradigm for LMMs, profoundly shifting from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by narrow visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions — diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
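
The updated card no longer ships a usage snippet. For orientation only, here is a minimal inference sketch rather than the authors' official evaluation pipeline: it assumes the released checkpoint remains loadable with Qwen2_5_VLForConditionalGeneration (as the removed snippet implies); the repository path and generation settings are placeholders, and the image path and question are reused from the removed example.

```python
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

# Placeholder: point this at the released ARPO / V-Thinker checkpoint.
MODEL_PATH = "path/to/v-thinker-checkpoint"

# Load model and processor; dtype/device settings are illustrative defaults.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)

# One image + question turn, mirroring the example in the removed snippet.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "How many lines of symmetry does this figure have? "
                                     "Please provide the final answer in the format <answer>X</answer>"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image = Image.open("./224.png")  # example image from the removed snippet

# Preprocess text and image together, then generate.
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)  # token budget is an assumption

# Decode only the newly generated tokens (everything after the prompt).
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

If the checkpoint is instead a Qwen3-VL variant (the removed snippet also imported Qwen3VLForConditionalGeneration), the model class would need to be swapped accordingly, and the paper's run_evaluation tooling would replace the bare generate call above.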