|
|
--- |
|
|
license: mit |
|
|
--- |
|
|
## 💡 Overview |
|
|
|
|
|
> *"The soul never thinks without an image." — Aristotle* |
|
|
|
|
|
**V-Thinker** is a general-purpose multimodal reasoning assistant that enables **Interactive Thinking with Images** through end-to-end reinforcement learning. Unlike traditional vision-language models, V-Thinker actively **interacts** with visual content—editing, annotating, and transforming images to simplify complex problems. |
|
|
```bash |
|
|
import torch |
|
|
import os |
|
|
import json |
|
|
import argparse |
|
|
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, AutoConfig, Qwen3VLForConditionalGeneration |
|
|
from tqdm import tqdm |
|
|
from utils import run_evaluation # Assuming you have this utility function |
|
|
MODEL_PATH="" |
|
|
|
|
|
config = AutoConfig.from_pretrained(MODEL_PATH) |
|
|
model = Qwen2_5_VLForConditionalGeneration.from_pretrained( |
|
|
MODEL_PATH, |
|
|
device_map="auto", # "auto" works perfectly with CUDA_VISIBLE_DEVICES |
|
|
config=config |
|
|
) |
|
|
processor = AutoProcessor.from_pretrained(MODEL_PATH) |
|
|
|
|
|
question_text = "Question: Hint: Please answer the question and provide the final answer at the end.\nQuestion: How many lines of symmetry does this figure have?\n\n\nPlease provide the final answer in the format <answer>X</answer>" |
|
|
image_path = "./224.png" |
|
|
|
|
|
# Construct the full, normalized image pat |
|
|
final_assistant_response, final_answer, aux_path = run_evaluation(question_text, image_path, "./", model, processor) |
|
|
print("Model Response") |
|
|
print(final_answer) |
|
|
print("auxiliary path") |
|
|
print(final_answer) |
|
|
|
|
|
``` |