---
license: mit
---
## 💡 Overview
> *"The soul never thinks without an image." — Aristotle*

**V-Thinker** is a general-purpose multimodal reasoning assistant that enables **Interactive Thinking with Images** through end-to-end reinforcement learning. Unlike traditional vision-language models, V-Thinker actively **interacts** with visual content: editing, annotating, and transforming images to simplify complex problems.
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, AutoConfig
from utils import run_evaluation  # project-local helper shipped with this repo
MODEL_PATH = ""  # path to the V-Thinker checkpoint
config = AutoConfig.from_pretrained(MODEL_PATH)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,  # half precision to reduce GPU memory
    device_map="auto",           # respects CUDA_VISIBLE_DEVICES
    config=config,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)
question_text = (
    "Hint: Please answer the question and provide the final answer at the end.\n"
    "Question: How many lines of symmetry does this figure have?\n\n"
    "Please provide the final answer in the format <answer>X</answer>"
)
image_path = "./224.png"  # local test image
final_assistant_response, final_answer, aux_path = run_evaluation(question_text, image_path, "./", model, processor)
print("Model response:")
print(final_answer)
print("Auxiliary image path:")
print(aux_path)
```
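The prompt above instructs the model to wrap its final answer in `<answer>X</answer>` tags. The repo's `run_evaluation` already returns the parsed answer, but as a minimal sketch, a hypothetical helper (not part of the repo) could extract the tag from a raw model response with a regular expression:

```python
import re


def extract_answer(response: str):
    """Return the content of the last <answer>...</answer> tag, or None.

    Uses the last match so that intermediate reasoning containing earlier
    tags does not shadow the final answer.
    """
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


print(extract_answer("The figure has <answer>4</answer>"))  # → 4
```

Taking the last match rather than the first makes the parse robust if the model emits the tag more than once during its reasoning.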