turningpoint-ai
/

VisualThinker-R1-Zero

@@ -107,17 +107,52 @@ pip install qwen-vl-utils
 ## 💻 Model Downloads and Usage
 ```
-# Load model directly
 from transformers import AutoProcessor, AutoModelForImageTextToText
-processor = AutoProcessor.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero")
-model = AutoModelForImageTextToText.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero")
-# Prepare input
 ```
-## 📰 Evaluation Results
 ### DeepSeek-R1-Evaluation
  For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1.
@@ -151,7 +186,7 @@ model = AutoModelForImageTextToText.from_pretrained("turningpoint-ai/VisualThink
 | | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | **91.8** |
 | | C-SimpleQA (Correct) | 55.4 | 58.7 | **68.0** | 40.3 | - | 63.7 |
-</div>
 ## 🙌 Stay Connected!

 ## 💻 Model Downloads and Usage
 ```
+from PIL import Image
+import requests
+from io import BytesIO
 from transformers import AutoProcessor, AutoModelForImageTextToText
+# Load model directly
+processor = AutoProcessor.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero")
+model = AutoModelForImageTextToText.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero",
+  , torch_dtype="auto", device_map="auto")
+model.eval()
+# Prepare image input
+image_url = "https://huggingface.co/datasets/array/SAT/viewer/default/validation?row=2&image-viewer=1FECF8A4A7380558FF5C3E659A8D54DB721032AF"
+# Prepare text input
+question = "Answer in natural language. I need to go to Chair (near the mark 7 in the image). Which direction should I turn to face the object? look straight or left by 40 degrees."
+prompt = f"A conversation between User and Assistant. The user asks a question about the image, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: {question} \nAssistant: Let me solve this step by step.\n<think>"
+# Process input
+response = requests.get(image_url)
+image = Image.open(BytesIO(response.content))
+text = processor.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
+input = processor(
+                text=text,
+                images=image,
+                padding=True,
+                return_tensors="pt",
+            )
+input = inputsto("cuda")
+# Generation of the output
+with torch.no_grad():
+    generated_ids = model.module.generate(**input, use_cache=True, max_new_tokens=1024, do_sample=True)
+generated_ids_trimmed = [
+    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+batch_output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+# Get output
+output_text = batch_output_text[0]
 ```
+<!-- ## 📰 Evaluation Results
 ### DeepSeek-R1-Evaluation
  For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1.
 | | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | **91.8** |
 | | C-SimpleQA (Correct) | 55.4 | 58.7 | **68.0** | 40.3 | - | 63.7 |
+</div> -->
 ## 🙌 Stay Connected!