AIcell committed on
Commit 3d31747 · verified · 1 Parent(s): 0e85b0d

Update README.md

Files changed (1)
  1. README.md +42 -7
README.md CHANGED
@@ -107,17 +107,52 @@ pip install qwen-vl-utils
## 💻 Model Downloads and Usage

```
- # Load model directly
+ from PIL import Image
+ import requests
+ from io import BytesIO
+ import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

- processor = AutoProcessor.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero")
- model = AutoModelForImageTextToText.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero")
-
- # Prepare input
+ # Load model directly
+ processor = AutoProcessor.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero")
+ model = AutoModelForImageTextToText.from_pretrained(
+     "turningpoint-ai/VisualThinker-R1-Zero", torch_dtype="auto", device_map="auto"
+ )
+ model.eval()
+
+ # Prepare image input (note: this URL points at a dataset-viewer page,
+ # not a raw image file; swap in a direct image URL before running)
+ image_url = "https://huggingface.co/datasets/array/SAT/viewer/default/validation?row=2&image-viewer=1FECF8A4A7380558FF5C3E659A8D54DB721032AF"
+
+ # Prepare text input: the R1-Zero-style template is written out by hand,
+ # so it is passed to the processor as-is rather than through a chat template
+ question = "Answer in natural language. I need to go to Chair (near the mark 7 in the image). Which direction should I turn to face the object? look straight or left by 40 degrees."
+ prompt = f"A conversation between User and Assistant. The user asks a question about the image, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: {question} \nAssistant: Let me solve this step by step.\n<think>"
+
+ # Process input
+ response = requests.get(image_url)
+ image = Image.open(BytesIO(response.content))
+ inputs = processor(
+     text=prompt,
+     images=image,
+     padding=True,
+     return_tensors="pt",
+ )
+ inputs = inputs.to("cuda")
+
+ # Generate the output
+ with torch.no_grad():
+     generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=1024, do_sample=True)
+ generated_ids_trimmed = [
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+ batch_output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+
+ # Get output
+ output_text = batch_output_text[0]
```

- ## 📰 Evaluation Results
+ <!-- ## 📰 Evaluation Results

### DeepSeek-R1-Evaluation
For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1.
@@ -151,7 +186,7 @@ model = AutoModelForImageTextToText.from_pretrained("turningpoint-ai/VisualThink
| | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | **91.8** |
| | C-SimpleQA (Correct) | 55.4 | 58.7 | **68.0** | 40.3 | - | 63.7 |

- </div>
+ </div> -->

## 🙌 Stay Connected!
 
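Since the prompt seeds the assistant's turn with an opening `<think>` tag, `output_text` normally contains the reasoning trace followed by the final answer. A minimal sketch of separating the two, assuming the model closes its reasoning with a `</think>` tag; the helper name and the fallback for an unclosed tag are our own choices, not part of the README:

```
def split_reasoning(output_text: str):
    """Split a completion into (reasoning, answer) at the closing think tag."""
    marker = "</think>"
    if marker in output_text:
        reasoning, answer = output_text.split(marker, 1)
        return reasoning.strip(), answer.strip()
    # If the tag is never closed, treat the whole completion as reasoning.
    return output_text.strip(), ""

reasoning, answer = split_reasoning(
    "The chair at mark 7 is to my left... </think> Turn left by 40 degrees."
)
print(answer)  # Turn left by 40 degrees.
```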
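The evaluation section that this commit comments out estimates pass@1 from 64 sampled responses per query (temperature 0.6, top-p 0.95). A minimal sketch of that estimator under those settings; `sample_response` and `is_correct` are hypothetical stand-ins for the generation call and a benchmark-specific grader:

```
def estimate_pass_at_1(queries, sample_response, is_correct, k=64):
    """Average per-query correctness over k samples, i.e. the pass@1 estimate."""
    per_query = []
    for q in queries:
        # Draw k independent samples and grade each one.
        n_correct = sum(is_correct(q, sample_response(q)) for _ in range(k))
        per_query.append(n_correct / k)
    return sum(per_query) / len(per_query)
```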