---
license: apache-2.0
language:
- en
tags:
- spatial-reasoning
- multimodal
- vision-language
- scene-graph
- reinforcement-learning
base_model: Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
---

# SpatialThinker-3B

<p align="center">
  <a href="https://arxiv.org/abs/2511.07403">
    <img src="https://img.shields.io/badge/arXiv-2511.07403-b31b1b.svg" alt="arXiv">
  </a>
  <a href="https://hunarbatra.com/SpatialThinker">
    <img src="https://img.shields.io/badge/Project%20Page-blue.svg" alt="Project Page">
  </a>
  <a href="https://github.com/hunarbatra/SpatialThinker">
    <img src="https://img.shields.io/badge/GitHub-Repository-black.svg" alt="GitHub">
  </a>
</p>

**SpatialThinker-3B** is a 3D-aware multimodal large language model (MLLM) trained with reinforcement learning to couple structured spatial grounding with multi-step reasoning. The model mimics human-like spatial perception by constructing a scene graph of task-relevant objects and their spatial relations, then reasoning toward an answer; training uses dense spatial rewards to enforce this grounding.

## Model Description

- **Base Model**: Qwen2.5-VL-3B-Instruct
- **Training**: GRPO (Group Relative Policy Optimization) with dense spatial rewards
- **Training Data**: STVQA-7K (7,587 spatial VQA samples)
- **Authors**: Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
- **Institutions**: University of Oxford, UC Santa Cruz

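GRPO scores each sampled response relative to the other responses in its group, so no learned value model is needed. A minimal sketch of the group-relative advantage computation (illustrative only; the reward values below are hypothetical):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std (GRPO-style)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for a group of 4 sampled responses to one question
advs = group_relative_advantages([1.0, 0.4, 0.4, 0.0])
# The best response gets a positive advantage, the worst a negative one,
# and the advantages sum to zero within the group.
```
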
## Key Features

- **Structured Spatial Reasoning**: Constructs question-focused scene subgraphs with objects, bounding boxes, and relations
- **Dense Spatial Rewards**: Multi-objective reward function enforcing format, count, accuracy, and spatial grounding
- **9 Spatial Reasoning Categories**: Relations, reach, size, orientation, instance location, depth, distance, count, and existence
- **Outperforms GPT-4o**: On spatial understanding benchmarks while using only 7K training samples

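The multi-objective reward can be pictured as a format gate over a weighted sum of the remaining terms. The sketch below is illustrative only: the weights, the gating behavior, and the scoring inputs are assumptions, not the paper's exact formulation.

```python
def spatial_reward(format_ok, count_score, answer_correct, iou_score,
                   w_count=0.2, w_acc=0.6, w_spatial=0.2):
    """Illustrative dense reward: format acts as a gate, the rest are weighted.

    format_ok: bool, response followed the <observe>/<scene>/<think>/<answer> template
    count_score: [0, 1], predicted vs. expected number of grounded objects
    answer_correct: bool, final answer matches the ground truth
    iou_score: [0, 1], overlap between predicted and reference bounding boxes
    """
    if not format_ok:
        return 0.0  # malformed responses earn nothing
    return (w_count * count_score
            + w_acc * float(answer_correct)
            + w_spatial * iou_score)

# Hypothetical example: correct answer with decent spatial grounding
r = spatial_reward(True, 0.8, True, 0.7)
```
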
## Inference Template

Use the following template for inference:

```
You FIRST observe the image in <observe> </observe> tags, then visualise the relevant scene graph in <scene> </scene> tags, followed by thinking about the reasoning process as an internal monologue within <think> </think> tags and then provide the final answer. The final answer MUST BE put within <answer> </answer> tags, and only return the final choice including the correct option and answer within the answer tags, e.g., <answer> (A) cat </answer>.

Image size: {Width} x {Height}
```

## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "OX-PIXL/SpatialThinker-3B",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("OX-PIXL/SpatialThinker-3B")

# Load image
image = Image.open("your_image.jpg")
width, height = image.size

# Prepare prompt with template
template = f"""You FIRST observe the image in <observe> </observe> tags, then visualise the relevant scene graph in <scene> </scene> tags, followed by thinking about the reasoning process as an internal monologue within <think> </think> tags and then provide the final answer. The final answer MUST BE put within <answer> </answer> tags, and only return the final choice including the correct option and answer within the answer tags, e.g., <answer> (A) cat </answer>.

Image size: {width} x {height}"""

question = "Where is the cat relative to the couch? (A) on top of (B) in front of (C) behind (D) beside"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": template + "\n\n" + question},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=1024)
# Trim the prompt tokens so only the newly generated response is decoded
generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
output = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output)
```

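The decoded output interleaves the four tag blocks, so downstream code typically extracts just the final answer. A minimal sketch (the sample response string below is hypothetical, not actual model output):

```python
import re

def extract_answer(response):
    """Pull the text inside the last <answer> ... </answer> block, if any."""
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

# Hypothetical model response
response = (
    "<observe>A cat rests on a grey couch.</observe>"
    "<scene>cat [120, 80, 260, 200] - on top of - couch [60, 150, 400, 320]</scene>"
    "<think>The cat's box sits above and inside the couch region.</think>"
    "<answer> (A) on top of </answer>"
)
print(extract_answer(response))  # → "(A) on top of"
```
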
## Citation

```bibtex
@misc{batra2025spatialthinkerreinforcing3dreasoning,
  title={SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards},
  author={Hunar Batra and Haoqin Tu and Hardy Chen and Yuanze Lin and Cihang Xie and Ronald Clark},
  year={2025},
  eprint={2511.07403},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.07403},
}
```

## Links

- 📄 **Paper**: [arXiv:2511.07403](https://arxiv.org/abs/2511.07403)
- 🌐 **Project Page**: [hunarbatra.com/SpatialThinker](https://hunarbatra.com/SpatialThinker)
- 💻 **GitHub**: [github.com/hunarbatra/SpatialThinker](https://github.com/hunarbatra/SpatialThinker)
- 🤗 **Dataset**: [OX-PIXL/STVQA-7K](https://huggingface.co/datasets/OX-PIXL/STVQA-7K)