---
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- TESS-Computer/quickdraw-circles
tags:
- trajectory-prediction
- diffusion-transformer
- vision-language
- robotics
- drawing
pipeline_tag: image-to-image
---

# Qwen-DiT-Draw

A vision-language model with a Diffusion Transformer (DiT) action head for trajectory prediction. Given an image and a text instruction, the model predicts drawing trajectories as fixed-size chunks of points.

**Architecture:** frozen Qwen2.5-VL-3B backbone + trainable DiT action head (36.7M params)

## Model Details

- **Base Model:** [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
- **Training Data:** [TESS-Computer/quickdraw-circles](https://huggingface.co/datasets/TESS-Computer/quickdraw-circles) (21k circle drawings)
- **Architecture:** GR00T-style chunked prediction with flow matching
- **Trainable Parameters:** 36.7M (DiT head only, VLM frozen)
- **Chunk Size:** 16 points per chunk
- **Output:** `(x, y, state)` per point, with coordinates normalized to [0, 1]; `state > 0.5` indicates a stop signal
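
With flow matching, GR00T-style action heads sample a chunk by integrating a learned velocity field from Gaussian noise toward the trajectory. The sketch below illustrates that sampling loop only; the step count, schedule, and the `velocity` call are illustrative assumptions, not the exact API of this repository (which exposes `predict_chunk`, see Usage).

```python
import torch

def sample_chunk(dit_head, vlm_features, chunk_size=16, action_dim=3, num_steps=10):
    """Illustrative Euler integration for a flow-matching action head.

    Assumes a hypothetical `dit_head.velocity(noisy_chunk, t, cond)` that
    predicts the velocity toward the clean trajectory chunk.
    """
    x = torch.randn(1, chunk_size, action_dim)          # start from Gaussian noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = torch.full((1,), step * dt)                  # current flow time in [0, 1)
        v = dit_head.velocity(x, t, cond=vlm_features)   # predicted velocity field
        x = x + dt * v                                   # Euler step toward the data
    return x                                             # (1, chunk_size, 3): (x, y, state)
```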
## Usage

```python
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

# You need the model code from: https://github.com/HusseinLezzaik/Qwen-DiT-Draw
from src.model import Qwen2_5_VL_Draw, TrajectoryConfig

# Build the model: frozen VLM backbone + trainable DiT trajectory head
config = TrajectoryConfig(chunk_size=16, dit_hidden_size=512, dit_num_layers=6)
model = Qwen2_5_VL_Draw(
    model_id="Qwen/Qwen2.5-VL-3B-Instruct",
    config=config,
    freeze_backbone=True,
    dtype=torch.bfloat16,
)

# Load the trained DiT head weights from this repo
weights_path = hf_hub_download(repo_id="TESS-Computer/qwen-dit-draw", filename="trajectory_head.pt")
model.trajectory_head.load_state_dict(torch.load(weights_path, weights_only=True))
model = model.to("cuda").eval()

# Load the processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# Create the input: a blank canvas and an instruction
image = Image.new("RGB", (512, 512), "white")  # white canvas
instruction = "draw a circle"

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image, "min_pixels": 200704, "max_pixels": 401408},
        {"type": "text", "text": instruction},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=[text], images=image_inputs, return_tensors="pt")
inputs = {k: v.to("cuda") if torch.is_tensor(v) else v for k, v in inputs.items()}

# Predict one trajectory chunk
with torch.no_grad():
    chunk = model.predict_chunk(**inputs)

chunk = chunk[0].float().cpu().numpy()  # (16, 3): (x, y, state)
print(f"Predicted {len(chunk)} points")
for i, (x, y, state) in enumerate(chunk):
    print(f"  Point {i}: ({x:.3f}, {y:.3f}), stop={state > 0.5}")
```

## Multi-Chunk Inference (Full Drawing)

For complete drawings, run a visual feedback loop: after each predicted chunk, render it onto the canvas and feed the updated image back to the model until it emits a stop signal.

```python
from PIL import ImageDraw

canvas = Image.new("RGB", (512, 512), "white")
all_points = []
max_chunks = 10

for chunk_idx in range(max_chunks):
    # Prepare inputs with the current canvas
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": canvas, "min_pixels": 200704, "max_pixels": 401408},
            {"type": "text", "text": "draw a circle"},
        ],
    }]
    # ... process and predict a `chunk` exactly as in the single-chunk example above ...

    all_points.extend(chunk.tolist())

    # Draw the chunk onto the canvas (use BLACK lines to match training!)
    draw = ImageDraw.Draw(canvas)
    for i in range(1, len(chunk)):
        x1, y1 = int(chunk[i - 1][0] * 512), int(chunk[i - 1][1] * 512)
        x2, y2 = int(chunk[i][0] * 512), int(chunk[i][1] * 512)
        draw.line([(x1, y1), (x2, y2)], fill="black", width=2)

    # Stop once the model emits a stop signal anywhere in the chunk
    if (chunk[:, 2] > 0.5).any():
        break
```

## Training

Trained on a Modal H100 for 2 epochs with a flow matching loss on the DiT head. See the [training code](https://github.com/HusseinLezzaik/Qwen-DiT-Draw).
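
For reference, a flow matching objective regresses a velocity field along interpolation paths between noise and ground-truth chunks. The sketch below illustrates one common (rectified-flow-style) form of that loss under assumed names (`dit_head.velocity`, tensor shapes); it is not the repository's exact training code.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(dit_head, vlm_features, target_chunk):
    """Illustrative flow matching loss for a batch of trajectory chunks.

    `target_chunk` is (B, 16, 3); `dit_head.velocity` is a hypothetical
    velocity predictor conditioned on the frozen VLM features.
    """
    noise = torch.randn_like(target_chunk)                              # x_0 ~ N(0, I)
    t = torch.rand(target_chunk.shape[0], device=target_chunk.device)   # flow time per sample
    t_ = t.view(-1, 1, 1)
    noisy = (1 - t_) * noise + t_ * target_chunk                        # linear interpolation path
    target_velocity = target_chunk - noise                              # constant velocity along the path
    pred_velocity = dit_head.velocity(noisy, t, cond=vlm_features)
    return F.mse_loss(pred_velocity, target_velocity)
```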
## Citation

```bibtex
@misc{qwen-dit-draw,
  author = {TESS Computer},
  title = {Qwen-DiT-Draw: VLM + DiT for Trajectory Prediction},
  year = {2025},
  url = {https://huggingface.co/TESS-Computer/qwen-dit-draw}
}
```

## Links

- **Code:** [GitHub - Qwen-DiT-Draw](https://github.com/HusseinLezzaik/Qwen-DiT-Draw)
- **Dataset:** [TESS-Computer/quickdraw-circles](https://huggingface.co/datasets/TESS-Computer/quickdraw-circles)
- **Base Model:** [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
|