---
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- TESS-Computer/quickdraw-circles
tags:
- trajectory-prediction
- diffusion-transformer
- vision-language
- robotics
- drawing
pipeline_tag: image-to-image
---

# Qwen-DiT-Draw

A vision-language model with a Diffusion Transformer (DiT) action head for trajectory prediction. Given an image and a text instruction, the model predicts drawing trajectories as fixed-size chunks of points.

**Architecture:** frozen Qwen2.5-VL-3B backbone + trainable DiT action head (36.7M params)

## Model Details

- **Base Model:** [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
- **Training Data:** [TESS-Computer/quickdraw-circles](https://huggingface.co/datasets/TESS-Computer/quickdraw-circles) (21k circle drawings)
- **Architecture:** GR00T-style chunked prediction with flow matching
- **Trainable Parameters:** 36.7M (DiT head only, VLM frozen)
- **Chunk Size:** 16 points per chunk
- **Output:** `(x, y, state)` per point, with coordinates normalized to [0, 1]; `state > 0.5` indicates a stop signal
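
With flow matching, GR00T-style action heads sample a chunk by integrating a learned velocity field from Gaussian noise toward the trajectory. The sketch below illustrates that sampling loop only; the step count, schedule, and the `velocity` call are illustrative assumptions, not the exact API of this repository (which exposes `predict_chunk`, see Usage).

```python
import torch

def sample_chunk(dit_head, vlm_features, chunk_size=16, action_dim=3, num_steps=10):
    """Illustrative Euler integration for a flow-matching action head.

    Assumes a hypothetical `dit_head.velocity(noisy_chunk, t, cond)` that
    predicts the velocity toward the clean trajectory chunk.
    """
    x = torch.randn(1, chunk_size, action_dim)          # start from Gaussian noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = torch.full((1,), step * dt)                  # current flow time in [0, 1)
        v = dit_head.velocity(x, t, cond=vlm_features)   # predicted velocity field
        x = x + dt * v                                   # Euler step toward the data
    return x                                             # (1, chunk_size, 3): (x, y, state)
```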
## Usage

```python
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

# You need the model code from: https://github.com/HusseinLezzaik/Qwen-DiT-Draw
from src.model import Qwen2_5_VL_Draw, TrajectoryConfig

# Build the model: frozen VLM backbone + trainable DiT trajectory head
config = TrajectoryConfig(chunk_size=16, dit_hidden_size=512, dit_num_layers=6)
model = Qwen2_5_VL_Draw(
    model_id="Qwen/Qwen2.5-VL-3B-Instruct",
    config=config,
    freeze_backbone=True,
    dtype=torch.bfloat16,
)

# Load the trained DiT head weights from this repo
weights_path = hf_hub_download(repo_id="TESS-Computer/qwen-dit-draw", filename="trajectory_head.pt")
model.trajectory_head.load_state_dict(torch.load(weights_path, weights_only=True))
model = model.to("cuda").eval()

# Load the processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# Create the input: a blank canvas and an instruction
image = Image.new("RGB", (512, 512), "white")  # white canvas
instruction = "draw a circle"

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image, "min_pixels": 200704, "max_pixels": 401408},
        {"type": "text", "text": instruction},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=[text], images=image_inputs, return_tensors="pt")
inputs = {k: v.to("cuda") if torch.is_tensor(v) else v for k, v in inputs.items()}

# Predict one trajectory chunk
with torch.no_grad():
    chunk = model.predict_chunk(**inputs)

chunk = chunk[0].float().cpu().numpy()  # (16, 3): (x, y, state)
print(f"Predicted {len(chunk)} points")
for i, (x, y, state) in enumerate(chunk):
    print(f"  Point {i}: ({x:.3f}, {y:.3f}), stop={state > 0.5}")
```

## Multi-Chunk Inference (Full Drawing)

For complete drawings, run a visual feedback loop: after each predicted chunk, render it onto the canvas and feed the updated image back to the model until it emits a stop signal.

```python
from PIL import ImageDraw

canvas = Image.new("RGB", (512, 512), "white")
all_points = []
max_chunks = 10

for chunk_idx in range(max_chunks):
    # Prepare inputs with the current canvas
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": canvas, "min_pixels": 200704, "max_pixels": 401408},
            {"type": "text", "text": "draw a circle"},
        ],
    }]
    # ... process and predict a `chunk` exactly as in the single-chunk example above ...

    all_points.extend(chunk.tolist())

    # Draw the chunk onto the canvas (use BLACK lines to match training!)
    draw = ImageDraw.Draw(canvas)
    for i in range(1, len(chunk)):
        x1, y1 = int(chunk[i - 1][0] * 512), int(chunk[i - 1][1] * 512)
        x2, y2 = int(chunk[i][0] * 512), int(chunk[i][1] * 512)
        draw.line([(x1, y1), (x2, y2)], fill="black", width=2)

    # Stop once the model emits a stop signal anywhere in the chunk
    if (chunk[:, 2] > 0.5).any():
        break
```

## Training

Trained on a Modal H100 for 2 epochs with a flow matching loss on the DiT head. See the [training code](https://github.com/HusseinLezzaik/Qwen-DiT-Draw).
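
For reference, a flow matching objective regresses a velocity field along interpolation paths between noise and ground-truth chunks. The sketch below illustrates one common (rectified-flow-style) form of that loss under assumed names (`dit_head.velocity`, tensor shapes); it is not the repository's exact training code.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(dit_head, vlm_features, target_chunk):
    """Illustrative flow matching loss for a batch of trajectory chunks.

    `target_chunk` is (B, 16, 3); `dit_head.velocity` is a hypothetical
    velocity predictor conditioned on the frozen VLM features.
    """
    noise = torch.randn_like(target_chunk)                              # x_0 ~ N(0, I)
    t = torch.rand(target_chunk.shape[0], device=target_chunk.device)   # flow time per sample
    t_ = t.view(-1, 1, 1)
    noisy = (1 - t_) * noise + t_ * target_chunk                        # linear interpolation path
    target_velocity = target_chunk - noise                              # constant velocity along the path
    pred_velocity = dit_head.velocity(noisy, t, cond=vlm_features)
    return F.mse_loss(pred_velocity, target_velocity)
```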
## Citation

```bibtex
@misc{qwen-dit-draw,
  author = {TESS Computer},
  title = {Qwen-DiT-Draw: VLM + DiT for Trajectory Prediction},
  year = {2025},
  url = {https://huggingface.co/TESS-Computer/qwen-dit-draw}
}
```

## Links

- **Code:** [GitHub - Qwen-DiT-Draw](https://github.com/HusseinLezzaik/Qwen-DiT-Draw)
- **Dataset:** [TESS-Computer/quickdraw-circles](https://huggingface.co/datasets/TESS-Computer/quickdraw-circles)
- **Base Model:** [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
|