Hyperparameters for replication
Hi team, great blog post and research! I am trying to replicate the results on the countdown task mentioned in the write-up, but I'm having trouble doing so without the same training parameters. I tried to distill Qwen/Qwen3-4B-Instruct-2507 into Qwen/Qwen2.5-1.5B-Instruct with
training_args = GOLDConfig(
    lmbda=1.0,
    beta=0.0,
    use_uld_loss=True,
    uld_use_hybrid_loss=True,
    learning_rate=5e-6,
    warmup_ratio=0.05,
    max_length=1024 + 136,
    max_completion_length=1024,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    lr_scheduler_type="linear",
    ...
)
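For context, the rest of the wiring looks roughly like this (a simplified sketch: the import path and the GOLDTrainer constructor arguments are assumed to follow the pattern of other TRL trainers, and the dataset id is a placeholder):

from datasets import load_dataset
from trl.experimental.gold import GOLDConfig, GOLDTrainer  # import path may differ across TRL versions

# Placeholder dataset id for the countdown prompts.
train_dataset = load_dataset("my-org/countdown-prompts", split="train")

trainer = GOLDTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    teacher_model="Qwen/Qwen3-4B-Instruct-2507",
    args=training_args,              # the GOLDConfig shown above
    train_dataset=train_dataset,
)
trainer.train()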
Loss decreases over 500 steps, but the eval scores only fluctuate around the starting value. Before running a longer training, I would really appreciate it if you could share the configs used for the results presented in the blog post.
Hi @thomasip , thank you!
The differences I see in your setup compared to what we had are:
- learning_rate: We used 1e-7.
- lr_scheduler_type: 'cosine'
- num_train_epochs: 5
- Your effective batch size (EBS) looks OK, as we used an EBS of 32. I'm not familiar with your distributed setup, so please double-check that part.
Hopefully, changing those parameters will get the replication working.
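Concretely, the diff against your config is just the following (a sketch; keep the rest of your settings as they are):

training_args = GOLDConfig(
    # ... your other settings unchanged ...
    learning_rate=1e-7,             # was 5e-6
    lr_scheduler_type="cosine",     # was "linear"
    num_train_epochs=5,             # was 1
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,  # EBS = 16 * 2 * num_processes; we used an EBS of 32
)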
Thanks, will try those! Forgot to mention, but I am also using LoRA:
peft_config = LoraConfig(
    r=16,
    lora_alpha=16 * 2,
)
Do you think these would require much adjustment to the training params?
We didn't test training with LoRA, so we don't have results for that. Judging from the results published by Thinking Machines, performance degrades with LoRA, but it still performs well (see Figure 8 from their blog).
I managed to reproduce the results, thanks for your help!
Here are my hyperparams with LoRA:
training_args = GOLDConfig(
    lmbda=1.0,
    beta=0.0,
    use_uld_loss=True,
    uld_use_hybrid_loss=True,
    learning_rate=5e-5,
    warmup_ratio=0.05,
    max_length=1024 + 136,
    max_completion_length=1024,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
)
peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_dropout=0.1,
)
Teacher: Qwen/Qwen3-4B-Instruct-2507
Students: Qwen/Qwen2.5-1.5B-Instruct (0.746 at 900 steps) and meta-llama/Llama-3.2-1B-Instruct (0.608 at 800 steps)
I trained on a train/test split of verified_Qwen3-4B-Instruct-2507 and evaluated on the test split, so the score on the real test set is likely slightly lower.
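In case it helps others, the split was along these lines (the dataset id below is a placeholder for wherever the verified_Qwen3-4B-Instruct-2507 completions live):

from datasets import load_dataset

# Placeholder repo id; point this at the dataset holding the
# verified_Qwen3-4B-Instruct-2507 completions.
dataset = load_dataset("my-org/countdown-verified_Qwen3-4B-Instruct-2507", split="train")

# Hold out a small test split for eval. Since it comes from the same verified
# pool as the training data, scores on it will be a bit optimistic compared to
# the real test set.
splits = dataset.train_test_split(test_size=0.05, seed=42)
train_dataset, eval_dataset = splits["train"], splits["test"]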
Made a few patches to GOLDTrainer to fix some tokenization bugs, which seem to have improved the scores even more.
This is amazing! Which bugs did you encounter?
You also achieve great performance with fewer steps (with and without the patch), so I guess LoRA works quite well with GOLD. That is an awesome finding.
I let the Llama model train for a bit longer.
Ran eval on the full 10k test subset. Results:
- Qwen/Qwen2.5-1.5B-Instruct (1200 steps): 0.6314
- meta-llama/Llama-3.2-1B-Instruct (1600 steps): 0.556
(Lower than eval scores from the verified_Qwen3-4B-Instruct-2507 train/test split as expected)
Which bugs did you encounter?
Special tokens aren't handled properly when translating the student prompt to the teacher prompt. This won't break training, but it might cause downstream issues like not adhering to the system instruction. There were also some prompt alignment issues, if I remember correctly. I'll post a patch at some point.
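To illustrate the kind of mismatch I mean (a minimal sketch, not the actual patch; it just contrasts a naive decode/re-encode of the student prompt with re-applying the teacher's chat template to the original messages):

from transformers import AutoTokenizer

student_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
teacher_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Use 3, 5, 7, 9 to make 24."},
]

# Naive translation: decode the student prompt and re-tokenize it with the
# teacher tokenizer. The Llama chat markers mean nothing to the Qwen tokenizer
# and end up as plain text, so the teacher never sees a proper system turn.
student_ids = student_tok.apply_chat_template(messages, add_generation_prompt=True)
naive_teacher_ids = teacher_tok(student_tok.decode(student_ids))["input_ids"]

# Safer: rebuild the teacher prompt from the original messages using the
# teacher's own chat template.
teacher_ids = teacher_tok.apply_chat_template(messages, add_generation_prompt=True)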
