Hyperparameters for replication
Hi team, great blog post and research! I am trying to replicate the results on the countdown task mentioned in the write-up, but I'm having trouble doing so without the same training parameters. I tried to distill Qwen/Qwen3-4B-Instruct-2507 into Qwen/Qwen2.5-1.5B-Instruct with
training_args = GOLDConfig(
    lmbda=1.0,
    beta=0.0,
    use_uld_loss=True,
    uld_use_hybrid_loss=True,
    learning_rate=5e-6,
    warmup_ratio=0.05,
    max_length=1024 + 136,
    max_completion_length=1024,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    lr_scheduler_type="linear",
    ...
)
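For context, the rest of the wiring looks roughly like this (a simplified sketch: the import path and the GOLDTrainer constructor arguments are assumed to follow the pattern of other TRL trainers, and the dataset id is a placeholder):

from datasets import load_dataset
from trl.experimental.gold import GOLDConfig, GOLDTrainer  # import path may differ across TRL versions

# Placeholder dataset id for the countdown prompts.
train_dataset = load_dataset("my-org/countdown-prompts", split="train")

trainer = GOLDTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    teacher_model="Qwen/Qwen3-4B-Instruct-2507",
    args=training_args,              # the GOLDConfig shown above
    train_dataset=train_dataset,
)
trainer.train()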
Loss decreases over 500 steps, but the eval scores only fluctuate around the starting value. Before running a longer training, I would really appreciate it if you could share the configs used for the results presented in the blog post.
Hi @thomasip , thank you!
The differences I see in your setup compared to what we had are:
- learning_rate: We used 1e-7.
- lr_scheduler_type: 'cosine'
- num_train_epochs: 5
- Your effective batch size (EBS) looks OK, as we used an EBS of 32. I'm not familiar with your distributed setup, so please double-check that part.
Hopefully, changing those parameters will get the replication working.
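Concretely, the diff against your config is just the following (a sketch; keep the rest of your settings as they are):

training_args = GOLDConfig(
    # ... your other settings unchanged ...
    learning_rate=1e-7,             # was 5e-6
    lr_scheduler_type="cosine",     # was "linear"
    num_train_epochs=5,             # was 1
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,  # EBS = 16 * 2 * num_processes; we used an EBS of 32
)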
Thanks, will try those! Forgot to mention, but I am also using LoRA:
peft_config = LoraConfig(
    r=16,
    lora_alpha=16 * 2,
)
Do you think these would require much adjustment to the training params?
We didn't test training with LoRA, so we don't have results for that. Judging from the results published by Thinking Machines, performance degrades with LoRA, but it still performs well (see Figure 8 from their blog).
I managed to reproduce the results, thanks for your help!
Here are my hyperparams with LoRA:
training_args = GOLDConfig(
    lmbda=1.0,
    beta=0.0,
    use_uld_loss=True,
    uld_use_hybrid_loss=True,
    learning_rate=5e-5,
    warmup_ratio=0.05,
    max_length=1024 + 136,
    max_completion_length=1024,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
)
peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_dropout=0.1,
)
Teacher: Qwen/Qwen3-4B-Instruct-2507
Students: Qwen/Qwen2.5-1.5B-Instruct (0.746 at 900 steps) and meta-llama/Llama-3.2-1B-Instruct (0.608 at 800 steps)
I trained on a train/test split of verified_Qwen3-4B-Instruct-2507 and evaluated on the test split, so the score on the real test set is likely slightly lower.
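In case it helps others, the split was along these lines (the dataset id below is a placeholder for wherever the verified_Qwen3-4B-Instruct-2507 completions live):

from datasets import load_dataset

# Placeholder repo id; point this at the dataset holding the
# verified_Qwen3-4B-Instruct-2507 completions.
dataset = load_dataset("my-org/countdown-verified_Qwen3-4B-Instruct-2507", split="train")

# Hold out a small test split for eval. Since it comes from the same verified
# pool as the training data, scores on it will be a bit optimistic compared to
# the real test set.
splits = dataset.train_test_split(test_size=0.05, seed=42)
train_dataset, eval_dataset = splits["train"], splits["test"]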
Made a few patches to GOLDTrainer to fix some tokenization bugs, which seem to have improved the scores even more.
This is amazing! Which bugs did you encounter?
You also achieve great performance with fewer steps (with and without the patch), so I guess LoRA works quite well with GOLD. That is an awesome finding.
I let the Llama model train for a bit longer.
Ran eval on the full 10k test subset. Results:
- Qwen/Qwen2.5-1.5B-Instruct (1200 steps): 0.6314
- meta-llama/Llama-3.2-1B-Instruct (1600 steps): 0.556
(Lower than eval scores from the verified_Qwen3-4B-Instruct-2507 train/test split as expected)
Which bugs did you encounter?
Special tokens aren't handled properly when translating the student prompt to the teacher prompt. This won't break training, but it might cause downstream issues like not adhering to the system instruction. There were also some prompt alignment issues, if I remember correctly. I'll post a patch at some point.
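To illustrate the kind of mismatch I mean (a minimal sketch, not the actual patch; it just contrasts a naive decode/re-encode of the student prompt with re-applying the teacher's chat template to the original messages):

from transformers import AutoTokenizer

student_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
teacher_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Use 3, 5, 7, 9 to make 24."},
]

# Naive translation: decode the student prompt and re-tokenize it with the
# teacher tokenizer. The Llama chat markers mean nothing to the Qwen tokenizer
# and end up as plain text, so the teacher never sees a proper system turn.
student_ids = student_tok.apply_chat_template(messages, add_generation_prompt=True)
naive_teacher_ids = teacher_tok(student_tok.decode(student_ids))["input_ids"]

# Safer: rebuild the teacher prompt from the original messages using the
# teacher's own chat template.
teacher_ids = teacher_tok.apply_chat_template(messages, add_generation_prompt=True)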
