
Question about Fine-Tuning Setup / Preferences

#1
by reedmayhew - opened

Hello! Thanks so much for using my datasets. I hope they were helpful! Those two are among my favorites.

Would you be willing to share more details about your fine-tuning process? I currently use Unsloth, and while it "gets the job done," the results can be a bit rough sometimes.

I noticed you mentioned several thousand steps of fine-tuning, so I'm interested in what your setup and params look like: for example, LoRA rank and alpha, which layers you had activated (q, v, k, o, up, down, mlp, etc.), learning rate, and anything else you've got!

Hoping to hear from you, keep on coding! - Reed

Hey,

Firstly, I want to thank you for your amazing datasets and models. They inspired me to start working on my own models and datasets.
I also use Unsloth notebooks for fine-tuning, but I kept the default LoRA parameters and just hoped they would work.
The only settings I changed were the learning rate (2e-5 instead of 2e-4) for some of my distills and the sequence length (higher gave better results).
After starting to work on these distills, I started collaborating with @armand0e (forming TeichAI). Since then, he's been tweaking the parameters and training the models.
I focus on creating the synthetic prompts for the datasets and exploring other methods (besides SFT) to improve our distills.
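For reference, here is a minimal sketch of the kind of Unsloth setup described above: default LoRA parameters, the learning rate lowered from 2e-4 to 2e-5, and a longer sequence length. The base model name, dataset file, sequence length, and step count are placeholders for illustration, not our exact configuration.

```python
# Minimal Unsloth SFT sketch: default LoRA params, lowered learning rate, longer sequences.
# Model name, dataset path, sequence length, and step count are illustrative placeholders.
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",   # placeholder base model
    max_seq_length=8192,             # higher sequence lengths tended to give better results
    load_in_4bit=True,
)

# Unsloth notebook defaults for the LoRA adapter, left unchanged.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

# Assumes the dataset already has a formatted "text" column (chat template applied).
dataset = load_dataset("json", data_files="distill_data.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="outputs",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-5,          # lowered from the notebook default of 2e-4
        max_steps=1000,              # placeholder; too many steps led to overfitting
        logging_steps=10,
    ),
)
trainer.train()
```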

Here is a fine-tuning run with heavy overfitting (I should have decreased the steps), but that was the base notebook at the time:
https://huggingface.co/Liontix/Qwen3-8B-Gemini-2.5-Pro-Distill/blob/main/Unsloth_Qwen3_Reasoning_Conversational_Edited.ipynb

Maybe @armand0e can give you more insights regarding parameter tweaking.

Thanks for the work and effort you put into your models and datasets!

Regards, Liontix

Accidentally posted twice (Wi-Fi is bugging out). Response is below :)

We made a rough attempt to publish our training code and some info for beginners to try to replicate what we do at https://docs.teichai.com (note: the "Open in Colab" buttons won't work at the moment, as we haven't made the notebooks on GitHub yet).

But again, as @Liontix mentioned, it does vary a lot depending on factors like the base model (number of params, architecture, etc.), dataset size, how much context you want to train on, and so on.

I normally start with the r, alpha, learning rate, and epochs/steps from the docs linked above, then make small adjustments and/or experiment with different checkpoints saved along the way. Overfitting is still a big issue we face, and we're actively exploring alternative training methods so we can transfer style without degrading what the original model is capable of. For now, this is the best general advice I can give.
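One concrete way to do that checkpoint comparison: save the adapter periodically during the run, then reload individual checkpoints and spot-check them to find the one before overfitting sets in. The paths, step counts, and hyperparameters below are illustrative placeholders, not settings from our docs.

```python
# Keep intermediate checkpoints during training, then reload one to spot-check it.
# Paths, step counts, and hyperparameters are illustrative placeholders.
from unsloth import FastLanguageModel
from trl import SFTConfig

args = SFTConfig(
    output_dir="outputs",
    learning_rate=2e-4,      # placeholder starting point; adjusted per run
    num_train_epochs=1,
    save_strategy="steps",
    save_steps=100,          # write a checkpoint every 100 steps
    save_total_limit=5,      # keep only the 5 most recent checkpoints
    logging_steps=10,
)

# After training, reload a specific adapter checkpoint and compare it with later ones.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/checkpoint-300",  # any saved checkpoint directory
    max_seq_length=8192,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)   # enable inference mode for quick spot checks
```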

As for which modules to target and things like that, I usually stick with whatever Unsloth recommends for those models (based on their published notebooks); otherwise I just target the standard linear layers in attention. The only exception so far has been gpt-oss-20b, as something just doesn't seem to be working for that model.
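For concreteness, this is roughly what those module lists look like in practice; the names follow Llama/Qwen-style architectures and are only a reference point, not our exact configuration for any given model.

```python
# Typical LoRA target-module lists for Llama/Qwen-style architectures (reference only;
# check the Unsloth notebook for the specific base model before reusing).
attention_proj = ["q_proj", "k_proj", "v_proj", "o_proj"]   # attention projections
mlp_proj = ["gate_proj", "up_proj", "down_proj"]            # MLP projections
target_modules = attention_proj + mlp_proj                  # the "standard" full set
```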

Hope all goes well! Please don't hesitate to reach out if you have any questions 😊
