New all-layer 160M-token finetune: tildeopen-30b-mu-instruct

#14
by martinsu - opened

SPACE: https://huggingface.co/spaces/martinsu/tildeopen-30b-mu-instruct-space

The Space will probably keep running, with usage quotas resetting at midnight.

MODEL: https://huggingface.co/martinsu/tildeopen-30b-mu-instruct

Kudos to the Tilde team for a great base model and for taming that large LUMI beast; that was probably a crazy journey!

This is a fine-tuned 30B multilingual instruction model. It shows strong performance on the EuroBlocks multilingual evaluation compared to similarly sized models, with notably concise outputs.

These benchmarks for now are basically smoke tests to verify I didn't create a disaster. Always run your own evaluations for your specific use case.

I'll definitely use it for Latvian (LV) language work as a Gemma 3 replacement; it seems more capable. It also seems to have picked up proper alignment from the broad training sets, at least at a basic level.

I'll run and publish more tests, perhaps using quantization.
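In the meantime, if you want to try it locally, here is a minimal sketch of loading the model 4-bit quantized with transformers and bitsandbytes. The quantization config and generation settings are my assumptions (common defaults), not a tested recipe:

```python
# Minimal sketch: load the 30B model 4-bit quantized via bitsandbytes.
# The quantization settings are common defaults, not a published recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "martinsu/tildeopen-30b-mu-instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

messages = [{"role": "user", "content": "Pastāsti īsi par Rīgu."}]  # "Tell me briefly about Riga."
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```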

On top of this fine-tune, one can use a lighter follow-up pass to nudge the model toward the right predictions for a specific domain, as sketched below.
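For example, a small LoRA adapter on top of the instruct weights keeps such a follow-up pass cheap. This is only a sketch with illustrative hyperparameters for a llama-style model, not something I have run:

```python
# Sketch of a light LoRA pass on top of the fine-tune; rank, dropout and
# target modules are illustrative assumptions, not tested values.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("martinsu/tildeopen-30b-mu-instruct")

lora_config = LoraConfig(
    r=8,                                   # low rank = small, cheap update
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # llama-style attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of the weights
```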

Any suggestions or ideas? As for me, I now have a new local Latvian LLM workhorse; I hope others will find it useful too.

Hyperparameters:

LR: 2e-5, cosine schedule, 3% warmup
Batch: 24 effective
Seq length: 4096
Weight decay: 0.01, grad clip: 1.0
Steps: 7,514 (1 epoch)
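For reference, here is roughly how those settings map onto Hugging Face TrainingArguments. The per-device batch size / gradient accumulation split and bf16 are assumptions (only the 24 effective batch is stated above), and the 4096 sequence length is enforced at tokenization, not here:

```python
# How the hyperparameters above translate to TrainingArguments. The batch
# split (4 x 6 = 24 effective) and bf16 are assumptions, not stated values.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="tildeopen-30b-mu-instruct-sft",  # hypothetical output path
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,                  # 3% warmup
    per_device_train_batch_size=4,      # assumption
    gradient_accumulation_steps=6,      # 4 * 6 = 24 effective batch
    weight_decay=0.01,
    max_grad_norm=1.0,                  # grad clip 1.0
    num_train_epochs=1,                 # 7,514 steps in one epoch
    bf16=True,                          # assumption
)
```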

Data (163M tokens, 181K examples):

HuggingFaceH4/ultrachat_200k (20% sampling) → 41.6K examples, 59.7M tokens
utter-project/EuroBlocks-SFT-Synthetic-1124 (20% sampling) → 85K examples, 58.6M tokens
galileo-ai/ragbench, all 12 subsets (30% sampling) → 22K examples, 26.4M tokens
    Subsets: covidqa, cuad, delucionqa, emanual, expertqa, finqa, hagrid, hotpotqa, msmarco, pubmedqa, tatqa, techqa
martinsu/latvian-wikipedia-qa-gemma3 (20% sampling, filtered) → 22.3K examples, 16.7M tokens
yahma/alpaca-cleaned (20% sampling) → 10.4K examples, 2.5M tokens
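A sketch of how that sampling could be reproduced with the datasets library is below. Split names, the shuffle seed, and taking a simple head-of-shuffle sample are assumptions, and the extra filtering on the Latvian QA set is not shown:

```python
# Sketch of the data mix: sample a fixed fraction of each source and
# concatenate. Split names and seed are assumptions; the Latvian QA
# filtering step mentioned above is omitted.
from datasets import load_dataset, concatenate_datasets

def sample(ds, frac, seed=42):
    return ds.shuffle(seed=seed).select(range(int(len(ds) * frac)))

parts = [
    sample(load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft"), 0.20),
    sample(load_dataset("utter-project/EuroBlocks-SFT-Synthetic-1124", split="train"), 0.20),
    sample(load_dataset("martinsu/latvian-wikipedia-qa-gemma3", split="train"), 0.20),
    sample(load_dataset("yahma/alpaca-cleaned", split="train"), 0.20),
]

# ragbench ships as 12 separate configs, each sampled at 30%
for subset in ["covidqa", "cuad", "delucionqa", "emanual", "expertqa", "finqa",
               "hagrid", "hotpotqa", "msmarco", "pubmedqa", "tatqa", "techqa"]:
    parts.append(sample(load_dataset("galileo-ai/ragbench", subset, split="train"), 0.30))

train_ds = concatenate_datasets(parts).shuffle(seed=42)
```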

Language breakdown (163M tokens across 25 languages):

English: 117.7M (72%) - primary language
Latvian: 16.7M (10%) - European focus
Chinese: 10.1M (6%) - Asian coverage
Portuguese: 3.0M (2%) - Romance
Italian: 2.3M (1.4%) - Romance
Spanish: 2.1M (1.3%) - Romance
Hindi: 2.0M (1.2%)
French: 1.8M (1.1%) - Romance
German: 1.4M (0.8%) - Germanic
Dutch: 1.1M (0.7%) - Germanic
Plus 15 more: Japanese, Ukrainian, Swedish, Hungarian, Polish, Czech, Russian, Korean, Romanian, Finnish, Greek, Slovak, Norwegian, Slovenian, Estonian (4.9M combined, 3%)

Response-only training: a custom collator masks the user/system messages so the loss is computed only on assistant responses.
ChatML template format.
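For readers curious what response-only masking looks like, here is a minimal sketch for ChatML-formatted sequences. It is a simplified stand-in, not my actual collator; TRL's DataCollatorForCompletionOnlyLM offers a ready-made alternative:

```python
# Minimal sketch of response-only loss masking for ChatML sequences: every
# token outside the assistant spans gets label -100 so cross-entropy ignores
# it. Span detection is simplified; whether to keep <|im_end|> in the loss
# is a design choice (here it is masked).
import torch

IGNORE_INDEX = -100

def mask_non_assistant(input_ids: torch.Tensor, tokenizer) -> torch.Tensor:
    """Return labels where only assistant-response tokens are kept."""
    assistant_ids = tokenizer.encode("<|im_start|>assistant\n", add_special_tokens=False)
    end_ids = tokenizer.encode("<|im_end|>", add_special_tokens=False)

    labels = torch.full_like(input_ids, IGNORE_INDEX)
    ids = input_ids.tolist()
    i = 0
    while i < len(ids):
        # look for the start of an assistant turn
        if ids[i:i + len(assistant_ids)] == assistant_ids:
            start = i + len(assistant_ids)
            j = start
            # keep tokens until the matching <|im_end|>
            while j < len(ids) and ids[j:j + len(end_ids)] != end_ids:
                j += 1
            labels[start:j] = input_ids[start:j]
            i = j
        else:
            i += 1
    return labels
```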

Hi, Martins!

Thanks for the great work on the model; it is quite impressive! It is nice to see the base model being put to real use by the community.
If you want to get in touch for some feedback and exchange experience in greater detail, please feel free to contact me: martins.kronis@tilde.lv

Best,
Martins from Tilde Open Team
