trlm-stage-3-dpo-final-2 is the Stage 3 post-training model for the Tiny Reasoning Language Model (trlm) project.
This stage focuses on preference alignment using Direct Preference Optimization (DPO) with 50k preference pairs.
This stage improves the model's alignment, coherence, and the stability of its `<think>` reasoning traces.
This model was trained on the dataset:
Shekswess/trlm-dpo-stage-3-final-2
Dataset summary:
| Source Dataset | Split | Entries | % |
|---|---|---|---|
| scottgeng00/olmo-3-preference-mix-deltas_reasoning-yolo_scottmix-DECON-chfiltered | train | 50,000 | 100% |
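
For reference, the DPO stage can be reproduced in outline with TRL's `DPOTrainer`. The sketch below is illustrative only: it starts from the stated base model rather than the actual preceding SFT checkpoint, and the hyperparameters (`beta`, learning rate, batch size) are placeholder values, not the ones used for this release.

```python
# Illustrative sketch of the Stage 3 DPO setup using TRL's DPOTrainer.
# Assumptions: the preference dataset exposes the prompt/chosen/rejected
# columns DPOTrainer expects, and all hyperparameters are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# The actual run continues from the preceding SFT-stage checkpoint;
# the base model is used here only to keep the sketch self-contained.
start_checkpoint = "HuggingFaceTB/SmolLM2-135M"
model = AutoModelForCausalLM.from_pretrained(start_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(start_checkpoint)

# 50k preference pairs used for Stage 3
dataset = load_dataset("Shekswess/trlm-dpo-stage-3-final-2", split="train")

config = DPOConfig(
    output_dir="trlm-stage-3-dpo",
    beta=0.1,                       # strength of the KL penalty toward the reference model
    per_device_train_batch_size=4,
    learning_rate=5e-7,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,                    # a reference model is created internally when ref_model is omitted
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,     # older TRL releases use tokenizer= instead
)
trainer.train()
```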
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Shekswess/trlm-stage-3-dpo-final-2"

# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example inference with preference-aligned reasoning
messages = [
    {"role": "user", "content": "Explain why the sky is blue in simple terms."}
]

# Apply chat template
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
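
Because the decoded text may open with a `<think>` reasoning trace, it can be useful to separate the trace from the final answer before displaying it. This is a minimal sketch that assumes the trace is delimited by literal `<think>...</think>` tags; if the tags are registered as special tokens, decode with `skip_special_tokens=False` so they survive.

```python
import re

def split_think_trace(generated: str):
    """Split an optional <think>...</think> trace from the final answer.

    Assumes the model wraps its reasoning in <think> tags; if no tags are
    found, the whole text is treated as the answer.
    """
    match = re.search(r"<think>(.*?)</think>", generated, flags=re.DOTALL)
    if match is None:
        return None, generated.strip()
    trace = match.group(1).strip()
    answer = generated[match.end():].strip()
    return trace, answer

# Example: keep the tags when decoding so the trace can be extracted
raw = tokenizer.decode(outputs[0], skip_special_tokens=False)
trace, answer = split_think_trace(raw)
print("Reasoning trace:", trace)
print("Answer:", answer)
```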
Part of the Tiny Reasoning Language Model (trlm) post-training pipeline.
Base model: HuggingFaceTB/SmolLM2-135M