---
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- dpo
- preference-learning
- llm-judge
- peft
- lora
license: llama3.2
language:
- en
---
# Llama-3.2-1B DPO Fine-tuned (LLM Judge)
This model is a DPO (Direct Preference Optimization) fine-tuned version of Llama-3.2-1B-Instruct, trained on preference data generated using an LLM judge system.
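
For context, DPO directly optimizes the policy against a frozen reference model (here, the base Instruct model) using the standard preference objective, where β is the temperature listed under Training Details:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

Here $y_w$ and $y_l$ are the judge-preferred (chosen) and rejected responses for prompt $x$, and $\sigma$ is the logistic function.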
## Training Details
- Base Model: meta-llama/Llama-3.2-1B-Instruct
- Training Method: DPO (Direct Preference Optimization)
- Dataset: LLM Judge preference pairs (15 samples)
- LoRA Configuration: r=16, alpha=32 (a training configuration sketch follows this list)
- Training Epochs: 3
- Beta (DPO temperature): 0.1
- Learning Rate: 5e-5
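
The hyperparameters above map onto a setup like the following. This is a minimal sketch assuming TRL's `DPOTrainer`/`DPOConfig` with a PEFT LoRA adapter; the dataset file and output directory are placeholders, and argument names may differ across TRL versions.

```python
# Hypothetical reproduction sketch of the training setup above.
# Assumes a preference dataset with "prompt", "chosen", "rejected" columns.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA adapter matching the configuration listed above
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

# DPO hyperparameters from the list above
training_args = DPOConfig(
    output_dir="llama32-1b-dpo-output",  # placeholder path
    beta=0.1,
    learning_rate=5e-5,
    num_train_epochs=3,
)

# "preference_pairs.jsonl" is a placeholder for the LLM-judge dataset
dataset = load_dataset("json", data_files="preference_pairs.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # called `tokenizer=` in older TRL releases
    peft_config=peft_config,
)
trainer.train()
```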
## Preference Collection Method
The training dataset was created using an LLM-based judge system that scores candidate responses on the following criteria (a simplified judging sketch follows the list):
- Helpfulness
- Accuracy
- Safety
- Coherence
- Conciseness
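
The sketch below shows, in simplified form, how a judge comparison can be turned into a DPO preference pair. It is illustrative only: the actual judge prompt and scoring rubric for this model are not published, and `call_judge_model` is a placeholder for whatever LLM endpoint acts as the judge.

```python
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two responses to the
user prompt on helpfulness, accuracy, safety, coherence, and conciseness.
Answer with exactly "A" or "B" for the better response.

Prompt: {prompt}

Response A: {response_a}

Response B: {response_b}
"""

def build_preference_pair(prompt, response_a, response_b, call_judge_model):
    """Return a DPO-style record with `prompt`, `chosen`, `rejected` keys."""
    verdict = call_judge_model(
        JUDGE_TEMPLATE.format(prompt=prompt, response_a=response_a, response_b=response_b)
    ).strip()
    chosen, rejected = (response_a, response_b) if verdict == "A" else (response_b, response_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```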
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    device_map="auto",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "Zickl/llama32-1b-dpo-llm-judge")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Generate
messages = [{"role": "user", "content": "Your question here"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
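
If you prefer to serve the model without PEFT at inference time, the LoRA weights can be merged into the base model; the output path below is just an example.

```python
# Optionally merge the LoRA weights into the base model for standalone
# inference or export (supported by PEFT for LoRA adapters).
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llama32-1b-dpo-llm-judge-merged")
```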
## Training Logs
- Agreement Rate (LLM Judge vs PairRM): 93.3% (a minimal computation sketch follows this list)
- Training completed successfully with stable loss convergence
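
The agreement rate compares which of the two candidate responses the LLM judge preferred against PairRM's preference on the same pairs. A minimal sketch of how such a rate can be computed; the label lists below are illustrative, not the actual evaluation data.

```python
# Illustrative computation of judge-vs-PairRM agreement.
# `judge_prefs` and `pairrm_prefs` are parallel lists of 0/1 labels
# indicating which of the two candidate responses each ranker preferred.
def agreement_rate(judge_prefs, pairrm_prefs):
    matches = sum(j == p for j, p in zip(judge_prefs, pairrm_prefs))
    return matches / len(judge_prefs)

# e.g. 14 matching preferences out of 15 pairs ≈ 93.3%
print(f"{agreement_rate([1] * 14 + [0], [1] * 15):.1%}")
```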
## Limitations
- Trained on a very small dataset (15 preference pairs), so behavioral changes from the base model are limited
- May inherit biases from the LLM judge used to label preferences
- Optimized for specific evaluation criteria
- 1B parameter model has inherent capability limits
## Citation
If you use this model, please cite:
```bibtex
@misc{llama32-dpo-llm-judge,
  author = {Zickl},
  title = {Llama-3.2-1B DPO Fine-tuned with LLM Judge},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Zickl/llama32-1b-dpo-llm-judge}
}
```