---
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
  - dpo
  - preference-learning
  - llm-judge
  - peft
  - lora
license: llama3.2
language:
  - en
---

# Llama-3.2-1B DPO Fine-tuned (LLM Judge)

This model is a DPO (Direct Preference Optimization) fine-tuned version of Llama-3.2-1B-Instruct, trained on preference data generated using an LLM judge system.
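
For context, DPO fine-tunes the policy directly on preference pairs using the standard objective from Rafailov et al. (2023), where pi_theta is the policy being trained, pi_ref is the frozen reference model, and (y_w, y_l) are the chosen and rejected responses:

$$
\mathcal{L}_\text{DPO} = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)} \right) \right]
$$

The temperature beta (0.1 in this run; see Training Details) controls how far the policy may drift from the reference model.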

## Training Details

- **Base Model:** meta-llama/Llama-3.2-1B-Instruct
- **Training Method:** DPO (Direct Preference Optimization); a configuration sketch follows this list
- **Dataset:** LLM Judge preference pairs (15 samples)
- **LoRA Configuration:** r=16, alpha=32
- **Training Epochs:** 3
- **Beta (DPO temperature):** 0.1
- **Learning Rate:** 5e-5
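
The training script is not published with this card; the snippet below is a minimal sketch of how the hyperparameters above map onto TRL's `DPOTrainer` with a PEFT/LoRA config (recent TRL versions). The dataset file name and loading code are assumptions, not the original script.

```python
# Sketch only: reconstructs the run from the listed hyperparameters.
# The dataset path is an assumption, not the original training data.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A DPO dataset needs "prompt", "chosen", and "rejected" columns.
train_dataset = load_dataset("json", data_files="llm_judge_pairs.jsonl", split="train")

peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

args = DPOConfig(
    output_dir="llama32-1b-dpo-llm-judge",
    beta=0.1,               # DPO temperature
    learning_rate=5e-5,
    num_train_epochs=3,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,  # reference model is handled implicitly when training with PEFT
)
trainer.train()
```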

## Preference Collection Method

The training dataset was created using an LLM-based judge system that evaluates responses on five criteria (a sketch of this pattern follows the list):

- Helpfulness
- Accuracy
- Safety
- Coherence
- Conciseness
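
The judge system itself is not included in this repository. The sketch below shows one common pattern for pairwise LLM judging using a chat model via the `transformers` pipeline; the judge model, prompt wording, and answer parsing are illustrative assumptions, not the original implementation.

```python
# Illustrative pairwise LLM judge; NOT the original implementation.
from transformers import pipeline

# Stand-in judge model (assumption): any instruction-tuned chat model works here.
judge = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")

CRITERIA = "helpfulness, accuracy, safety, coherence, and conciseness"

def judge_pair(prompt: str, response_a: str, response_b: str) -> str:
    """Return 'A' or 'B' for the preferred response."""
    messages = [{
        "role": "user",
        "content": (
            f"Judge the two responses below on {CRITERIA}.\n\n"
            f"Prompt: {prompt}\n\n"
            f"Response A: {response_a}\n\n"
            f"Response B: {response_b}\n\n"
            "Reply with exactly one letter: A or B."
        ),
    }]
    reply = judge(messages, max_new_tokens=4)[0]["generated_text"][-1]["content"]
    return "B" if "B" in reply.upper() else "A"
```

Each winner then becomes the `chosen` response and the loser the `rejected` response in the DPO dataset.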

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    device_map="auto",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "Zickl/llama32-1b-dpo-llm-judge")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Generate
messages = [{"role": "user", "content": "Your question here"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the echoed prompt
response = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```
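
For deployment without a runtime PEFT dependency, the adapter can also be folded into the base weights with PEFT's `merge_and_unload` (the output directory name below is illustrative):

```python
# Merge the LoRA deltas into the base model and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("llama32-1b-dpo-merged")
tokenizer.save_pretrained("llama32-1b-dpo-merged")
```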

## Training Logs

- **Agreement Rate (LLM Judge vs. PairRM):** 93.3%
- Training completed successfully with stable loss convergence

## Limitations

- Trained on a small dataset (15 preference pairs)
- May inherit biases from the LLM judge
- Optimized for the specific evaluation criteria listed above
- The 1B-parameter base model has inherent capability limits

## Citation

If you use this model, please cite:

```bibtex
@misc{llama32-dpo-llm-judge,
  author = {Zickl},
  title = {Llama-3.2-1B DPO Fine-tuned with LLM Judge},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Zickl/llama32-1b-dpo-llm-judge}
}
```