---
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- dpo
- preference-learning
- llm-judge
- peft
- lora
license: llama3.2
language:
- en
---
# Llama-3.2-1B DPO Fine-tuned (LLM Judge)
This model is a DPO (Direct Preference Optimization) fine-tuned version of Llama-3.2-1B-Instruct, trained on preference data generated using an LLM judge system.
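
For context, DPO directly optimizes the policy against a frozen reference model (here, the base Instruct model) using the standard preference objective, where β is the temperature listed under Training Details:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

Here $y_w$ and $y_l$ are the judge-preferred (chosen) and rejected responses for prompt $x$, and $\sigma$ is the logistic function.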
## Training Details
- Base Model: meta-llama/Llama-3.2-1B-Instruct
- Training Method: DPO (Direct Preference Optimization)
- Dataset: LLM Judge preference pairs (15 samples)
- LoRA Configuration: r=16, alpha=32 (a training configuration sketch follows this list)
- Training Epochs: 3
- Beta (DPO temperature): 0.1
- Learning Rate: 5e-5
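
The hyperparameters above map onto a setup like the following. This is a minimal sketch assuming TRL's `DPOTrainer`/`DPOConfig` with a PEFT LoRA adapter; the dataset file and output directory are placeholders, and argument names may differ across TRL versions.

```python
# Hypothetical reproduction sketch of the training setup above.
# Assumes a preference dataset with "prompt", "chosen", "rejected" columns.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA adapter matching the configuration listed above
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

# DPO hyperparameters from the list above
training_args = DPOConfig(
    output_dir="llama32-1b-dpo-output",  # placeholder path
    beta=0.1,
    learning_rate=5e-5,
    num_train_epochs=3,
)

# "preference_pairs.jsonl" is a placeholder for the LLM-judge dataset
dataset = load_dataset("json", data_files="preference_pairs.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # called `tokenizer=` in older TRL releases
    peft_config=peft_config,
)
trainer.train()
```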
## Preference Collection Method
The training dataset was created using an LLM-based judge system that scores candidate responses on the following criteria (a simplified judging sketch follows the list):
- Helpfulness
- Accuracy
- Safety
- Coherence
- Conciseness
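
The sketch below shows, in simplified form, how a judge comparison can be turned into a DPO preference pair. It is illustrative only: the actual judge prompt and scoring rubric for this model are not published, and `call_judge_model` is a placeholder for whatever LLM endpoint acts as the judge.

```python
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two responses to the
user prompt on helpfulness, accuracy, safety, coherence, and conciseness.
Answer with exactly "A" or "B" for the better response.

Prompt: {prompt}

Response A: {response_a}

Response B: {response_b}
"""

def build_preference_pair(prompt, response_a, response_b, call_judge_model):
    """Return a DPO-style record with `prompt`, `chosen`, `rejected` keys."""
    verdict = call_judge_model(
        JUDGE_TEMPLATE.format(prompt=prompt, response_a=response_a, response_b=response_b)
    ).strip()
    chosen, rejected = (response_a, response_b) if verdict == "A" else (response_b, response_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```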
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    device_map="auto",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "Zickl/llama32-1b-dpo-llm-judge")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Generate
messages = [{"role": "user", "content": "Your question here"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
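
If you prefer to serve the model without PEFT at inference time, the LoRA weights can be merged into the base model; the output path below is just an example.

```python
# Optionally merge the LoRA weights into the base model for standalone
# inference or export (supported by PEFT for LoRA adapters).
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llama32-1b-dpo-llm-judge-merged")
```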
## Training Logs
- Agreement Rate (LLM Judge vs PairRM): 93.3% (a minimal computation sketch follows this list)
- Training completed successfully with stable loss convergence
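
The agreement rate compares which of the two candidate responses the LLM judge preferred against PairRM's preference on the same pairs. A minimal sketch of how such a rate can be computed; the label lists below are illustrative, not the actual evaluation data.

```python
# Illustrative computation of judge-vs-PairRM agreement.
# `judge_prefs` and `pairrm_prefs` are parallel lists of 0/1 labels
# indicating which of the two candidate responses each ranker preferred.
def agreement_rate(judge_prefs, pairrm_prefs):
    matches = sum(j == p for j, p in zip(judge_prefs, pairrm_prefs))
    return matches / len(judge_prefs)

# e.g. 14 matching preferences out of 15 pairs ≈ 93.3%
print(f"{agreement_rate([1] * 14 + [0], [1] * 15):.1%}")
```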
## Limitations
- Trained on a very small dataset (15 preference pairs), so behavioral changes from the base model are limited
- May inherit biases from the LLM judge used to label preferences
- Optimized for specific evaluation criteria
- 1B parameter model has inherent capability limits
## Citation
If you use this model, please cite:
```bibtex
@misc{llama32-dpo-llm-judge,
  author = {Zickl},
  title = {Llama-3.2-1B DPO Fine-tuned with LLM Judge},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Zickl/llama32-1b-dpo-llm-judge}
}
```