	---
	base_model: meta-llama/Llama-3.2-1B-Instruct
	tags:
	- dpo
	- preference-learning
	- llm-judge
	- peft
	- lora
	license: llama3.2
	language:
	- en
	---

	# Llama-3.2-1B DPO Fine-tuned (LLM Judge)

	This is a LoRA adapter for Llama-3.2-1B-Instruct, fine-tuned with DPO (Direct Preference Optimization)
	on preference pairs labeled by an LLM judge.

	## Training Details

	- Base Model: meta-llama/Llama-3.2-1B-Instruct
	- Training Method: DPO (Direct Preference Optimization)
	- Dataset: LLM Judge preference pairs (15 samples)
	- LoRA Configuration: r=16, alpha=32
	- Training Epochs: 3
	- Beta (DPO temperature): 0.1
	- Learning Rate: 5e-5
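
To make the role of beta concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It is illustrative only and is not the training code used for this model; the log-probability values are hypothetical, and only `beta=0.1` comes from the settings above.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin))."""
    # How much more the policy favors the chosen response over the
    # rejected one, relative to the frozen reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) is the same as log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))

# Hypothetical sequence log-probabilities, for illustration only.
# Before any update the policy equals the reference, so the margin is 0
# and the loss is log(2), about 0.693.
start_loss = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# After the policy shifts probability toward the chosen response,
# the margin is positive and the loss drops.
improved_loss = dpo_loss(-10.0, -17.0, -12.0, -15.0)

print(start_loss, improved_loss)
```

A smaller beta lets the policy drift further from the reference model before the loss saturates; a larger beta keeps it closer.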

	## Preference Collection Method

	The training dataset was created using an LLM-based judge system that evaluates responses based on:
	- Helpfulness
	- Accuracy
	- Safety
	- Coherence
	- Conciseness
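
This card does not say how the per-criterion judgments are combined into a preference. As one possible sketch, assuming the judge returns a numeric score per criterion and the response with the higher unweighted mean is preferred, a pair could be built as below. The function name, score scale, and example texts are hypothetical; the `prompt`/`chosen`/`rejected` keys follow the record layout commonly used for DPO training data.

```python
from statistics import mean

CRITERIA = ("helpfulness", "accuracy", "safety", "coherence", "conciseness")

def to_preference_pair(prompt, response_a, scores_a, response_b, scores_b):
    """Return a DPO-style record, or None when the judge scores tie."""
    avg_a = mean(scores_a[c] for c in CRITERIA)
    avg_b = mean(scores_b[c] for c in CRITERIA)
    if avg_a == avg_b:
        return None  # no clear preference, so skip this pair
    if avg_a > avg_b:
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Hypothetical judge scores on a 1-5 scale.
scores_a = {"helpfulness": 4, "accuracy": 5, "safety": 5, "coherence": 4, "conciseness": 3}
scores_b = {"helpfulness": 3, "accuracy": 3, "safety": 5, "coherence": 4, "conciseness": 5}

pair = to_preference_pair("What is DPO?", "Answer A", scores_a, "Answer B", scores_b)
print(pair)
```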

	## Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel

	# Load base model
	base_model = AutoModelForCausalLM.from_pretrained(
	    "meta-llama/Llama-3.2-1B-Instruct",
	    device_map="auto",
	)

	# Load LoRA adapter
	model = PeftModel.from_pretrained(base_model, "Zickl/llama32-1b-dpo-llm-judge")

	# Load tokenizer
	tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

	# Generate
	messages = [{"role": "user", "content": "Your question here"}]
	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
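	# The chat template already inserts the BOS token, so don't add it a second time
	inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
	outputs = model.generate(**inputs, max_new_tokens=256)
	# Decode only the newly generated tokens, not the echoed prompt
	new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
	print(tokenizer.decode(new_tokens, skip_special_tokens=True))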
	```

	## Training Logs

	- Agreement Rate (LLM Judge vs PairRM): 93.3%
	- Training completed successfully with stable loss convergence
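
For reference, an agreement rate like the one above can be computed as the fraction of pairs on which both judges pick the same response. The sketch below uses hypothetical labels; it assumes the rate was measured on the same 15 pairs, in which case 93.3% corresponds to 14 of 15 matching.

```python
def agreement_rate(judge_labels, pairrm_labels):
    """Fraction of pairs where both judges prefer the same response."""
    if not judge_labels or len(judge_labels) != len(pairrm_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == p for j, p in zip(judge_labels, pairrm_labels))
    return matches / len(judge_labels)

# Hypothetical labels: "A" or "B" marks the preferred response in each pair.
judge = ["A"] * 15
pairrm = ["A"] * 14 + ["B"]  # disagrees on one pair

rate = agreement_rate(judge, pairrm)
print(f"{rate:.1%}")  # 93.3%
```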

	## Limitations

	- Trained on a very small dataset (15 preference pairs)
	- May inherit the biases of the LLM judge used for labeling
	- Optimized for the judge's evaluation criteria (helpfulness, accuracy, safety, coherence, conciseness)
	- 1B parameter model has inherent capability limits

	## Citation

	If you use this model, please cite:
	```bibtex
	@misc{llama32-dpo-llm-judge,
	  author = {Zickl},
	  title = {Llama-3.2-1B DPO Fine-tuned with LLM Judge},
	  year = {2024},
	  publisher = {HuggingFace},
	  url = {https://huggingface.co/Zickl/llama32-1b-dpo-llm-judge}
	}
	```