GPT-5-Distill-llama3.1-8B-Instruct

Unsloth Llama-3 Distillation

Model Summary

GPT-5-Distill-llama3.1-8B-Instruct is a fine-tuned version of meta-llama/Llama-3.1-8B-Instruct, designed to distill the capabilities of high-performance models (labeled as GPT-5 in source datasets) into a more efficient 8B parameter footprint.

This model was trained using Unsloth on a curated mix of approximately 164,000 high-quality instruction-response pairs, with an emphasis on complex reasoning and restricted to responses labeled flaw == "normal" (i.e., judged error-free) in the source data.

  • Base Model: meta-llama/Llama-3.1-8B-Instruct
  • Architecture: Llama 3.1 (8B parameters)
  • Language: English (Primary)
  • Context Window: 32,768 tokens
  • Fine-tuning Framework: Unsloth (QLoRA)
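
A minimal inference sketch using the Hugging Face transformers chat-template API (the generation settings are illustrative, not values shipped with the model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Jackrong/GPT-5-Distill-llama3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # weights are stored in BF16
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain QLoRA in two sentences."},
]

# The model was trained with the standard Llama-3 chat template,
# so apply_chat_template produces the expected prompt format.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```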

✨ Key Advantages of GPT-5 Distillation

This model represents a shift towards "Super-Knowledge Distillation", where a smaller, efficient student model learns from a significantly more capable teacher.

  • 🚀 Frontier-Level Reasoning: By training on dataset samples attributed to GPT-5, the model acquires complex reasoning patterns, nuance, and problem-solving strategies that are typically absent in standard datasets or smaller models.
  • ⚡ Efficient Intelligence: Users can experience high-fidelity, coherent, and detailed responses on consumer hardware (e.g., a single GPU) without the latency, privacy concerns, or cost of querying giant proprietary APIs.
  • 💎 High-Purity Signal: The strict filtering for flaw == "normal" ensures the model is fine-tuned only on the highest-confidence, error-free responses. This minimizes "hallucination inheritance" and aligns the model with safe, helpful behaviors.
  • 🎯 Enhanced Nuance & Tone: Unlike standard fine-tunes that often sound robotic, this model mimics the more natural, conversational, and adaptive tone found in next-generation frontier models.

📚 Training Data

The model was trained on a high-quality blend of two datasets, totaling 163,896 samples:

  1. Chat-GPT-5-Chat-Response (160k samples)
    • Filtered specifically for entries labeled flaw == "normal" to ensure high-quality, safe, and coherent responses.
    • This dataset serves as the primary distillation source, aiming to mimic the response patterns of advanced large language models.
  2. ShareGPT-Qwen3-235B-A22B-Instuct-2507 (3.9k samples)
    • Approximately 3.9k examples with an average of about 5 rounds of dialogue per scenario, included to enhance the model’s instruction-following ability and task-completion efficiency.

All data was formatted using the standard Llama-3 Chat Template.
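
The sketch below shows what that preprocessing can look like: filter the primary set to flaw == "normal" and render each conversation with the Llama-3 chat template. The dataset repo ID and the column names ("flaw", "messages") are assumptions for illustration; adapt them to the actual schemas of the two datasets.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Primary distillation source (~160k samples); the repo ID and column names
# below are illustrative assumptions, not confirmed identifiers.
primary = load_dataset("Jackrong/Chat-GPT-5-Chat-Response", split="train")
primary = primary.filter(lambda ex: ex["flaw"] == "normal")

def to_llama3_text(example):
    # Render the list of chat turns into one Llama-3 formatted training string.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

train_ds = primary.map(to_llama3_text, remove_columns=primary.column_names)
print(train_ds[0]["text"][:300])
```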

βš™οΈ Training Details

  • Hardware: NVIDIA H100
  • Sequence Length: 32,768 tokens (Long Context Support)
  • Batch Size: 4 per device (Effective Batch Size: 32 via Gradient Accumulation)
  • Learning Rate: 2e-5
  • Scheduler: Linear
  • Optimizer: AdamW 8-bit
  • LoRA Rank (r): 32
  • LoRA Alpha: 32
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
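
For reference, a training-configuration sketch that mirrors the hyperparameters above, written against the typical Unsloth + TRL SFT workflow; this is not the author's exact script, and argument names vary across TRL versions. The epoch count and the accumulation of 8 steps on a single GPU (4 × 8 = 32 effective) are assumptions.

```python
from datasets import Dataset
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Stand-in dataset; use the "text" column prepared as in the data sketch above.
train_ds = Dataset.from_dict({"text": ["<formatted Llama-3 chat sample>"]})

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    max_seq_length=32768,
    load_in_4bit=True,              # QLoRA: 4-bit quantized base weights
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    dataset_text_field="text",
    max_seq_length=32768,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,   # 4 x 8 = effective batch size 32 (single H100)
        learning_rate=2e-5,
        lr_scheduler_type="linear",
        optim="adamw_8bit",
        bf16=True,
        num_train_epochs=1,              # assumption: epoch count not reported
        output_dir="outputs",
    ),
)
trainer.train()
```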

πŸ›‘οΈ License & Limitations

  • License: This model is subject to the Llama 3.1 Community License.
  • Limitations: While this model is distilled from high-capability sources, it is still an 8B parameter model. It may hallucinate facts or struggle with extremely complex reasoning tasks compared to the original teacher models. The "GPT-5" naming refers to the source dataset labels and does not imply access to unreleased OpenAI weights.