Update README.md
README.md (CHANGED)
The previous 17-line README is replaced by the 75-line card below. Its front matter listed `base_model: meta-llama-3.1-…`, the tags `text-generation-inference`, `transformers`, `unsloth`, and `llama`, and `language: en`.
---
base_model: meta-llama/Llama-3.1-8B-Instruct
library_name: transformers
model_name: GPT-5-Distill-llama3.1-8B-Instruct
tags:
- unsloth
- llama-3
- llama
- text-generation
- distillation
- gpt-5
license: llama3.1
language:
- en
- zh
---

# GPT-5-Distill-llama3.1-8B-Instruct

## Model Summary

<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/PNNVeEd1bKdL3F7oXCj5M.png" width="800" />

**GPT-5-Distill-llama3.1-8B-Instruct** is a fine-tuned version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct), designed to distill the capabilities of high-performance models (labeled as GPT-5 in the source datasets) into a more efficient 8B-parameter footprint.

This model was trained with **Unsloth** on a curated mix of approximately **164,000 high-quality instruction-response pairs**, focused on complex reasoning and restricted to responses whose flaw label is `normal`. A minimal inference sketch follows the specification list below.

* **Base Model:** `meta-llama/Llama-3.1-8B-Instruct`
* **Architecture:** Llama 3.1 (8B parameters)
* **Languages:** English (primary); Chinese is also declared in the model metadata
* **Context Window:** 32,768 tokens
* **Fine-tuning Framework:** [Unsloth](https://github.com/unslothai/unsloth) (QLoRA)
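
Below is a minimal inference sketch with 🤗 Transformers. The repository id is assumed from the model name on this card and may differ from the actual upload path; the prompt and sampling settings are illustrative only.

```python
# Minimal chat inference sketch (Transformers). The repo id below is assumed
# from the model name on this card; replace it with the actual repository path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GPT-5-Distill-llama3.1-8B-Instruct"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain knowledge distillation in two sentences."},
]

# The tokenizer carries the Llama-3 chat template, so apply_chat_template
# builds the correctly formatted prompt, including the generation header.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```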
## ✨ Key Advantages of GPT-5 Distillation

This model represents a shift towards **"Super-Knowledge Distillation"**, in which a smaller, efficient student model learns from a significantly more capable teacher.

* **🚀 Frontier-Level Reasoning**: By training on dataset samples attributed to GPT-5, the model acquires complex reasoning patterns, nuance, and problem-solving strategies that are typically absent from standard datasets and smaller models.
* **⚡ Efficient Intelligence**: Users get high-fidelity, coherent, and detailed responses on consumer hardware (e.g., a single GPU) without the latency, privacy concerns, or cost of querying large proprietary APIs (a 4-bit loading sketch follows this list).
* **💎 High-Purity Signal**: Strict filtering for `flaw == "normal"` ensures the model is fine-tuned only on the highest-confidence, error-free responses. This minimizes "hallucination inheritance" and aligns the model with safe, helpful behavior.
* **🎯 Enhanced Nuance & Tone**: Unlike standard fine-tunes that often sound robotic, this model mimics the more natural, conversational, and adaptive tone found in next-generation frontier models.
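
As a sketch of the single-GPU path mentioned above, the model can be loaded in 4-bit with bitsandbytes; the repository id is again a placeholder, and the NF4 settings are a common default rather than a documented requirement of this model.

```python
# Sketch: 4-bit (NF4) loading with bitsandbytes so the 8B model fits on a
# single consumer GPU. The repo id is a placeholder, not a confirmed path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "GPT-5-Distill-llama3.1-8B-Instruct"  # placeholder repo id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```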
## 📚 Training Data

The model was trained on a high-quality blend of two datasets, totaling **163,896 samples**:

1. **Chat-GPT-5-Chat-Response (160k samples)**
   * Filtered specifically for entries labeled `normal` to ensure high-quality, safe, and coherent responses (see the filtering sketch after this list).
   * This dataset serves as the primary distillation source, aiming to mimic the response patterns of advanced large language models.
2. **ShareGPT-Qwen3-235B-A22B-Instruct-2507 (3.9k samples)**
   * Approximately **3.9k examples** with an average of about **5 rounds of dialogue** per scenario, designed to strengthen the model's instruction-following ability and task-completion efficiency.
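
A sketch of that `flaw`-based filtering step with 🤗 Datasets is shown below; the dataset repository id and column layout are assumptions, and only the `flaw == "normal"` criterion comes from this card.

```python
# Sketch of the "normal"-only filtering described above, using the datasets
# library. The repo id is a placeholder and the column names are assumptions;
# only the flaw == "normal" criterion is taken from this card.
from datasets import load_dataset

raw = load_dataset("Chat-GPT-5-Chat-Response", split="train")  # placeholder repo id
clean = raw.filter(lambda example: example["flaw"] == "normal")
print(f"kept {len(clean):,} of {len(raw):,} samples")
```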
All data was formatted using the standard **Llama-3 Chat Template**.
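
The sketch below shows how a multi-turn sample can be rendered with that template via the tokenizer. The ShareGPT-style `conversations`/`from`/`value` field names are assumptions for illustration, not the datasets' documented schema.

```python
# Sketch: rendering one ShareGPT-style conversation with the Llama-3 chat
# template. The "conversations"/"from"/"value" field names are assumed for
# illustration; the actual dataset schema may differ.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

sample = {
    "conversations": [
        {"from": "human", "value": "Summarize the plot of Hamlet in one sentence."},
        {"from": "gpt", "value": "Prince Hamlet feigns madness while avenging his father's murder."},
    ]
}

role_map = {"system": "system", "human": "user", "gpt": "assistant"}
messages = [
    {"role": role_map[turn["from"]], "content": turn["value"]}
    for turn in sample["conversations"]
]

# tokenize=False returns the fully formatted training string, including the
# <|start_header_id|> ... <|eot_id|> markers of the Llama-3 template.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```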
## ⚙️ Training Details

The run used the following configuration (a training sketch reconstructed from these settings follows the list):

* **Hardware:** NVIDIA H100
* **Sequence Length:** 32,768 tokens (long-context support)
* **Batch Size:** 4 per device (effective batch size 32 via 8 gradient-accumulation steps)
* **Learning Rate:** 2e-5
* **Scheduler:** Linear
* **Optimizer:** AdamW 8-bit
* **LoRA Rank (r):** 32
* **LoRA Alpha:** 32
* **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
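
The block below is a training-configuration sketch reconstructed from the settings above, assuming the usual Unsloth + TRL `SFTTrainer` recipe (and a TRL version that still accepts `dataset_text_field` and `max_seq_length` directly; newer releases move these into `SFTConfig`). The dataset preparation, epoch count, and output paths are assumptions; the actual training script is not published on this card.

```python
# Sketch reconstructed from the hyperparameters listed above, assuming the
# standard Unsloth + TRL SFTTrainer recipe. Dataset preparation, epoch count,
# and output paths are assumptions, not the authors' actual script.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset

max_seq_length = 32768  # sequence length from the list above

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    max_seq_length=max_seq_length,
    load_in_4bit=True,  # QLoRA: 4-bit base weights + LoRA adapters
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# Placeholder dataset: a "text" column holding chat-template-formatted samples.
train_dataset = Dataset.from_dict({"text": ["<|begin_of_text|>...formatted sample..."]})

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,  # 4 x 8 = effective batch size 32
        learning_rate=2e-5,
        lr_scheduler_type="linear",
        optim="adamw_8bit",
        num_train_epochs=1,             # assumption; not stated on this card
        bf16=True,
        logging_steps=10,
        output_dir="outputs",
    ),
)

trainer.train()
```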
## 🛡️ License & Limitations

* **License:** This model is subject to the **Llama 3.1 Community License**.
* **Limitations:** While this model is distilled from high-capability sources, it is still an 8B-parameter model. It may hallucinate facts or struggle with extremely complex reasoning tasks compared to the original teacher models. The "GPT-5" naming refers to the source dataset labels and does not imply access to unreleased OpenAI weights.