Dc-4nderson/transcript_summarizer_model

@@ -1,57 +1,72 @@
-This LoRA adapter was trained on a custom dataset of 1,000 English transcript examples to teach a Mistral-7B model how to segment long transcripts into topic-based chunks using -- as delimiters.
-It enables automated topic boundary detection in conversation, meeting, and podcast transcripts — ideal for preprocessing before summarization, classification, or retrieval.
-🧩 Training Objective
 The model learns to:
-Detect topic changes in unstructured transcripts
-Insert -- where those shifts occur
-Preserve the original flow of speech
-Example:
-Input:
-Chunk this transcript wherever a new topic begins. Use -- as a delimiter.
-Transcript: Welcome everyone to the meeting. Today we'll discuss project updates and next quarter goals.
-Output:
-Welcome everyone to the meeting -- Today we'll discuss project updates -- and next quarter goals.
-⚙️ Training Configuration
-Base Model: mistralai/Mistral-7B-v0.2
-Adapter Type: LoRA
-PEFT Library: peft==0.10.0
-Training Framework: Hugging Face Transformers
-Epochs: 2
-Optimizer: AdamW
-Learning Rate: 2e-4
-Batch Size: 8
-Sequence Length: 512
-📊 Training Metrics
-Step	Training Loss	Validation Loss	Entropy	Num Tokens	Mean Token Accuracy
-100	0.2961	0.1603	0.1644	204,800	0.9594
-200	0.1362	0.1502	0.1609	409,600	0.9603
-300	0.1360	0.1451	0.1391	612,864	0.9572
-400	0.0951	0.1351	0.1279	817,664	0.9635
-500	0.0947	0.1297	0.0892	1,022,464	0.9657
-Summary:
-Loss steadily decreased over training, and accuracy remained consistently above 95%, indicating the model effectively learned transcript reconstruction and delimiter placement patterns.
-🧰 Usage Example
 from transformers import AutoTokenizer, AutoModelForCausalLM
 from peft import PeftModel
@@ -62,11 +77,15 @@ tokenizer = AutoTokenizer.from_pretrained(base)
 model = AutoModelForCausalLM.from_pretrained(base)
 model = PeftModel.from_pretrained(model, adapter)
-text = "Break this transcript wherever a new topic begins. Use -- as a delimiter.\nTranscript: Let's start with last week's performance metrics. Next, we’ll review upcoming campaign deadlines."
 inputs = tokenizer(text, return_tensors="pt")
 outputs = model.generate(**inputs, max_new_tokens=30000)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 🧾 License
 Released under the MIT License — free for research and commercial use with attribution.

+---
+language: en
+license: mit
+tags:
+- mistral
+- lora
+- peft
+- transcript-chunking
+- text-segmentation
+- topic-detection
+- transformers
+model_type: mistral
+base_model: mistralai/Mistral-7B-v0.2
+datasets:
+- custom-transcript-chunking
+metrics:
+- loss
+- accuracy
+---
+# 🧠 Mistral LoRA Transcript Chunking Model
+## Model Overview
+This LoRA adapter was trained on a custom dataset of **1,000 English transcript examples** to teach a **Mistral-7B-v0.2** model how to segment long transcripts into topic-based chunks using `--` as delimiters.
+It enables automated **topic boundary detection** in conversation, meeting, and podcast transcripts — ideal for preprocessing before summarization, classification, or retrieval.
+---
+## 🧩 Training Objective
 The model learns to:
+- Detect topic changes in unstructured transcripts
+- Insert `--` where those shifts occur
+- Preserve the original flow of speech
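+**Example:**
+Input:
+Chunk this transcript wherever a new topic begins. Use `--` as a delimiter.
+Transcript: Welcome everyone to the meeting. Today we'll discuss project updates and next quarter goals.
+Output:
+Welcome everyone to the meeting -- Today we'll discuss project updates -- and next quarter goals.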
+---
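Downstream steps such as summarization or retrieval usually need the chunks as separate strings rather than one delimited line. The following is a minimal post-processing sketch, assuming the prompt has already been stripped from the generated text and that `--` never occurs naturally inside a chunk. `split_chunks` is a hypothetical helper for illustration, not part of this repository.

```python
import re
from typing import List


def split_chunks(generated: str, delimiter: str = "--") -> List[str]:
    """Split delimited model output into a list of non-empty topic chunks."""
    # Tolerate any amount of whitespace on either side of the delimiter.
    pattern = r"\s*" + re.escape(delimiter) + r"\s*"
    return [piece.strip() for piece in re.split(pattern, generated) if piece.strip()]


# Sample output taken from the example in this model card.
output = (
    "Welcome everyone to the meeting -- Today we'll discuss project updates "
    "-- and next quarter goals."
)
for chunk in split_chunks(output):
    print(chunk)
```

A transcript that contains a literal double hyphen would be split at that point as well, so a rarer delimiter or an escaping step may be preferable in practice.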
+## ⚙️ Training Configuration
+- **Base Model:** `mistralai/Mistral-7B-v0.2`
+- **Adapter Type:** LoRA
+- **PEFT Library:** `peft==0.10.0`
+- **Training Framework:** Hugging Face Transformers
+- **Epochs:** 2
+- **Optimizer:** AdamW
+- **Learning Rate:** 2e-4
+- **Batch Size:** 8
+- **Sequence Length:** 512
+---
+## 📊 Training Metrics
+| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
+|------|----------------|----------------|----------|-------------|---------------------|
+| 100  | 0.2961 | 0.1603 | 0.1644 | 204,800 | 0.9594 |
+| 200  | 0.1362 | 0.1502 | 0.1609 | 409,600 | 0.9603 |
+| 300  | 0.1360 | 0.1451 | 0.1391 | 612,864 | 0.9572 |
+| 400  | 0.0951 | 0.1351 | 0.1279 | 817,664 | 0.9635 |
+| 500  | 0.0947 | 0.1297 | 0.0892 | 1,022,464 | 0.9657 |
+**Summary:**
+Training loss fell from 0.2961 to 0.0947 and validation loss from 0.1603 to 0.1297 over 500 steps, while mean token accuracy stayed above **95%** at every logged step (0.957 to 0.966). This suggests the model learned transcript reconstruction and delimiter placement.
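If the reported losses are mean per-token cross-entropy in nats (an assumption; the card does not state the unit), they can be converted to perplexity by exponentiation. A small sketch using the validation losses from the table above:

```python
import math

# Validation loss at each logged step, copied from the metrics table.
validation_loss = {100: 0.1603, 200: 0.1502, 300: 0.1451, 400: 0.1351, 500: 0.1297}


def perplexity(loss: float) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)


for step, loss in validation_loss.items():
    print(f"step {step}: validation perplexity {perplexity(loss):.3f}")
```

A perplexity close to 1 is expected here because most output tokens are copied directly from the input transcript, so this number says little about how well the delimiters themselves are placed.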
+---
+## 🧰 Usage Example
+```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
 from peft import PeftModel
@@ -62,11 +77,15 @@ tokenizer = AutoTokenizer.from_pretrained(base)
 model = AutoModelForCausalLM.from_pretrained(base)
 model = PeftModel.from_pretrained(model, adapter)
+text = (
+    "Break this transcript wherever a new topic begins. Use -- as a delimiter.\n"
+    "Transcript: Let's start with last week's performance metrics. "
+    "Next, we’ll review upcoming campaign deadlines."
+)
 inputs = tokenizer(text, return_tensors="pt")
 outputs = model.generate(**inputs, max_new_tokens=30000)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
 🧾 License
 Released under the MIT License — free for research and commercial use with attribution.