---
language: en
license: mit
tags:
- mistral
- lora
- peft
- transcript-chunking
- text-segmentation
- topic-detection
- transformers
model_type: mistral
base_model: mistralai/Mistral-7B-v0.2
datasets:
- custom-transcript-chunking
metrics:
- loss
- accuracy
---
# 🧠 Mistral LoRA Transcript Chunking Model
## Model Overview
This LoRA adapter was trained on a custom dataset of **1,000 English transcript examples** to teach a **Mistral-7B-v0.2** model how to segment long transcripts into topic-based chunks, marking each chunk with a `section #:` delimiter.
It enables automated **topic boundary detection** in conversation, meeting, and podcast transcripts — ideal for preprocessing before summarization, classification, or retrieval.
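Once a transcript has been segmented, downstream tools can split the output on those delimiters to obtain the individual chunks. The helper below is an illustrative sketch only (the `split_into_chunks` name and the regex are assumptions, matching delimiters of the form `section 1:`, `section 2:`, ...):

```python
import re

def split_into_chunks(segmented_text: str) -> list[str]:
    """Split model output into topic chunks on 'section N:' delimiters (illustrative helper)."""
    chunks = re.split(r"section\s*\d+\s*:", segmented_text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]
```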
---
## 🧩 Training Objective
The model learns to:
- Detect topic changes in unstructured transcripts
- Insert `section #:` delimiters where those shifts occur
- Preserve the original flow of speech
**Example:**
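The snippet below is illustrative only; the transcript is taken from the usage example further down, and the exact output wording depends on the adapter:

```
Input:
Let's start with last week's performance metrics. Next, we'll review upcoming campaign deadlines.

Output:
section 1: Let's start with last week's performance metrics.
section 2: Next, we'll review upcoming campaign deadlines.
```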
---
## ⚙️ Training Configuration
- **Base Model:** `mistralai/Mistral-7B-v0.2`
- **Adapter Type:** LoRA
- **PEFT Library:** `peft==0.10.0`
- **Training Framework:** Hugging Face Transformers
- **Epochs:** 2
- **Optimizer:** AdamW
- **Learning Rate:** 2e-4
- **Batch Size:** 8
- **Sequence Length:** 512
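For reference, a minimal PEFT setup mirroring these hyperparameters might look like the sketch below. The LoRA rank, alpha, dropout, and target modules are assumptions (the card does not list them); only the epochs, learning rate, batch size, optimizer, and sequence length come from the configuration above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# LoRA adapter configuration (rank/alpha/dropout/target modules are assumed values).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Hyperparameters from the list above; the tokenized dataset (max length 512)
# would be passed to a Trainer together with these arguments.
training_args = TrainingArguments(
    output_dir="mistral-lora-transcript-chunking",
    num_train_epochs=2,
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    optim="adamw_torch",
)
```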
---
## 📊 Training Metrics
| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|------|----------------|----------------|----------|-------------|---------------------|
| 100 | 0.2961 | 0.1603 | 0.1644 | 204,800 | 0.9594 |
| 200 | 0.1362 | 0.1502 | 0.1609 | 409,600 | 0.9603 |
| 300 | 0.1360 | 0.1451 | 0.1391 | 612,864 | 0.9572 |
| 400 | 0.0951 | 0.1351 | 0.1279 | 817,664 | 0.9635 |
| 500 | 0.0947 | 0.1297 | 0.0892 | 1,022,464 | 0.9657 |
**Summary:**
Training loss decreased steadily, and mean token accuracy stayed above **95%** throughout, indicating the model learned to reproduce the transcript text and place the `section #:` delimiters reliably.
---
## 🧰 Usage Example
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "Dc-4nderson/transcript_summarizer_model"

# Load the base model and attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)

# Instruction followed by the raw transcript to segment.
text = (
    "Break this transcript wherever a new topic begins. Use 'section #:' as a delimiter.\n"
    "Transcript: Let's start with last week's performance metrics. "
    "Next, we’ll review upcoming campaign deadlines."
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)  # enough headroom for the segmented transcript
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
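To serve the model without the PEFT dependency at inference time, the adapter can optionally be merged into the base weights. This is a standard PEFT operation rather than something the card prescribes, shown here as a continuation of the snippet above (the output directory name is arbitrary):

```python
# Merge the LoRA weights into the base model and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("mistral-transcript-chunking-merged")
tokenizer.save_pretrained("mistral-transcript-chunking-merged")
```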
## 🧾 License
Released under the MIT License: free for research and commercial use with attribution.

## 🙌 Credits
Developed by Dequan Anderson for automated transcript segmentation and chunked text preprocessing tasks.
Built with Hugging Face Transformers, PEFT, and Mistral 7B for efficient LoRA fine-tuning.