---
language: en
license: mit
tags:
- mistral
- lora
- peft
- transcript-chunking
- text-segmentation
- topic-detection
- transformers
model_type: mistral
base_model: mistralai/Mistral-7B-Instruct-v0.2
datasets:
- custom-transcript-chunking
metrics:
- loss
- accuracy
---
# 🧠 Mistral LoRA Transcript Chunking Model
## Model Overview
This LoRA adapter was trained on a custom dataset of **1,000 English transcript examples** to teach **Mistral-7B-Instruct-v0.2** to segment long transcripts into topic-based chunks, using `section #:` as the delimiter.
It enables automated **topic boundary detection** in conversation, meeting, and podcast transcripts — ideal for preprocessing before summarization, classification, or retrieval.
---
## 🧩 Training Objective
The model learns to:
- Detect topic changes in unstructured transcripts
- Insert a `section #:` delimiter where those shifts occur
- Preserve the original flow of speech
**Example** (illustrative; the transcript is taken from the usage snippet below):
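```
Input:
Let's start with last week's performance metrics. Next, we'll review upcoming campaign deadlines.

Output:
section 1: Let's start with last week's performance metrics.
section 2: Next, we'll review upcoming campaign deadlines.
```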
---
## ⚙️ Training Configuration
- **Base Model:** `mistralai/Mistral-7B-Instruct-v0.2`
- **Adapter Type:** LoRA
- **PEFT Library:** `peft==0.10.0`
- **Training Framework:** Hugging Face Transformers
- **Epochs:** 2
- **Optimizer:** AdamW
- **Learning Rate:** 2e-4
- **Batch Size:** 8
- **Sequence Length:** 512
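
Below is a minimal sketch of how this configuration could be reproduced with `peft` and `transformers`. The LoRA rank, alpha, dropout, and target modules are assumptions (the card does not list them), and `train_dataset` stands in for the tokenized custom dataset; epochs, learning rate, and batch size match the values above.

```python
# Illustrative training setup. LoRA rank/alpha/dropout/target_modules are
# assumed, not taken from the actual run; epochs, LR, and batch size match
# the configuration listed above. Examples are assumed pre-tokenized to 512 tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=16,                                 # assumed rank
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,                    # assumed dropout
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

args = TrainingArguments(
    output_dir="mistral-lora-chunking",
    num_train_epochs=2,
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    logging_steps=100,        # matches the evaluation cadence in the table below
)
# train_dataset: placeholder for the tokenized 1,000-example transcript dataset
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```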
---
## 📊 Training Metrics
| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|------|----------------|----------------|----------|-------------|---------------------|
| 100 | 0.2961 | 0.1603 | 0.1644 | 204,800 | 0.9594 |
| 200 | 0.1362 | 0.1502 | 0.1609 | 409,600 | 0.9603 |
| 300 | 0.1360 | 0.1451 | 0.1391 | 612,864 | 0.9572 |
| 400 | 0.0951 | 0.1351 | 0.1279 | 817,664 | 0.9635 |
| 500 | 0.0947 | 0.1297 | 0.0892 | 1,022,464 | 0.9657 |
**Summary:**
Training and validation loss decreased steadily, and mean token accuracy stayed above **95%** throughout, indicating the model learned to reproduce transcripts faithfully while placing delimiters accurately.
---
## 🧰 Usage Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "Dc-4nderson/transcript_summarizer_model"

# Load the base model, then attach the LoRA adapter on top of it
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")  # device_map requires `accelerate`
model = PeftModel.from_pretrained(model, adapter)

text = (
    "Break this transcript wherever a new topic begins. Use 'section #:' as a delimiter.\n"
    "Transcript: Let's start with last week's performance metrics. "
    "Next, we'll review upcoming campaign deadlines."
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)  # 512 is ample for segmentation output
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
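
Downstream, the generated text can be split on the delimiters to recover the chunks. A short sketch, assuming the model emits delimiters in the `section <number>:` form requested by the prompt:

```python
import re

# Split the decoded output into topic chunks on "section N:" delimiters
# (assumes delimiters come back in the 'section <number>:' form used in the prompt)
generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
chunks = [c.strip() for c in re.split(r"section\s*\d+\s*:", generated, flags=re.IGNORECASE) if c.strip()]
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}: {chunk}")
```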
---
## 🧾 License
Released under the MIT License: free for research and commercial use with attribution.
---
## 🙌 Credits
Developed by Dequan Anderson for automated transcript segmentation and chunked text preprocessing tasks.
Built with Hugging Face Transformers, PEFT, and Mistral 7B for efficient LoRA fine-tuning.