|
|
--- |
|
|
language: en |
|
|
license: mit |
|
|
tags: |
|
|
- mistral |
|
|
- lora |
|
|
- peft |
|
|
- transcript-chunking |
|
|
- text-segmentation |
|
|
- topic-detection |
|
|
- transformers |
|
|
model_type: mistral |
|
|
base_model: mistralai/Mistral-7B-Instruct-v0.2 
|
|
datasets: |
|
|
- custom-transcript-chunking |
|
|
metrics: |
|
|
- loss |
|
|
- accuracy |
|
|
--- |
|
|
|
|
|
# 🧠 Mistral LoRA Transcript Chunking Model |
|
|
|
|
|
## Model Overview |
|
|
This LoRA adapter was trained on a custom dataset of **1,000 English transcript examples** to teach **Mistral-7B-Instruct-v0.2** to segment long transcripts into topic-based chunks, using `section #:` markers as delimiters. 
|
|
It enables automated **topic boundary detection** in conversation, meeting, and podcast transcripts — ideal for preprocessing before summarization, classification, or retrieval. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🧩 Training Objective |
|
|
The model learns to: |
|
|
|
|
|
- Detect topic changes in unstructured transcripts |
|
|
- Insert `section #:` markers (e.g. `section 1:`, `section 2:`) where those shifts occur 
|
|
- Preserve the original flow of speech |
|
|
|
|
|
**Example:** |
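
An illustrative input/output pair (constructed from the prompt used in the Usage Example below, not taken from the training data):

```
Break this transcript wherever a new topic begins. Use 'section #:' as a delimiter.
Transcript: Let's start with last week's performance metrics. Next, we'll review upcoming campaign deadlines.
```

Expected output:

```
section 1: Let's start with last week's performance metrics.
section 2: Next, we'll review upcoming campaign deadlines.
```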
|
|
|
|
|
--- |
|
|
|
|
|
## ⚙️ Training Configuration |
|
|
- **Base Model:** `mistralai/Mistral-7B-Instruct-v0.2` 
|
|
- **Adapter Type:** LoRA |
|
|
- **PEFT Library:** `peft==0.10.0` |
|
|
- **Training Framework:** Hugging Face Transformers |
|
|
- **Epochs:** 2 |
|
|
- **Optimizer:** AdamW |
|
|
- **Learning Rate:** 2e-4 |
|
|
- **Batch Size:** 8 |
|
|
- **Sequence Length:** 512 |
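
A minimal sketch of how this setup could be expressed with `peft` and `transformers`. The LoRA rank, alpha, dropout, target modules, and output directory are assumptions; the card only states the adapter type, base model, optimizer, learning rate, epochs, batch size, and sequence length:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA hyperparameters (r, alpha, dropout, target modules) are assumed values;
# only the adapter type is specified in this card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Values below mirror the configuration listed above; transcripts are
# truncated/padded to 512 tokens during tokenization (not shown).
training_args = TrainingArguments(
    output_dir="mistral-lora-transcript-chunking",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    learning_rate=2e-4,
    optim="adamw_torch",
    logging_steps=100,
)
```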
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 Training Metrics |
|
|
|
|
|
| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy | |
|
|
|------|----------------|----------------|----------|-------------|---------------------| |
|
|
| 100 | 0.2961 | 0.1603 | 0.1644 | 204,800 | 0.9594 | |
|
|
| 200 | 0.1362 | 0.1502 | 0.1609 | 409,600 | 0.9603 | |
|
|
| 300 | 0.1360 | 0.1451 | 0.1391 | 612,864 | 0.9572 | |
|
|
| 400 | 0.0951 | 0.1351 | 0.1279 | 817,664 | 0.9635 | |
|
|
| 500 | 0.0947 | 0.1297 | 0.0892 | 1,022,464 | 0.9657 | |
|
|
|
|
|
**Summary:** |
|
|
Training and validation loss decreased steadily, and mean token accuracy stayed above **95%** throughout, indicating the model effectively learned to reproduce transcripts while placing delimiters accurately. 
|
|
|
|
|
--- |
|
|
|
|
|
## 🧰 Usage Example |
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "Dc-4nderson/transcript_summarizer_model"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
model = PeftModel.from_pretrained(model, adapter)  # attach the LoRA adapter

text = (
    "Break this transcript wherever a new topic begins. Use 'section #:' as a delimiter.\n"
    "Transcript: Let's start with last week's performance metrics. "
    "Next, we’ll review upcoming campaign deadlines."
)

inputs = tokenizer(text, return_tensors="pt")
# Cap generation at a reasonable length; the chunked transcript is roughly
# the input length plus the inserted section markers.
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
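
A possible post-processing step for turning the generated text into a list of chunks, continuing from the example above. The regex and helper are illustrative (not part of the released code) and assume the model emits markers of the form `section 1:`, `section 2:`, and so on:

```python
import re

def split_into_chunks(generated: str) -> list[str]:
    # Split on 'section N:' markers; assumes the output format described above.
    parts = re.split(r"section\s*\d+\s*:", generated, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]

chunks = split_into_chunks(tokenizer.decode(outputs[0], skip_special_tokens=True))
for i, chunk in enumerate(chunks, start=1):
    print(f"chunk {i}: {chunk}")
```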
|
|
---

## 🧾 License 
|
|
|
|
|
Released under the MIT License — free for research and commercial use with attribution. |
|
|
|
|
|
---

## 🙌 Credits 
|
|
|
|
|
Developed by Dequan Anderson for automated transcript segmentation and chunked text preprocessing tasks. |
|
|
Built using Hugging Face Transformers, PEFT, and Mistral 7B for efficient LoRA fine-tuning. |