# Mistral LoRA Transcript Chunking Model

## Model Overview
This LoRA adapter was trained on a custom dataset of 1,000 English transcript examples to teach Mistral-7B-Instruct-v0.2 to segment long transcripts into topic-based chunks, using 'section #:' as a delimiter.
It enables automated topic-boundary detection in conversation, meeting, and podcast transcripts, making it ideal for preprocessing before summarization, classification, or retrieval.
## Training Objective
The model learns to:
- Detect topic changes in unstructured transcripts
- Insert `section #:` delimiters where those shifts occur
- Preserve the original flow of speech
Example:
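A minimal illustration of the intended behavior (this sample is invented for illustration; the actual training examples are not reproduced here):

```
Input:
Break this transcript wherever a new topic begins. Use 'section #:' as a delimiter.
Transcript: Let's start with last week's performance metrics. Next, we'll review upcoming campaign deadlines.

Output:
section 1: Let's start with last week's performance metrics.
section 2: Next, we'll review upcoming campaign deadlines.
```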
## Training Configuration
- Base Model: `mistralai/Mistral-7B-Instruct-v0.2`
- Adapter Type: LoRA
- PEFT Library: `peft==0.10.0`
- Training Framework: Hugging Face Transformers
- Epochs: 2
- Optimizer: AdamW
- Learning Rate: 2e-4
- Batch Size: 8
- Sequence Length: 512
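For reference, here is a minimal sketch of how this configuration could be expressed with `peft` and `transformers`. The LoRA rank, alpha, dropout, and target modules below are assumptions (they are not documented on this card); only the hyperparameters in the list above are taken from it, and `train_dataset` stands in for the custom 1,000-example dataset.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA settings: r, alpha, dropout, and target modules are illustrative
# guesses; they are not documented for this adapter.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Hyperparameters taken from the list above: 2 epochs, AdamW, lr 2e-4, batch 8.
args = TrainingArguments(
    output_dir="transcript-chunking-lora",
    num_train_epochs=2,
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    optim="adamw_torch",
    logging_steps=100,
)

# train_dataset is assumed: a dataset of instruction/transcript pairs
# tokenized and truncated to 512 tokens (the sequence length listed above).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, data_collator=collator)
trainer.train()
```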
## Training Metrics

| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|------|---------------|-----------------|---------|------------|---------------------|
| 100  | 0.2961        | 0.1603          | 0.1644  | 204,800    | 0.9594              |
| 200  | 0.1362        | 0.1502          | 0.1609  | 409,600    | 0.9603              |
| 300  | 0.1360        | 0.1451          | 0.1391  | 612,864    | 0.9572              |
| 400  | 0.0951        | 0.1351          | 0.1279  | 817,664    | 0.9635              |
| 500  | 0.0947        | 0.1297          | 0.0892  | 1,022,464  | 0.9657              |
Summary:
Loss steadily decreased during training, and accuracy remained consistently above 95%, indicating the model effectively learned transcript reconstruction and accurate delimiter placement.
## Usage Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "Dc-4nderson/transcript_summarizer_model"

# Load the base model, then attach the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
model = PeftModel.from_pretrained(model, adapter)

# Prompt format: an instruction followed by the transcript to segment.
text = (
    "Break this transcript wherever a new topic begins. Use 'section #:' as a delimiter.\n"
    "Transcript: Let's start with last week's performance metrics. "
    "Next, we'll review upcoming campaign deadlines."
)

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)  # cap for short transcripts; raise for longer inputs
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
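Because the model marks boundaries inline with 'section #:', the generated text can be split into chunks for downstream summarization, classification, or retrieval. A small post-processing sketch (the helper name and regex are illustrative, not part of the released adapter):

```python
import re

def split_into_chunks(generated: str) -> list[str]:
    """Split model output on 'section <n>:' delimiters into topic chunks."""
    parts = re.split(r"section\s*\d+\s*:", generated)
    return [p.strip() for p in parts if p.strip()]

chunks = split_into_chunks(tokenizer.decode(outputs[0], skip_special_tokens=True))
for i, chunk in enumerate(chunks, start=1):
    print(f"chunk {i}: {chunk}")
```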
## License
Released under the MIT License: free for research and commercial use with attribution.
## Credits
Developed by Dequan Anderson for automated transcript segmentation and chunked text preprocessing tasks.
Built using Hugging Face Transformers, PEFT, and Mistral 7B for efficient LoRA fine-tuning.