|
|
--- |
|
|
language: en |
|
|
license: mit |
|
|
tags: |
|
|
- mistral |
|
|
- lora |
|
|
- peft |
|
|
- transcript-chunking |
|
|
- text-segmentation |
|
|
- topic-detection |
|
|
- transformers |
|
|
model_type: mistral |
|
|
base_model: mistralai/Mistral-7B-Instruct-v0.2 
|
|
datasets: |
|
|
- custom-transcript-chunking |
|
|
metrics: |
|
|
- loss |
|
|
- accuracy |
|
|
--- |
|
|
|
|
|
# 🧠 Mistral LoRA Transcript Chunking Model |
|
|
|
|
|
## Model Overview |
|
|
This LoRA adapter was trained on a custom dataset of **1,000 English transcript examples** to teach **Mistral-7B-Instruct-v0.2** to segment long transcripts into topic-based chunks, using `section #:` markers as delimiters. 
|
|
It enables automated **topic boundary detection** in conversation, meeting, and podcast transcripts — ideal for preprocessing before summarization, classification, or retrieval. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🧩 Training Objective |
|
|
The model learns to: |
|
|
|
|
|
- Detect topic changes in unstructured transcripts |
|
|
- Insert `section #:` markers (e.g. `section 1:`, `section 2:`) where those shifts occur 
|
|
- Preserve the original flow of speech |
|
|
|
|
|
**Example:** |
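
An illustrative input/output pair (constructed from the prompt used in the Usage Example below, not taken from the training data):

```
Break this transcript wherever a new topic begins. Use 'section #:' as a delimiter.
Transcript: Let's start with last week's performance metrics. Next, we'll review upcoming campaign deadlines.
```

Expected output:

```
section 1: Let's start with last week's performance metrics.
section 2: Next, we'll review upcoming campaign deadlines.
```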
|
|
|
|
|
--- |
|
|
|
|
|
## ⚙️ Training Configuration |
|
|
- **Base Model:** `mistralai/Mistral-7B-Instruct-v0.2` 
|
|
- **Adapter Type:** LoRA |
|
|
- **PEFT Library:** `peft==0.10.0` |
|
|
- **Training Framework:** Hugging Face Transformers |
|
|
- **Epochs:** 2 |
|
|
- **Optimizer:** AdamW |
|
|
- **Learning Rate:** 2e-4 |
|
|
- **Batch Size:** 8 |
|
|
- **Sequence Length:** 512 |
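
A minimal sketch of how this setup could be expressed with `peft` and `transformers`. The LoRA rank, alpha, dropout, target modules, and output directory are assumptions; the card only states the adapter type, base model, optimizer, learning rate, epochs, batch size, and sequence length:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA hyperparameters (r, alpha, dropout, target modules) are assumed values;
# only the adapter type is specified in this card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Values below mirror the configuration listed above; transcripts are
# truncated/padded to 512 tokens during tokenization (not shown).
training_args = TrainingArguments(
    output_dir="mistral-lora-transcript-chunking",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    learning_rate=2e-4,
    optim="adamw_torch",
    logging_steps=100,
)
```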
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 Training Metrics |
|
|
|
|
|
| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy | |
|
|
|------|----------------|----------------|----------|-------------|---------------------| |
|
|
| 100 | 0.2961 | 0.1603 | 0.1644 | 204,800 | 0.9594 | |
|
|
| 200 | 0.1362 | 0.1502 | 0.1609 | 409,600 | 0.9603 | |
|
|
| 300 | 0.1360 | 0.1451 | 0.1391 | 612,864 | 0.9572 | |
|
|
| 400 | 0.0951 | 0.1351 | 0.1279 | 817,664 | 0.9635 | |
|
|
| 500 | 0.0947 | 0.1297 | 0.0892 | 1,022,464 | 0.9657 | |
|
|
|
|
|
**Summary:** |
|
|
Training and validation loss decreased steadily, and mean token accuracy stayed above **95%** throughout, indicating the model effectively learned to reproduce transcripts while placing delimiters accurately. 
|
|
|
|
|
--- |
|
|
|
|
|
## 🧰 Usage Example |
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "Dc-4nderson/transcript_summarizer_model"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
model = PeftModel.from_pretrained(model, adapter)  # attach the LoRA adapter

text = (
    "Break this transcript wherever a new topic begins. Use 'section #:' as a delimiter.\n"
    "Transcript: Let's start with last week's performance metrics. "
    "Next, we’ll review upcoming campaign deadlines."
)

inputs = tokenizer(text, return_tensors="pt")
# Cap generation at a reasonable length; the chunked transcript is roughly
# the input length plus the inserted section markers.
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
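
A possible post-processing step for turning the generated text into a list of chunks, continuing from the example above. The regex and helper are illustrative (not part of the released code) and assume the model emits markers of the form `section 1:`, `section 2:`, and so on:

```python
import re

def split_into_chunks(generated: str) -> list[str]:
    # Split on 'section N:' markers; assumes the output format described above.
    parts = re.split(r"section\s*\d+\s*:", generated, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]

chunks = split_into_chunks(tokenizer.decode(outputs[0], skip_special_tokens=True))
for i, chunk in enumerate(chunks, start=1):
    print(f"chunk {i}: {chunk}")
```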
|
|
---

## 🧾 License 
|
|
|
|
|
Released under the MIT License — free for research and commercial use with attribution. |
|
|
|
|
|
---

## 🙌 Credits 
|
|
|
|
|
Developed by Dequan Anderson for automated transcript segmentation and chunked text preprocessing tasks. |
|
|
Built using Hugging Face Transformers, PEFT, and Mistral 7B for efficient LoRA fine-tuning. |