---
language: en
license: mit
tags:
- mistral
- lora
- peft
- transcript-chunking
- text-segmentation
- topic-detection
- transformers
model_type: mistral
base_model: mistralai/Mistral-7B-Instruct-v0.2
datasets:
- custom-transcript-chunking
metrics:
- loss
- accuracy
---

# 🧠 Mistral LoRA Transcript Chunking Model

## Model Overview
This LoRA adapter was trained on a custom dataset of **1,000 English transcript examples** to teach **Mistral-7B-Instruct-v0.2** to segment long transcripts into topic-based chunks, using `section #:` markers as delimiters.  
It enables automated **topic boundary detection** in conversation, meeting, and podcast transcripts — ideal for preprocessing before summarization, classification, or retrieval.

---

## 🧩 Training Objective
The model learns to:

- Detect topic changes in unstructured transcripts  
- Insert `section #:` delimiters where those shifts occur  
- Preserve the original flow of speech  

**Example:**
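
A minimal illustrative input/output pair (the transcript text is hypothetical; the prompt format matches the usage example below):

```text
Input:
Break this transcript wherever a new topic begins. Use 'section #:' as a delimiter.
Transcript: Let's start with last week's performance metrics. Next, we'll review upcoming campaign deadlines.

Output:
section 1: Let's start with last week's performance metrics.
section 2: Next, we'll review upcoming campaign deadlines.
```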

---

## ⚙️ Training Configuration
- **Base Model:** `mistralai/Mistral-7B-Instruct-v0.2`  
- **Adapter Type:** LoRA  
- **PEFT Library:** `peft==0.10.0`  
- **Training Framework:** Hugging Face Transformers  
- **Epochs:** 2  
- **Optimizer:** AdamW  
- **Learning Rate:** 2e-4  
- **Batch Size:** 8  
- **Sequence Length:** 512  
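
For reference, a minimal sketch of how such a setup might be expressed with `peft` and `transformers`. The LoRA rank, alpha, dropout, and target modules below are illustrative assumptions; they are not reported on this card.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Base model (the same one loaded in the usage example below)
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# LoRA adapter configuration: r, lora_alpha, lora_dropout, and target_modules
# are assumed values, not listed on this card
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Hyperparameters reported above: 2 epochs, AdamW, lr 2e-4, batch size 8
# (the 512-token sequence length is applied when tokenizing the dataset)
training_args = TrainingArguments(
    output_dir="mistral-lora-transcript-chunking",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    learning_rate=2e-4,
    optim="adamw_torch",
)
```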

---

## 📊 Training Metrics

| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|------|----------------|----------------|----------|-------------|---------------------|
| 100  | 0.2961 | 0.1603 | 0.1644 | 204,800 | 0.9594 |
| 200  | 0.1362 | 0.1502 | 0.1609 | 409,600 | 0.9603 |
| 300  | 0.1360 | 0.1451 | 0.1391 | 612,864 | 0.9572 |
| 400  | 0.0951 | 0.1351 | 0.1279 | 817,664 | 0.9635 |
| 500  | 0.0947 | 0.1297 | 0.0892 | 1,022,464 | 0.9657 |

**Summary:**  
Loss steadily decreased during training, and accuracy remained consistently above **95%**, indicating the model effectively learned transcript reconstruction and accurate delimiter placement.

---

## 🧰 Usage Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "Dc-4nderson/transcript_summarizer_model"

# Load the base model and attach the LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
model = PeftModel.from_pretrained(model, adapter)

text = (
    "Break this transcript wherever a new topic begins. Use 'section #:' as a delimiter.\n"
    "Transcript: Let's start with last week's performance metrics. "
    "Next, we’ll review upcoming campaign deadlines."
)

# Tokenize the prompt and generate the segmented transcript
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

```
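
Continuing the example above, the generated text can then be split on the `section #:` markers to recover the individual chunks. This post-processing step is a sketch, not part of the adapter itself:

```python
import re

# Split the decoded output on 'section N:' markers to recover topic chunks
generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
chunks = [c.strip() for c in re.split(r"section\s*\d+\s*:", generated) if c.strip()]
for i, chunk in enumerate(chunks, 1):
    print(f"[chunk {i}] {chunk}")
```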
---

## 🧾 License
Released under the MIT License. Free for research and commercial use with attribution.

---

## 🙌 Credits
Developed by Dequan Anderson for automated transcript segmentation and chunked text preprocessing.
Built with Hugging Face Transformers, PEFT, and Mistral-7B for efficient LoRA fine-tuning.