Dc-4nderson committed on
Commit 34e1b2b · verified · 1 Parent(s): bf613c6

Update README.md

Files changed (1)
1. README.md +64 -45
README.md CHANGED
@@ -1,57 +1,72 @@
- This LoRA adapter was trained on a custom dataset of 1,000 English transcript examples to teach a Mistral-7B model how to segment long transcripts into topic-based chunks using -- as delimiters.
- It enables automated topic boundary detection in conversation, meeting, and podcast transcripts — ideal for preprocessing before summarization, classification, or retrieval.
-
- 🧩 Training Objective
-
  The model learns to:

- Detect topic changes in unstructured transcripts
-
- Insert -- where those shifts occur
-
- Preserve the original flow of speech
-
- Example:
-
- Input:
- Chunk this transcript wherever a new topic begins. Use -- as a delimiter.
- Transcript: Welcome everyone to the meeting. Today we'll discuss project updates and next quarter goals.
-
- Output:
- Welcome everyone to the meeting -- Today we'll discuss project updates -- and next quarter goals.

- ⚙️ Training Configuration

- Base Model: mistralai/Mistral-7B-v0.2

- Adapter Type: LoRA

- PEFT Library: peft==0.10.0

- Training Framework: Hugging Face Transformers

- Epochs: 2

- Optimizer: AdamW

- Learning Rate: 2e-4

- Batch Size: 8
-
- Sequence Length: 512
-
- 📊 Training Metrics
- Step  Training Loss  Validation Loss  Entropy  Num Tokens  Mean Token Accuracy
- 100   0.2961         0.1603           0.1644   204,800     0.9594
- 200   0.1362         0.1502           0.1609   409,600     0.9603
- 300   0.1360         0.1451           0.1391   612,864     0.9572
- 400   0.0951         0.1351           0.1279   817,664     0.9635
- 500   0.0947         0.1297           0.0892   1,022,464   0.9657
-
- Summary:
- Loss steadily decreased over training, and accuracy remained consistently above 95%, indicating the model effectively learned transcript reconstruction and delimiter placement patterns.
-
- 🧰 Usage Example
  from transformers import AutoTokenizer, AutoModelForCausalLM
  from peft import PeftModel
@@ -62,11 +77,15 @@ tokenizer = AutoTokenizer.from_pretrained(base)
  model = AutoModelForCausalLM.from_pretrained(base)
  model = PeftModel.from_pretrained(model, adapter)

- text = "Break this transcript wherever a new topic begins. Use -- as a delimiter.\nTranscript: Let's start with last week's performance metrics. Next, we’ll review upcoming campaign deadlines."
  inputs = tokenizer(text, return_tensors="pt")
  outputs = model.generate(**inputs, max_new_tokens=30000)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
-
  🧾 License

  Released under the MIT License — free for research and commercial use with attribution.
 
+ ---
+ language: en
+ license: mit
+ tags:
+ - mistral
+ - lora
+ - peft
+ - transcript-chunking
+ - text-segmentation
+ - topic-detection
+ - transformers
+ model_type: mistral
+ base_model: mistralai/Mistral-7B-v0.2
+ datasets:
+ - custom-transcript-chunking
+ metrics:
+ - loss
+ - accuracy
+ ---
+
+ # 🧠 Mistral LoRA Transcript Chunking Model
+
+ ## Model Overview
+ This LoRA adapter was trained on a custom dataset of **1,000 English transcript examples** to teach a **Mistral-7B-v0.2** model how to segment long transcripts into topic-based chunks using `--` as delimiters.
+ It enables automated **topic boundary detection** in conversation, meeting, and podcast transcripts — ideal for preprocessing before summarization, classification, or retrieval.
+
+ ---
+
+ ## 🧩 Training Objective
  The model learns to:

+ - Detect topic changes in unstructured transcripts
+ - Insert `--` where those shifts occur
+ - Preserve the original flow of speech

+ **Example:**
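For illustration, the example from the earlier revision of this card shows the intended behavior:

Input:
Chunk this transcript wherever a new topic begins. Use -- as a delimiter.
Transcript: Welcome everyone to the meeting. Today we'll discuss project updates and next quarter goals.

Output:
Welcome everyone to the meeting -- Today we'll discuss project updates -- and next quarter goals.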
+ ---
+
+ ## ⚙️ Training Configuration
+ - **Base Model:** `mistralai/Mistral-7B-v0.2`
+ - **Adapter Type:** LoRA
+ - **PEFT Library:** `peft==0.10.0`
+ - **Training Framework:** Hugging Face Transformers
+ - **Epochs:** 2
+ - **Optimizer:** AdamW
+ - **Learning Rate:** 2e-4
+ - **Batch Size:** 8
+ - **Sequence Length:** 512
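These settings map roughly onto the following `peft` and `transformers` objects. This is a minimal sketch rather than the card's actual training script: the LoRA rank, alpha, dropout, target modules, output directory, and dataset field name are not stated above and are placeholders.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

base = "mistralai/Mistral-7B-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA adapter settings; r, lora_alpha, lora_dropout, and target_modules
# are assumptions, not values from the card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Hyperparameters listed on the card: 2 epochs, AdamW, lr 2e-4, batch size 8.
training_args = TrainingArguments(
    output_dir="mistral-lora-transcript-chunking",  # assumed output path
    num_train_epochs=2,
    per_device_train_batch_size=8,
    learning_rate=2e-4,
    optim="adamw_torch",
)

# The 512 sequence length is applied when tokenizing the training examples;
# the "text" field name is an assumption about the dataset schema.
def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)
```

These objects would then be passed, together with the tokenized dataset, to a `transformers.Trainer` for fine-tuning.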
+ ---

+ ## 📊 Training Metrics

+ | Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
+ |------|----------------|----------------|----------|-------------|---------------------|
+ | 100 | 0.2961 | 0.1603 | 0.1644 | 204,800 | 0.9594 |
+ | 200 | 0.1362 | 0.1502 | 0.1609 | 409,600 | 0.9603 |
+ | 300 | 0.1360 | 0.1451 | 0.1391 | 612,864 | 0.9572 |
+ | 400 | 0.0951 | 0.1351 | 0.1279 | 817,664 | 0.9635 |
+ | 500 | 0.0947 | 0.1297 | 0.0892 | 1,022,464 | 0.9657 |

+ **Summary:**
+ Loss steadily decreased during training, and accuracy remained consistently above **95%**, indicating the model effectively learned transcript reconstruction and accurate delimiter placement.

+ ---

+ ## 🧰 Usage Example
+ ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
  from peft import PeftModel

  model = AutoModelForCausalLM.from_pretrained(base)
  model = PeftModel.from_pretrained(model, adapter)

+ text = (
+     "Break this transcript wherever a new topic begins. Use -- as a delimiter.\n"
+     "Transcript: Let's start with last week's performance metrics. "
+     "Next, we’ll review upcoming campaign deadlines."
+ )
+
  inputs = tokenizer(text, return_tensors="pt")
  outputs = model.generate(**inputs, max_new_tokens=30000)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
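# Illustrative post-processing (not part of the card): since the adapter's only
# job is to insert "--" at topic boundaries, downstream use typically reduces to
# splitting the generated text on that delimiter. The helper name and the sample
# output string below are assumptions for demonstration.
def chunk_transcript(generated: str) -> list[str]:
    # Split on the delimiter, trim whitespace, and drop empty pieces.
    parts = [part.strip() for part in generated.split("--")]
    return [p for p in parts if p]

example_output = (
    "Let's start with last week's performance metrics -- "
    "Next, we’ll review upcoming campaign deadlines."
)
print(chunk_transcript(example_output))
# ["Let's start with last week's performance metrics",
#  "Next, we’ll review upcoming campaign deadlines."]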
 
  🧾 License

  Released under the MIT License — free for research and commercial use with attribution.