mazesmazes committed on
Commit 370211e · verified
1 Parent(s): 4cb7c39

Model save

Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +119 -67
  3. asr_pipeline.py +2 -1
.gitattributes CHANGED
@@ -1,3 +1,4 @@
  *.safetensors filter=lfs diff=lfs merge=lfs -text
  *.bin filter=lfs diff=lfs merge=lfs -text
  tokenizer_config.json -filter -diff -merge text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,71 +1,123 @@
  ---
- license: mit
- language:
- - en
- datasets:
- - speechbrain/LoquaciousSet
- base_model:
- - openai/whisper-large-v3-turbo
- - HuggingFaceTB/SmolLM3-3B
- pipeline_tag: automatic-speech-recognition
+ library_name: transformers
  tags:
- - asr
- - speech-recognition
- - audio
- - smollm
- - whisper
- - mlp
+ - generated_from_trainer
+ model-index:
+ - name: tiny-audio
+   results: []
  ---
  
- # Tiny Audio
-
- A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with the [Tiny Audio](https://github.com/alexkroman/tiny-audio) codebase—a minimal, hackable framework for training ASR models.
-
- ## Architecture
-
- ```
- Audio (16kHz) → Whisper Encoder (frozen) → MLP Projector (trained) → SmolLM3-3B (frozen) → Text
- ```
-
- **MLP Projector:**
- - Convolutional downsampling: 4x sequence compression via two stride-2 conv layers
- - Linear (1280 → 2048) → GELU → Linear (2048 → 2048)
- - Output normalization: RMSNorm
-
- ## Training Details
-
- | | |
- |---|---|
- | **Dataset** | LoquaciousSet (25,000 hours) |
- | **Hardware** | Single NVIDIA A40 40GB |
- | **Training Time** | ~24 hours |
- | **Cost** | ~$12 |
- | **Trainable Parameters** | ~12M (projector only) |
-
- ## Performance
-
- **Word Error Rate (WER): 12.14%** on LoquaciousSet test set.
-
-
- ## Usage
-
- ```python
- from transformers import pipeline
-
- pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
-
- result = pipe("path/to/audio.wav")
- print(result["text"])
- ```
-
- ## Limitations
-
- - English only
- - Optimized for 16kHz audio; other sample rates are resampled automatically
- - Performance may degrade on heavily accented speech, noisy environments, or domain-specific jargon
- - Maximum audio length limited by context window
-
- ## Learn More
-
- - **[Train your own model](https://github.com/alexkroman/tiny-audio)** The full codebase with training scripts
- - **[Free 3.5-hour course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md)** Build your own ASR system from scratch
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # tiny-audio
+
+ This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.2566
+
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 0.0001
+ - train_batch_size: 16
+ - eval_batch_size: 16
+ - seed: 936
+ - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.95) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+ - lr_scheduler_type: cosine
+ - lr_scheduler_warmup_steps: 500
+ - num_epochs: 1
+
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss |
+ |:-------------:|:------:|:-----:|:---------------:|
+ | 0.2888 | 0.0149 | 1000 | 0.2819 |
+ | 0.3565 | 0.0298 | 2000 | 0.2919 |
+ | 0.3189 | 0.0447 | 3000 | 0.2879 |
+ | 0.3274 | 0.0596 | 4000 | 0.2929 |
+ | 0.3231 | 0.0745 | 5000 | 0.2870 |
+ | 0.3270 | 0.0894 | 6000 | 0.2853 |
+ | 0.3486 | 0.1043 | 7000 | 0.2860 |
+ | 0.3066 | 0.1192 | 8000 | 0.2865 |
+ | 0.3487 | 0.1341 | 9000 | 0.2866 |
+ | 0.3307 | 0.1490 | 10000 | 0.2871 |
+ | 0.3419 | 0.1639 | 11000 | 0.2852 |
+ | 0.3601 | 0.1788 | 12000 | 0.2848 |
+ | 0.3156 | 0.1936 | 13000 | 0.2860 |
+ | 0.3098 | 0.2085 | 14000 | 0.2830 |
+ | 0.3133 | 0.2234 | 15000 | 0.2851 |
+ | 0.3269 | 0.2383 | 16000 | 0.2826 |
+ | 0.3257 | 0.2532 | 17000 | 0.2822 |
+ | 0.3281 | 0.2681 | 18000 | 0.2822 |
+ | 0.3941 | 0.2830 | 19000 | 0.2813 |
+ | 0.3875 | 0.2979 | 20000 | 0.2854 |
+ | 0.3214 | 0.3128 | 21000 | 0.2795 |
+ | 0.2914 | 0.3277 | 22000 | 0.2792 |
+ | 0.2951 | 0.3426 | 23000 | 0.2805 |
+ | 0.3343 | 0.3575 | 24000 | 0.2779 |
+ | 0.3252 | 0.3724 | 25000 | 0.2771 |
+ | 0.3027 | 0.3873 | 26000 | 0.2768 |
+ | 0.3287 | 0.4022 | 27000 | 0.2759 |
+ | 0.3208 | 0.4171 | 28000 | 0.2749 |
+ | 0.3402 | 0.4320 | 29000 | 0.2730 |
+ | 0.2928 | 0.4469 | 30000 | 0.2726 |
+ | 0.3085 | 0.4618 | 31000 | 0.2737 |
+ | 0.3073 | 0.4767 | 32000 | 0.2705 |
+ | 0.3471 | 0.4916 | 33000 | 0.2708 |
+ | 0.2945 | 0.5065 | 34000 | 0.2690 |
+ | 0.3294 | 0.5214 | 35000 | 0.2696 |
+ | 0.3095 | 0.5363 | 36000 | 0.2679 |
+ | 0.3152 | 0.5512 | 37000 | 0.2659 |
+ | 0.3035 | 0.5660 | 38000 | 0.2674 |
+ | 0.3342 | 0.5809 | 39000 | 0.2656 |
+ | 0.3242 | 0.5958 | 40000 | 0.2653 |
+ | 0.2789 | 0.6107 | 41000 | 0.2643 |
+ | 0.3082 | 0.6256 | 42000 | 0.2643 |
+ | 0.3174 | 0.6405 | 43000 | 0.2633 |
+ | 0.2730 | 0.6554 | 44000 | 0.2628 |
+ | 0.2934 | 0.6703 | 45000 | 0.2609 |
+ | 0.2944 | 0.6852 | 46000 | 0.2606 |
+ | 0.3111 | 0.7001 | 47000 | 0.2614 |
+ | 0.3431 | 0.7150 | 48000 | 0.2605 |
+ | 0.3226 | 0.7299 | 49000 | 0.2601 |
+ | 0.2735 | 0.7448 | 50000 | 0.2591 |
+ | 0.3208 | 0.7597 | 51000 | 0.2590 |
+ | 0.3208 | 0.7746 | 52000 | 0.2584 |
+ | 0.3021 | 0.7895 | 53000 | 0.2578 |
+ | 0.2730 | 0.8044 | 54000 | 0.2583 |
+ | 0.2938 | 0.8193 | 55000 | 0.2581 |
+ | 0.2894 | 0.8342 | 56000 | 0.2574 |
+ | 0.2781 | 0.8491 | 57000 | 0.2572 |
+ | 0.3003 | 0.8640 | 58000 | 0.2568 |
+ | 0.2719 | 0.8789 | 59000 | 0.2568 |
+ | 0.2878 | 0.8938 | 60000 | 0.2567 |
+ | 0.3058 | 0.9087 | 61000 | 0.2568 |
+ | 0.3036 | 0.9236 | 62000 | 0.2568 |
+ | 0.3050 | 0.9384 | 63000 | 0.2568 |
+ | 0.3244 | 0.9533 | 64000 | 0.2567 |
+ | 0.3187 | 0.9682 | 65000 | 0.2566 |
+ | 0.3016 | 0.9831 | 66000 | 0.2566 |
+ | 0.2697 | 0.9980 | 67000 | 0.2566 |
+
+
+ ### Framework versions
+
+ - Transformers 5.0.0.dev0
+ - Pytorch 2.8.0+cu128
+ - Datasets 3.6.0
+ - Tokenizers 0.22.1
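
Note on the hyperparameters in the new README: the learning-rate curve implied by `learning_rate: 0.0001`, `lr_scheduler_type: cosine`, and `lr_scheduler_warmup_steps: 500` can be sketched in plain Python. This is a rough illustration, not the Trainer's own code. It assumes the usual shape of linear warmup followed by a half-cosine decay to zero, and `TOTAL_STEPS` is an estimate derived from the last table row (step 67000 at epoch 0.9980), not a value stated in the card.

```python
import math

BASE_LR = 1e-4        # learning_rate from the card
WARMUP_STEPS = 500    # lr_scheduler_warmup_steps from the card
TOTAL_STEPS = 67_134  # assumption: estimated as 67000 / 0.9980, not stated in the card


def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then a half-cosine decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Print the learning rate at a few points along the run.
for step in (0, 250, 500, 33_817, 67_000):
    print(f"step {step:>6}: lr = {lr_at(step):.3e}")
```

Under these assumptions the learning rate is close to zero by the last logged steps, which is at least consistent with the validation loss settling around 0.2566 to 0.2568 at the end of the table.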
asr_pipeline.py CHANGED
@@ -215,7 +215,8 @@ class SpeakerDiarizer:
  # Prepare audio input
  if isinstance(audio, np.ndarray):
      # pyannote expects {"waveform": tensor, "sample_rate": int}
-     waveform = torch.from_numpy(audio).unsqueeze(0) # Add channel dim
+     # Copy array to ensure it's writable (avoids PyTorch warning)
+     waveform = torch.from_numpy(audio.copy()).unsqueeze(0) # Add channel dim
      if waveform.dim() == 1:
          waveform = waveform.unsqueeze(0)
      audio_input = {"waveform": waveform, "sample_rate": sample_rate}
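
Note on the `asr_pipeline.py` change: `torch.from_numpy` shares memory with the source array, and PyTorch warns when that array is read-only. The sketch below shows why `.copy()` helps, using NumPy alone for the core check. How the audio array becomes read-only in this pipeline is not stated in the commit, so `np.frombuffer` over an immutable `bytes` object is only an illustrative assumption, and the `torch` part is guarded so the snippet still runs without PyTorch installed.

```python
import numpy as np

# Assumption for illustration: a read-only array, here a view over immutable bytes.
raw = np.frombuffer(bytes(8 * 4), dtype=np.float32)
print("raw writable:  ", raw.flags.writeable)

# .copy() allocates a fresh buffer that the new array owns and may write to.
owned = raw.copy()
print("copy writable: ", owned.flags.writeable)
print("shares memory: ", np.shares_memory(raw, owned))

try:
    import torch
except ImportError:  # PyTorch is optional for this sketch
    torch = None

if torch is not None:
    # Same call shape as in the diff: wrap the copy, then add a channel dimension.
    waveform = torch.from_numpy(owned).unsqueeze(0)
    print("waveform shape:", tuple(waveform.shape))
```

The copy costs one extra allocation the size of the audio buffer, which is usually a small price for a tensor that no longer aliases read-only memory.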