Ar4ikov committed
Commit 08c9a0c · 1 Parent(s): 0cf4344

Add Gradio demo application for GigaAM-v3 speech recognition models

- Implemented the main application logic in app.py for audio transcription across the available model variants.
- Updated README.md to document the new demo features and usage instructions.
- Added requirements.txt listing the necessary dependencies.
- Added runtime.txt pinning the Python version.

Files changed (4)
  1. README.md +43 -3
  2. app.py +254 -0
  3. requirements.txt +14 -0
  4. runtime.txt +2 -0
README.md CHANGED
@@ -4,11 +4,51 @@ emoji: 🔥
 colorFrom: red
 colorTo: green
 sdk: gradio
-sdk_version: 6.0.0
+sdk_version: 4.44.0
 app_file: app.py
 pinned: false
 license: mit
-short_description: A test Gradio space for showcase the capabilitie of GigaAMv3
+short_description: Interactive Gradio Space demonstrating ai-sage/GigaAM-v3 ASR
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# GigaAM-v3 Gradio demo
+
+This Space demonstrates the [`ai-sage/GigaAM-v3`](https://huggingface.co/ai-sage/GigaAM-v3) Russian ASR models, built on a Conformer encoder with a HuBERT-CTC objective. The demo lets you:
+
+- upload or record audio (WAV/MP3/FLAC) directly in the browser,
+- choose between the `ctc`, `rnnt`, `e2e_ctc`, and `e2e_rnnt` checkpoints,
+- switch between a fast single-pass mode and a segmented long-form mode that returns timestamps.
+
+The end-to-end variants (`e2e_*`) produce punctuated, normalized text, while the classic CTC/RNN-T checkpoints return raw transcriptions with lower latency. Long-form mode uses `model.transcribe_longform` and requires a Hugging Face token with access to [`pyannote/segmentation-3.0`](https://huggingface.co/pyannote/segmentation-3.0).
+
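+For programmatic use outside the UI, a minimal sketch mirroring what `app.py` does below (`sample.wav` is a placeholder path):
+
+```python
+from transformers import AutoModel
+
+# revision selects the checkpoint variant: ctc, rnnt, e2e_ctc, or e2e_rnnt
+model = AutoModel.from_pretrained(
+    "ai-sage/GigaAM-v3",
+    revision="e2e_rnnt",
+    trust_remote_code=True,
+)
+print(model.transcribe("sample.wav"))
+```
+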
+## Requirements
+
+- Python 3.10
+- PyTorch / torchaudio 2.8.0
+- `transformers==4.57.1`
+- `gradio==4.44.0` (see `requirements.txt` for the full list)
+- Optional: set `HF_TOKEN` (or `HUGGINGFACEHUB_API_TOKEN`) if you want to use the segmented mode or access private weights.
+
+## Running locally
+
+```bash
+python -m venv .venv
+source .venv/bin/activate  # or .venv\Scripts\activate on Windows
+pip install -r requirements.txt
+
+# optional – needed for long-form segmentation
+export HF_TOKEN=<your_hf_token>
+
+python app.py
+```
+
+Open the printed URL (default `http://127.0.0.1:7860`) and start transcribing.
+
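+You can also call the demo programmatically: the click handler is registered with `api_name="transcribe"`, so `gradio_client` can reach it. A sketch against a local run (`sample.wav` is a placeholder; the mode label must match the UI string exactly):
+
+```python
+from gradio_client import Client, handle_file
+
+client = Client("http://127.0.0.1:7860/")
+transcript, segments, metadata = client.predict(
+    handle_file("sample.wav"),   # audio input
+    "e2e_rnnt",                  # model variant
+    "Short clip (<=150 s)",      # transcription mode
+    api_name="/transcribe",
+)
+print(transcript)
+```
+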
+## Deploying to Hugging Face Spaces
+
+- Keep the YAML front matter above so Spaces can infer the runtime.
+- Upload `app.py`, `requirements.txt`, and `runtime.txt`.
+- Configure an `HF_TOKEN` secret in **Settings → Variables** if you want segmented mode to work for everyone.
+- Assign `CPU Upgrade` or GPU hardware for heavy, long-form workloads.
+
+For more options (custom hardware, scaling, telemetry), review the [Spaces configuration reference](https://huggingface.co/docs/hub/spaces-config-reference).
app.py ADDED
@@ -0,0 +1,254 @@
+"""
+Gradio demo application for the GigaAM-v3 speech recognition models.
+"""
+from __future__ import annotations
+
+import os
+import threading
+import time
+from typing import Dict, List, Optional
+
+import gradio as gr
+import soundfile as sf
+import torch
+from transformers import AutoModel
+
+REPO_ID = "ai-sage/GigaAM-v3"
+
+MODEL_VARIANTS: Dict[str, str] = {
+    "e2e_rnnt": "End-to-end RNN-T • punctuation + normalization (best quality)",
+    "e2e_ctc": "End-to-end CTC • punctuation + normalization (faster)",
+    "rnnt": "RNN-T decoder • raw text without normalization",
+    "ctc": "CTC decoder • fastest baseline",
+}
+DEFAULT_VARIANT = "e2e_rnnt"
+
+MAX_SHORT_SECONDS = float(os.getenv("MAX_AUDIO_DURATION_SECONDS", 150))
+MAX_LONG_SECONDS = float(os.getenv("MAX_LONGFORM_DURATION_SECONDS", 600))
+
+OUTPUT_MODES = {
+    "Short clip (<=150 s)": {
+        "id": "short",
+        "longform": False,
+        "max_duration": MAX_SHORT_SECONDS,
+        "limit_msg": "The recording is longer than 150 seconds. Switch to 'Segmented long-form' for longer files.",
+        "description": "Single call to `model.transcribe`; best latency for concise utterances.",
+        "requires_token": False,
+    },
+    "Segmented long-form (<=10 min)": {
+        "id": "longform",
+        "longform": True,
+        "max_duration": MAX_LONG_SECONDS,
+        "limit_msg": "The audio is longer than 10 minutes. Shorten the recording for segmented mode.",
+        "description": "Calls `model.transcribe_longform` to obtain timestamped segments.",
+        "requires_token": True,
+    },
+}
+DEFAULT_MODE_LABEL = next(iter(OUTPUT_MODES))
+
+HF_TOKEN = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACEHUB_API_TOKEN")
+DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+
+# Models are cached per variant; per-variant locks keep concurrent requests
+# from loading the same checkpoint twice on a cold cache.
+MODEL_CACHE: Dict[str, AutoModel] = {}
+MODEL_LOCKS = {variant: threading.Lock() for variant in MODEL_VARIANTS}
+
+
+def _format_seconds(value: float) -> str:
+    return f"{value:.2f}s"
+
+
+def _read_audio_stats(audio_path: str) -> tuple[float, int]:
+    """Return duration (seconds) and sample rate."""
+    data, sample_rate = sf.read(audio_path)
+    duration = len(data) / float(sample_rate)
+    return duration, int(sample_rate)
+
+
+def _normalize_text(text: object) -> str:
+    """Extract plain text from the string or dict payloads the models return."""
+    if text is None:
+        return ""
+    if isinstance(text, str):
+        return text.strip()
+    if isinstance(text, dict):
+        for key in ("transcription", "text"):
+            if key in text and isinstance(text[key], str):
+                return text[key].strip()
+    return str(text)
+
+
+def load_model(variant: str) -> AutoModel:
+    if variant not in MODEL_VARIANTS:
+        raise gr.Error(f"Model variant '{variant}' is not supported.")
+
+    if variant in MODEL_CACHE:
+        return MODEL_CACHE[variant]
+
+    lock = MODEL_LOCKS[variant]
+    with lock:
+        # Double-checked: another thread may have loaded the model while we
+        # were waiting for the lock.
+        if variant in MODEL_CACHE:
+            return MODEL_CACHE[variant]
+
+        load_kwargs = dict(revision=variant, trust_remote_code=True)
+        if HF_TOKEN:
+            load_kwargs["token"] = HF_TOKEN
+
+        model = AutoModel.from_pretrained(REPO_ID, **load_kwargs)
+
+        try:
+            model.to(DEVICE)
+        except Exception:
+            # Some remote implementations manage their own device placement.
+            pass
+
+        MODEL_CACHE[variant] = model
+        return model
+
+
+def transcribe_audio(
+    audio_path: Optional[str],
+    variant: str,
+    mode_label: str,
+    # Gradio injects a tracked instance when a gr.Progress default is present.
+    progress: gr.Progress = gr.Progress(track_tqdm=False),
+) -> tuple[str, List[List[float | str]], str]:
+    if not audio_path or not os.path.exists(audio_path):
+        raise gr.Error("Upload or record an audio file to start transcription.")
+
+    if mode_label not in OUTPUT_MODES:
+        raise gr.Error("Select a transcription mode.")
+    mode_cfg = OUTPUT_MODES[mode_label]
+
+    duration, sample_rate = _read_audio_stats(audio_path)
+    if duration < 0.3:
+        raise gr.Error("The recording is too short (<300 ms).")
+
+    if duration > mode_cfg["max_duration"]:
+        raise gr.Error(mode_cfg["limit_msg"])
+
+    if mode_cfg["requires_token"] and not HF_TOKEN:
+        raise gr.Error(
+            "Segmented mode requires the HF_TOKEN environment variable "
+            "with access to the 'pyannote/segmentation-3.0' model."
+        )
+
+    progress(0.1, desc="Loading model")
+    model = load_model(variant)
+
+    start_ts = time.perf_counter()
+    progress(0.55, desc="Transcribing speech")
+
+    if mode_cfg["longform"]:
+        utterances = model.transcribe_longform(audio_path)
+        segments: List[List[float | str]] = []
+        assembled_text_parts: List[str] = []
+        for utt in utterances:
+            text = _normalize_text(utt)
+            if isinstance(utt, dict):
+                boundaries = utt.get("boundaries") or utt.get("timestamps")
+            else:
+                boundaries = None
+            if not boundaries:
+                boundaries = (0.0, 0.0)
+            start, end = boundaries
+            segments.append([round(float(start), 2), round(float(end), 2), text])
+            assembled_text_parts.append(text)
+        transcription_text = "\n".join(assembled_text_parts).strip()
+    else:
+        result = model.transcribe(audio_path)
+        transcription_text = _normalize_text(result)
+        segments = []
+
+    latency = time.perf_counter() - start_ts
+    progress(1.0, desc="Done")
+
+    metadata_lines = [
+        f"- **Model variant:** {MODEL_VARIANTS[variant]}",
+        f"- **Transcription mode:** {mode_cfg['description']}",
+        f"- **Audio duration:** {_format_seconds(duration)} @ {sample_rate} Hz",
+        f"- **Latency:** {_format_seconds(latency)} on `{DEVICE}`",
+        f"- **HF token configured:** {'yes' if HF_TOKEN else 'no'}",
+    ]
+
+    return transcription_text, segments, "\n".join(metadata_lines)
+
+
+DESCRIPTION_MD = """
+# GigaAM-v3 · Russian ASR demo
+
+This Space showcases the [`ai-sage/GigaAM-v3`](https://huggingface.co/ai-sage/GigaAM-v3) Conformer-based models.
+
+- Upload or record Russian audio (WAV/MP3/FLAC, mono preferred).
+- Pick the model variant and transcription mode that match your latency/quality needs.
+- Long-form mode returns timestamped segments and requires an `HF_TOKEN` with access to `pyannote/segmentation-3.0`.
+"""
+
+FOOTER_MD = """
+**Tips**
+
+- Short clips (<150 s) work best with the E2E variants (they include punctuation and normalization).
+- Long recordings can take several minutes on CPU-only Spaces; switch to GPU hardware if available.
+- Source: [salute-developers/GigaAM](https://github.com/salute-developers/GigaAM)
+"""
+
+
+def build_interface() -> gr.Blocks:
+    with gr.Blocks(title="GigaAM-v3 ASR demo") as demo:
+        gr.Markdown(DESCRIPTION_MD)
+
+        with gr.Row(equal_height=True):
+            audio_input = gr.Audio(
+                sources=["microphone", "upload"],
+                type="filepath",
+                label="Russian audio",
+                waveform_options=gr.WaveformOptions(
+                    show_controls=True,
+                    waveform_color="#f97316",
+                    skip_length=2,
+                ),
+            )
+
+            with gr.Column():
+                variant_dropdown = gr.Dropdown(
+                    choices=list(MODEL_VARIANTS.keys()),
+                    value=DEFAULT_VARIANT,
+                    label="Model variant",
+                    info="End-to-end variants add punctuation; base CTC/RNNT are lighter but raw.",
+                )
+                mode_radio = gr.Radio(
+                    choices=list(OUTPUT_MODES.keys()),
+                    value=DEFAULT_MODE_LABEL,
+                    label="Transcription mode",
+                    info="Select segmented mode for >150 second clips (requires HF token).",
+                )
+                transcribe_btn = gr.Button("Transcribe", variant="primary")
+
+        transcript_output = gr.Textbox(
+            label="Transcript",
+            placeholder="Model output will appear here…",
+            lines=8,
+        )
+
+        segments_output = gr.Dataframe(
+            headers=["Start (s)", "End (s)", "Utterance"],
+            datatype=["number", "number", "str"],
+            label="Segments (long-form mode)",
+            interactive=False,
+        )
+
+        metadata_output = gr.Markdown()
+        gr.Markdown(FOOTER_MD)
+
+        transcribe_btn.click(
+            fn=transcribe_audio,
+            inputs=[audio_input, variant_dropdown, mode_radio],
+            outputs=[transcript_output, segments_output, metadata_output],
+            api_name="transcribe",
+        )
+
+    return demo
+
+
+demo = build_interface()
+
+if __name__ == "__main__":
+    demo.launch()
+
requirements.txt ADDED
@@ -0,0 +1,14 @@
+torch==2.8.0
+torchaudio==2.8.0
+transformers==4.57.1
+gradio==4.44.0
+soundfile>=0.12.1
+numpy>=1.26.4
+hydra-core>=1.3.2
+omegaconf>=2.3.0
+sentencepiece>=0.1.99
+pyannote.audio==4.0.0
+torchcodec==0.7.0
+accelerate>=0.34.2
+huggingface_hub>=0.25.2
+
runtime.txt ADDED
@@ -0,0 +1,2 @@
+python-3.10
+