---
title: GigaAMv3 Preview
emoji: 🔥
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
pinned: false
license: mit
short_description: Interactive Gradio Space demonstrating ai-sage/GigaAM-v3 ASR
hf_oauth: true
hf_oauth_scopes:
- read-repos

---

# GigaAM-v3 Gradio demo

This Space demonstrates the [`ai-sage/GigaAM-v3`](https://huggingface.co/ai-sage/GigaAM-v3) Russian ASR models, built on a Conformer encoder with a HuBERT-CTC objective. The demo lets you:

- upload or record audio (WAV/MP3/FLAC) directly in the browser,
- choose between the `ctc`, `rnnt`, `e2e_ctc`, and `e2e_rnnt` checkpoints,
- switch between a fast single-pass mode and a segmented long-form mode that returns timestamps.

The end-to-end variants (`e2e_*`) produce punctuated, normalized text, while the classic CTC/RNN-T checkpoints return raw transcriptions with lower latency. Long-form mode uses `model.transcribe_longform` and requires a Hugging Face token with access to [`pyannote/segmentation-3.0`](https://huggingface.co/pyannote/segmentation-3.0).
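
For reference, loading a checkpoint and calling both paths can look like the sketch below. It assumes the `gigaam` package API from the official repo; the checkpoint identifiers are taken from the list above and the exact v3 names passed to `load_model` may differ:

```python
import gigaam  # pip install gigaam, per the official repo

# Checkpoint name from the UI list above; exact v3 identifiers may differ.
model = gigaam.load_model("e2e_rnnt")

# Fast single-pass mode for short clips (roughly up to 25 s).
print(model.transcribe("sample.wav"))

# Segmented long-form mode; needs a token with access to
# pyannote/segmentation-3.0 (set HF_TOKEN before running).
for utterance in model.transcribe_longform("long_sample.wav"):
    start, end = utterance["boundaries"]
    print(f"[{start:.2f}-{end:.2f}] {utterance['transcription']}")
```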

**Short-form limits & audio pre-processing**

- `model.transcribe` in GigaAM handles clips of roughly up to **25 seconds**, even though the UI accepts recordings up to 150 seconds.
- All incoming audio (uploads and microphone recordings) is converted to mono PCM at 16 kHz before inference, matching the recommendation in the [official repo](https://github.com/salute-developers/GigaAM/); a sketch of this conversion follows the list.
- If a clip exceeds the short-form limit, the app transparently switches to segmented mode (which requires an auth token) instead of failing with "Too long wav file".
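
One way such a conversion can be implemented with torchaudio; the helper name here is hypothetical and `app.py` may do this differently:

```python
import torch
import torchaudio

def to_mono_16k(path: str, target_sr: int = 16_000) -> torch.Tensor:
    """Hypothetical helper: load audio, downmix to mono, resample to 16 kHz."""
    waveform, sr = torchaudio.load(path)  # shape: (channels, samples)
    if waveform.size(0) > 1:
        waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    return waveform
```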

## Requirements

- Python 3.10
- PyTorch / torchaudio 2.8.0
- `transformers==4.57.1`
- `gradio==6.0.0` (see `requirements.txt` for the full list)
- Optional: set `HF_TOKEN` (or the legacy `HUGGINGFACEHUB_API_TOKEN`) to enable segmented mode or access private weights; a token-resolution sketch follows this list.
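
A helper consistent with that fallback order might look like this (the function name is hypothetical):

```python
import os

def resolve_hf_token() -> str | None:
    """Hypothetical helper: prefer HF_TOKEN, fall back to the legacy variable."""
    return os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACEHUB_API_TOKEN")
```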

## Running locally

```bash
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install -r requirements.txt

# optional – needed for long-form segmentation
export HF_TOKEN=<your_hf_token>

python app.py
```

Open the printed URL (default `http://127.0.0.1:7860`) and start transcribing.

## Authentication & user tokens

This Space enables the Hugging Face OAuth flow (see the [Spaces OAuth docs](https://huggingface.co/docs/hub/spaces-oauth)). When you click the "Sign in with Hugging Face" button in the UI:

- The returned access token is stored only in your session and used to access `pyannote/segmentation-3.0` for long-form transcription.
- You can sign out at any time, or rely on the space-level `HF_TOKEN` secret if provided by the maintainer.
- Without a token you can still run the short-form mode (< 25 s), but segmented transcription is disabled; the sketch below shows how the per-session token reaches the app.
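
A minimal sketch of how the per-session token can reach an event handler, assuming Gradio's built-in OAuth helpers (`gr.LoginButton` and a `gr.OAuthToken`-typed parameter); the actual wiring in `app.py` may differ:

```python
import gradio as gr

def transcribe(audio_path, oauth_token: gr.OAuthToken | None):
    # On Spaces, Gradio injects the signed-in user's token (or None) per
    # session; OAuthToken-typed parameters are not listed in `inputs`.
    if oauth_token is None:
        return "Short-form mode only: sign in to enable segmented transcription."
    token = oauth_token.token
    # ... pass `token` through to pyannote/segmentation-3.0 for long-form mode
    return f"(long-form transcription of {audio_path} would go here)"

with gr.Blocks() as demo:
    gr.LoginButton()  # renders the "Sign in with Hugging Face" button
    audio = gr.Audio(type="filepath")
    text = gr.Textbox(label="Transcription")
    audio.change(transcribe, inputs=audio, outputs=text)

demo.launch()
```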

## Deploying to Hugging Face Spaces

- Keep the YAML front matter above so Spaces can infer the runtime.
- Upload `app.py`, `requirements.txt`, and `runtime.txt` (a sample `runtime.txt` follows this list).
- Configure an `HF_TOKEN` secret in **Settings → Variables** if you want segmented mode to work for everyone.
- Assign `CPU Upgrade` or GPU hardware for heavy, long-form workloads.
- (Optional) Leave `hf_oauth: true` in the metadata to enable the built-in "Sign in with HF" button powered by OAuth/OpenID Connect.
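
If `runtime.txt` is used to pin the interpreter, it is a single line; the exact version string below is an assumption based on the Python 3.10 requirement above:

```
python-3.10
```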

For more options (custom hardware, scaling, telemetry), review the [Spaces configuration reference](https://huggingface.co/docs/hub/spaces-config-reference).