---
license: mit
language:
- en
datasets:
- speechbrain/LoquaciousSet
base_model:
- zai-org/GLM-ASR-Nano-2512
- Qwen/Qwen3-0.6B
pipeline_tag: automatic-speech-recognition
tags:
- asr
- speech-recognition
- audio
- qwen
- glm-asr
library_name: transformers
---

# Tiny Audio

A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with [Tiny Audio](https://github.com/alexkroman/tiny-audio), a minimal, hackable ASR framework.

## Quick Start

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
result = pipe("audio.wav")
print(result["text"])
```

## Usage Examples

### Basic Transcription

```python
import numpy as np
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# From file
result = pipe("audio.wav")
print(result["text"])

# From URL
result = pipe("https://example.com/audio.mp3")

# From numpy array (must be 16kHz)
audio = np.random.randn(16000).astype(np.float32)  # 1 second
result = pipe(audio)
```

### Batch Processing

```python
# Process multiple files
files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = pipe(files, batch_size=4)
for r in results:
    print(r["text"])
```

### Word-Level Timestamps

```python
result = pipe("audio.wav", return_timestamps="word")
# Returns:
# {
#     "text": "hello world",
#     "chunks": [
#         {"text": "hello", "timestamp": (0.0, 0.5)},
#         {"text": "world", "timestamp": (0.6, 1.0)}
#     ]
# }
```

### Streaming Inference

```python
import librosa
from tiny_audio import ASRModel, ASRProcessor

model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load and process audio
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Stream tokens as they are generated
for token in model.generate_streaming(inputs["input_features"]):
    print(token, end="", flush=True)
```

### Using with torch directly

```python
import librosa
import torch
from tiny_audio import ASRModel, ASRProcessor

# Load model and processor
model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load audio (16kHz)
audio, sr = librosa.load("audio.wav", sr=16000)

# Process
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate
with torch.no_grad():
    output = model.generate(
        input_features=inputs["input_features"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=256,
    )

# Decode
text = processor.batch_decode(output, skip_special_tokens=True)[0]
print(text)
```

### GPU Inference

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    device="cuda",  # or device=0
)
```

### Half Precision

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device="cuda",
)
```

## Architecture

```
Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text
```

Only the projector (~12M params) is trained; the encoder and decoder remain frozen, leveraging their pretrained knowledge.

| Component | Model | Parameters | Status |
|-----------|-------|------------|--------|
| Audio Encoder | GLM-ASR-Nano-2512 | ~600M | Frozen |
| Projector | 2-layer MLP | ~12M | Trained |
| Language Model | Qwen3-0.6B | ~600M | Frozen |

### How It Works

1. **Audio Encoder**: GLM-ASR converts 16kHz audio into frame-level embeddings (768-dim)
2. **Projector**: A 2-layer MLP with frame stacking bridges the audio and text embedding spaces
3. **Language Model**: Qwen3 generates text autoregressively, conditioned on the projected audio

The projector reduces sequence length by stacking consecutive encoder frames in non-overlapping groups of 5: `output_len = (input_len - 5) // 5 + 1`

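To make the projector concrete, here is a minimal sketch of a frame-stacking MLP projector consistent with the description above. The 768-dim encoder frames, the 2-layer MLP, and the stack size of 5 come from this card; the hidden width, the GELU activation, and the 1024-dim LM embedding size are illustrative assumptions, not the exact released architecture.

```python
import torch
import torch.nn as nn


class FrameStackingProjector(nn.Module):
    """Sketch only: layer sizes are assumptions, not the released weights' exact shapes."""

    def __init__(self, encoder_dim=768, lm_dim=1024, hidden_dim=2048, stack_size=5):
        super().__init__()
        self.stack_size = stack_size
        self.mlp = nn.Sequential(
            nn.Linear(encoder_dim * stack_size, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, lm_dim),
        )

    def forward(self, encoder_states: torch.Tensor) -> torch.Tensor:
        # encoder_states: (batch, input_len, encoder_dim)
        batch, input_len, dim = encoder_states.shape
        # output_len = (input_len - 5) // 5 + 1, i.e. non-overlapping groups of 5 frames
        output_len = (input_len - self.stack_size) // self.stack_size + 1
        frames = encoder_states[:, : output_len * self.stack_size, :]
        stacked = frames.reshape(batch, output_len, dim * self.stack_size)
        return self.mlp(stacked)  # (batch, output_len, lm_dim)


projector = FrameStackingProjector()
print(projector(torch.randn(1, 100, 768)).shape)  # torch.Size([1, 20, 1024])
```
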
## Model Specifications

| Specification | Value |
|---------------|-------|
| Input | Audio (16kHz mono) |
| Output | Text transcription |
| Max Audio Length | ~30 seconds (limited by the encoder) |
| Vocabulary | Qwen3 tokenizer |
| Languages | English only |
| Generation | Greedy decoding (num_beams=1, do_sample=False) |

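Generation uses greedy decoding by default. If you want different generation settings (for example, a larger token budget for longer clips), the stock transformers ASR pipeline forwards a `generate_kwargs` argument to `model.generate()`; whether the remote-code pipeline here honors every option is an assumption worth verifying on your own audio.

```python
# Assumes the remote-code pipeline forwards generate_kwargs like the stock
# transformers ASR pipeline; verify the behavior on your own audio.
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
)
result = pipe("audio.wav", generate_kwargs={"max_new_tokens": 256})
print(result["text"])
```
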
## Training Details

| Setting | Value |
|---------|-------|
| **Dataset** | LoquaciousSet (25,000 hours) |
| **Hardware** | Single NVIDIA A40 |
| **Time** | ~24 hours |
| **Cost** | ~$12 |
| **Optimizer** | AdamW |
| **Learning Rate** | 1e-4 |
| **Batch Size** | 4 |
| **Steps** | 50,000 |

## Limitations

- **English only**: Not trained on other languages
- **Sample rate**: Expects 16kHz audio; file and URL inputs are resampled automatically, but raw arrays must already be 16kHz (see the resampling sketch below)
- **Audio length**: Best for clips under 30 seconds
- **Accuracy**: May degrade on:
  - Heavily accented speech
  - Noisy or low-quality audio
  - Domain-specific terminology
  - Overlapping speakers
- **No punctuation**: Output is lowercase without punctuation by default

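If you pass raw arrays instead of file paths, resample to 16kHz yourself first. A minimal sketch using librosa (listed under the optional requirements below); the file name is a placeholder:

```python
# Sketch: resample arbitrary-rate audio to 16kHz mono before passing a raw array.
# "speech_44k.wav" is a placeholder file name.
import librosa

audio, _ = librosa.load("speech_44k.wav", sr=16000, mono=True)
result = pipe(audio)
print(result["text"])
```

This reuses the `pipe` object from the Quick Start example.
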
## Requirements

```
transformers>=4.40.0
torch>=2.0.0
torchaudio>=2.0.0
```

Optional for streaming:

```
librosa
soundfile
```

## Files

| File | Description |
|------|-------------|
| `config.json` | Model configuration |
| `model.safetensors` | Projector weights (~48MB) |
| `preprocessor_config.json` | Audio preprocessing config |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer config |
| `special_tokens_map.json` | Special tokens |

Note: Only the projector weights are stored in this repository. The encoder (GLM-ASR) and decoder (Qwen3) are loaded from their respective Hugging Face repos.

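Since only the ~48MB projector file lives in this repository, you can fetch and inspect it on its own. A sketch using `huggingface_hub` and `safetensors` (install them separately if they did not come in with transformers):

```python
# Sketch: download only the projector weights and list their tensor shapes.
from huggingface_hub import hf_hub_download
from safetensors import safe_open

path = hf_hub_download("mazesmazes/tiny-audio", "model.safetensors")
with safe_open(path, framework="pt") as f:
    for name in f.keys():
        print(name, tuple(f.get_tensor(name).shape))
```
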
## Citation

If you use this model, please cite:

```bibtex
@misc{tinyaudio2024,
  author    = {Alex Kroman},
  title     = {Tiny Audio: Minimal ASR Training},
  year      = {2024},
  publisher = {GitHub},
  url       = {https://github.com/alexkroman/tiny-audio}
}
```

## Links

- [GitHub Repository](https://github.com/alexkroman/tiny-audio) - Train your own model
- [Free 3.5-hour Course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md) - Learn ASR from scratch
- [Live Demo](https://huggingface.co/spaces/mazesmazes/tiny-audio) - Try it in your browser

## Acknowledgments

- [GLM-ASR](https://huggingface.co/zai-org/GLM-ASR-Nano-2512) for the audio encoder
- [Qwen3](https://huggingface.co/Qwen/Qwen3-0.6B) for the language model
- [LoquaciousSet](https://huggingface.co/datasets/speechbrain/LoquaciousSet) for the training data

## License

MIT