---
license: mit
language:
- en
datasets:
- speechbrain/LoquaciousSet
base_model:
- zai-org/GLM-ASR-Nano-2512
- Qwen/Qwen3-0.6B
pipeline_tag: automatic-speech-recognition
tags:
- asr
- speech-recognition
- audio
- qwen
- glm-asr
library_name: transformers
---
# Tiny Audio
A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with [Tiny Audio](https://github.com/alexkroman/tiny-audio)—a minimal, hackable ASR framework.
## Quick Start
```python
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
result = pipe("audio.wav")
print(result["text"])
```
## Usage Examples
### Basic Transcription
```python
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
# From file
result = pipe("audio.wav")
print(result["text"])
# From URL
result = pipe("https://example.com/audio.mp3")
# From numpy array (must be 16kHz)
import numpy as np
audio = np.random.randn(16000).astype(np.float32) # 1 second
result = pipe(audio)
```
### Batch Processing
```python
# Process multiple files
files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = pipe(files, batch_size=4)
for r in results:
    print(r["text"])
```
### Word-Level Timestamps
```python
result = pipe("audio.wav", return_timestamps="word")
# Returns:
# {
# "text": "hello world",
# "chunks": [
# {"text": "hello", "timestamp": (0.0, 0.5)},
# {"text": "world", "timestamp": (0.6, 1.0)}
# ]
# }
```
### Streaming Inference
```python
from tiny_audio import ASRModel, ASRProcessor
import torch
model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")
# Load and process audio
import librosa
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
# Stream tokens
for token in model.generate_streaming(inputs["input_features"]):
    print(token, end="", flush=True)
```
### Using with torch directly
```python
from tiny_audio import ASRModel, ASRProcessor
import torch
import librosa
# Load model and processor
model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")
# Load audio (16kHz)
audio, sr = librosa.load("audio.wav", sr=16000)
# Process
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
# Generate
with torch.no_grad():
    output = model.generate(
        input_features=inputs["input_features"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=256,
    )
# Decode
text = processor.batch_decode(output, skip_special_tokens=True)[0]
print(text)
```
### GPU Inference
```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    device="cuda",  # or device=0
)
```
### Half Precision
```python
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device="cuda",
)
```
## Architecture
```
Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text
```
Only the projector is trained (~12M params). The encoder and decoder remain frozen, leveraging their pretrained knowledge.
| Component | Model | Parameters | Status |
|-----------|-------|------------|--------|
| Audio Encoder | GLM-ASR-Nano-2512 | ~600M | Frozen |
| Projector | 2-layer MLP | ~12M | Trained |
| Language Model | Qwen3-0.6B | ~600M | Frozen |
### How It Works
1. **Audio Encoder**: GLM-ASR converts 16kHz audio into frame-level embeddings (768-dim)
2. **Projector**: A 2-layer MLP with frame stacking bridges the audio and text embedding spaces
3. **Language Model**: Qwen3 generates text autoregressively, conditioned on the projected audio
The projector reduces sequence length via frame stacking: `output_len = (input_len - 5) // 5 + 1`
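For intuition, here is a minimal sketch of a frame-stacking MLP projector. The layer sizes, activation, and class name are illustrative assumptions rather than the exact implementation; only the stacking factor of 5 and the 768-dim encoder output come from this card.

```python
import torch
import torch.nn as nn

class StackingProjector(nn.Module):
    """Illustrative 2-layer MLP projector with frame stacking (sizes are assumptions)."""

    def __init__(self, encoder_dim=768, lm_dim=1024, stack=5, hidden=2048):
        super().__init__()
        self.stack = stack
        # Each output frame sees `stack` consecutive encoder frames concatenated together.
        self.net = nn.Sequential(
            nn.Linear(encoder_dim * stack, hidden),
            nn.GELU(),
            nn.Linear(hidden, lm_dim),
        )

    def forward(self, x):  # x: (batch, frames, encoder_dim)
        # Non-overlapping windows of `stack` frames.
        # out_len = (frames - stack) // stack + 1, matching the formula above.
        windows = x.unfold(1, self.stack, self.stack)  # (batch, out_len, encoder_dim, stack)
        windows = windows.transpose(2, 3).flatten(2)   # (batch, out_len, stack * encoder_dim)
        return self.net(windows)                       # (batch, out_len, lm_dim)

frames = torch.randn(1, 100, 768)         # 100 encoder frames
print(StackingProjector()(frames).shape)  # torch.Size([1, 20, 1024])
```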
## Model Specifications
| Specification | Value |
|---------------|-------|
| Input | Audio (16kHz mono) |
| Output | Text transcription |
| Max Audio Length | ~30 seconds (limited by encoder; see the chunking sketch below) |
| Vocabulary | Qwen3 tokenizer |
| Languages | English only |
| Generation | Greedy decoding (num_beams=1, do_sample=False) |
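Because the encoder caps inputs at roughly 30 seconds, longer recordings need to be split before transcription. The sketch below does naive fixed-length chunking with the pipeline from Quick Start; it is one possible workaround rather than a built-in long-form mode, and chunk boundaries can cut through words.

```python
import numpy as np
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

def transcribe_long(audio: np.ndarray, sr: int = 16000, chunk_s: int = 30) -> str:
    """Split audio into fixed 30-second chunks and concatenate the transcripts."""
    chunk_len = chunk_s * sr
    chunks = [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]
    return " ".join(pipe(c)["text"].strip() for c in chunks)

# Example: two minutes of (random) 16kHz audio
long_audio = np.random.randn(120 * 16000).astype(np.float32)
print(transcribe_long(long_audio))
```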
## Training Details
| | |
|---|---|
| **Dataset** | LoquaciousSet (25,000 hours) |
| **Hardware** | Single NVIDIA A40 |
| **Time** | ~24 hours |
| **Cost** | ~$12 |
| **Optimizer** | AdamW |
| **Learning Rate** | 1e-4 |
| **Batch Size** | 4 |
| **Steps** | 50,000 |
## Limitations
- **English only**: Not trained on other languages
- **Sample rate**: Expects 16kHz audio (other rates are resampled automatically; for manual resampling see the sketch after this list)
- **Audio length**: Best for clips under 30 seconds
- **Accuracy**: May degrade on:
- Heavily accented speech
- Noisy or low-quality audio
- Domain-specific terminology
- Overlapping speakers
- **No punctuation**: Output is lowercase without punctuation by default
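When calling the model directly (as in the torch example above), the audio must already be 16kHz mono. A minimal resampling sketch using torchaudio, which is listed in the requirements, is shown below; the filename is a placeholder.

```python
import torchaudio
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# Load at whatever rate / channel count the file has, then convert to 16kHz mono
waveform, sr = torchaudio.load("audio_44k_stereo.wav")  # (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)            # downmix to mono
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)

print(pipe(waveform.squeeze(0).numpy())["text"])
```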
## Requirements
```
transformers>=4.40.0
torch>=2.0.0
torchaudio>=2.0.0
```
Optional for streaming:
```
librosa
soundfile
```
## Files
| File | Description |
|------|-------------|
| `config.json` | Model configuration |
| `model.safetensors` | Projector weights (~48MB) |
| `preprocessor_config.json` | Audio preprocessing config |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer config |
| `special_tokens_map.json` | Special tokens |
Note: Only the projector weights are stored. The encoder (GLM-ASR) and decoder (Qwen3) are loaded from their respective HuggingFace repos.
## Citation
If you use this model, please cite:
```bibtex
@misc{tinyaudio2024,
  author    = {Alex Kroman},
  title     = {Tiny Audio: Minimal ASR Training},
  year      = {2024},
  publisher = {GitHub},
  url       = {https://github.com/alexkroman/tiny-audio}
}
```
## Links
- [GitHub Repository](https://github.com/alexkroman/tiny-audio) - Train your own model
- [Free 3.5-hour Course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md) - Learn ASR from scratch
- [Live Demo](https://huggingface.co/spaces/mazesmazes/tiny-audio) - Try it in your browser
## Acknowledgments
- [GLM-ASR](https://huggingface.co/zai-org/GLM-ASR-Nano-2512) for the audio encoder
- [Qwen3](https://huggingface.co/Qwen/Qwen3-0.6B) for the language model
- [LoquaciousSet](https://huggingface.co/datasets/speechbrain/LoquaciousSet) for training data
## License
MIT