---
license: mit
---

# tinyGemma Urdu

A Gemma-style language model with 0.96 million parameters, trained on an Urdu corpus.

## References

- **Gemma Paper**: https://arxiv.org/abs/2503.19786 - Core architecture and design principles
- **RMSNorm**: https://arxiv.org/abs/1910.07467 - Root Mean Square Layer Normalization
- **RoPE**: https://arxiv.org/abs/2104.09864 - Rotary Position Embedding methodology
- **Grouped Query Attention**: https://arxiv.org/abs/2305.13245 - Memory-efficient attention mechanism
- **SwiGLU/GELU**: https://arxiv.org/abs/2002.05202 - Gated linear unit activations

## Architecture

A compact version of Google's Gemma architecture with the following components, configured through `GemmaConfig`:

- **GemmaAttention**: Multi-head attention with grouped query attention (`num_queries_per_kv`), RoPE positional embeddings applied via `apply_rotary_emb()`, and causal masking using a pre-computed triangular mask
- **GemmaMLP**: Feed-forward network with GELU activation, implementing a `gate_proj * up_proj` gating mechanism followed by `down_proj`
- **GemmaDecoderLayer**: Transformer block combining `self_attn` and `mlp` with pre-normalization using RMSNorm
- **RMSNorm**: Root Mean Square Layer Normalization with an optional unit offset (`add_unit_offset=True`) and a learnable weight parameter
- **tinyGemma**: Complete model with the token embedder scaled by `sqrt(hidden_size)` and its weights tied to the language modeling head

Minimal sketches of the main components are provided in the appendix at the end of this card.

## Training Results

The model converged on the Urdu corpus with the following performance metrics:

```
Final Training Metrics (5000 iterations):
- Training Loss: 2.7668
- Validation Loss: 2.9250
- Validation Perplexity: 18.6348
- Learning Rate: 3e-4 with AdamW optimizer
- Batch Size: 16 with 2 gradient accumulation steps
```

### Loss Curves

![Train and Val loss curves](loss.png)

## License

MIT License
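
## Appendix: Component Sketches

The sketches below illustrate the components named in the Architecture section. They are minimal, self-contained PyTorch approximations written for this card, not the released training code, and hyperparameters such as `head_dim`, `max_seq_len`, or `intermediate_size` are placeholders rather than values taken from `GemmaConfig`.

The first sketch shows one common way to implement the rotary position embedding applied by `apply_rotary_emb()`. The exact channel-pairing convention varies between implementations, so treat the pairing used here as an assumption.

```python
import torch


def precompute_freqs_cis(head_dim: int, max_seq_len: int, theta: float = 10000.0) -> torch.Tensor:
    """Precompute the complex rotation factors e^{i * m * theta_k} used by RoPE."""
    # One frequency per pair of channels.
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    angles = torch.outer(positions, freqs)                # (max_seq_len, head_dim // 2)
    return torch.polar(torch.ones_like(angles), angles)   # complex64 rotation factors


def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    """Rotate query/key tensors of shape (batch, seq_len, n_heads, head_dim)."""
    # Treat consecutive channel pairs as complex numbers and rotate each by its position angle.
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    rot = freqs_cis[: x.shape[1]].unsqueeze(0).unsqueeze(2)  # broadcast over batch and heads
    return torch.view_as_real(x_complex * rot).flatten(-2).type_as(x)
```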
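
Next, a sketch of the GELU-gated feed-forward block described above (`gate_proj * up_proj`, then `down_proj`). The tanh GELU approximation and the bias-free linear layers follow the Gemma reference design and are assumptions here, not values read from this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GemmaMLP(nn.Module):
    """Gated feed-forward block: GELU(gate_proj(x)) * up_proj(x) -> down_proj."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # Bias-free projections, as in the Gemma reference design (assumption).
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GELU-gated linear unit: the gate branch is passed through GELU and
        # multiplied elementwise with the "up" branch before projecting back down.
        gate = F.gelu(self.gate_proj(x), approximate="tanh")
        return self.down_proj(gate * self.up_proj(x))
```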
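
Finally, a sketch of RMSNorm with the unit offset enabled. Initialising the learnable weight to zero (so the effective initial scale is `1 + 0 = 1`) and computing the norm in float32 follow the Gemma convention and are assumptions; `eps` is likewise a placeholder default.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization with an optional unit offset."""

    def __init__(self, dim: int, eps: float = 1e-6, add_unit_offset: bool = True):
        super().__init__()
        self.eps = eps
        self.add_unit_offset = add_unit_offset
        # Learnable per-channel scale; with the unit offset the effective
        # scale applied at forward time is (1 + weight).
        self.weight = nn.Parameter(torch.zeros(dim))

    def _norm(self, x: torch.Tensor) -> torch.Tensor:
        # Divide by the root mean square over the last dimension.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalise in float32 for numerical stability, then cast back.
        out = self._norm(x.float())
        scale = (1 + self.weight) if self.add_unit_offset else self.weight
        return (out * scale.float()).type_as(x)
```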