---
license: mit
---

# tinyGemma Urdu

A Gemma-style language model with 0.96 million parameters, trained on an Urdu corpus.

## References

- **Gemma Paper**: https://arxiv.org/abs/2503.19786 - Core architecture and design principles
- **RMSNorm**: https://arxiv.org/abs/1910.07467 - Root Mean Square Layer Normalization
- **RoPE**: https://arxiv.org/abs/2104.09864 - Rotary Position Embedding methodology
- **Grouped Query Attention**: https://arxiv.org/abs/2305.13245 - Memory-efficient attention mechanism
- **SwiGLU/GELU**: https://arxiv.org/abs/2002.05202 - Gated linear unit activations

## Architecture

A compact version of Google's Gemma architecture with the following components, configured through `GemmaConfig`:

- **GemmaAttention**: Multi-head attention with grouped query attention (`num_queries_per_kv`), RoPE positional embeddings applied via `apply_rotary_emb()`, and causal masking using a pre-computed triangular mask
- **GemmaMLP**: Feed-forward network with GELU activation, implementing a `gate_proj * up_proj` gating mechanism followed by `down_proj`
- **GemmaDecoderLayer**: Transformer block combining `self_attn` and `mlp` with pre-normalization using RMSNorm
- **RMSNorm**: Root Mean Square Layer Normalization with an optional unit offset (`add_unit_offset=True`) and a learnable weight parameter
- **tinyGemma**: Complete model with the token embedder scaled by `sqrt(hidden_size)` and its weights tied to the language modeling head

Minimal sketches of the main components are provided in the appendix at the end of this card.

## Training Results

The model converged on the Urdu corpus with the following performance metrics:

```
Final Training Metrics (5000 iterations):
- Training Loss: 2.7668
- Validation Loss: 2.9250
- Validation Perplexity: 18.6348
- Learning Rate: 3e-4 with AdamW optimizer
- Batch Size: 16 with 2 gradient accumulation steps
```

### Loss Curves

![Train and Val loss curves](loss.png)

## License

MIT License
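
## Appendix: Component Sketches

The sketches below illustrate the components named in the Architecture section. They are minimal, self-contained PyTorch approximations written for this card, not the released training code, and hyperparameters such as `head_dim`, `max_seq_len`, or `intermediate_size` are placeholders rather than values taken from `GemmaConfig`.

The first sketch shows one common way to implement the rotary position embedding applied by `apply_rotary_emb()`. The exact channel-pairing convention varies between implementations, so treat the pairing used here as an assumption.

```python
import torch


def precompute_freqs_cis(head_dim: int, max_seq_len: int, theta: float = 10000.0) -> torch.Tensor:
    """Precompute the complex rotation factors e^{i * m * theta_k} used by RoPE."""
    # One frequency per pair of channels.
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    angles = torch.outer(positions, freqs)                # (max_seq_len, head_dim // 2)
    return torch.polar(torch.ones_like(angles), angles)   # complex64 rotation factors


def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    """Rotate query/key tensors of shape (batch, seq_len, n_heads, head_dim)."""
    # Treat consecutive channel pairs as complex numbers and rotate each by its position angle.
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    rot = freqs_cis[: x.shape[1]].unsqueeze(0).unsqueeze(2)  # broadcast over batch and heads
    return torch.view_as_real(x_complex * rot).flatten(-2).type_as(x)
```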
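
Next, a sketch of the GELU-gated feed-forward block described above (`gate_proj * up_proj`, then `down_proj`). The tanh GELU approximation and the bias-free linear layers follow the Gemma reference design and are assumptions here, not values read from this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GemmaMLP(nn.Module):
    """Gated feed-forward block: GELU(gate_proj(x)) * up_proj(x) -> down_proj."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # Bias-free projections, as in the Gemma reference design (assumption).
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GELU-gated linear unit: the gate branch is passed through GELU and
        # multiplied elementwise with the "up" branch before projecting back down.
        gate = F.gelu(self.gate_proj(x), approximate="tanh")
        return self.down_proj(gate * self.up_proj(x))
```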
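
Finally, a sketch of RMSNorm with the unit offset enabled. Initialising the learnable weight to zero (so the effective initial scale is `1 + 0 = 1`) and computing the norm in float32 follow the Gemma convention and are assumptions; `eps` is likewise a placeholder default.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization with an optional unit offset."""

    def __init__(self, dim: int, eps: float = 1e-6, add_unit_offset: bool = True):
        super().__init__()
        self.eps = eps
        self.add_unit_offset = add_unit_offset
        # Learnable per-channel scale; with the unit offset the effective
        # scale applied at forward time is (1 + weight).
        self.weight = nn.Parameter(torch.zeros(dim))

    def _norm(self, x: torch.Tensor) -> torch.Tensor:
        # Divide by the root mean square over the last dimension.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalise in float32 for numerical stability, then cast back.
        out = self._norm(x.float())
        scale = (1 + self.weight) if self.add_unit_offset else self.weight
        return (out * scale.float()).type_as(x)
```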