# Urdu-BERT Pretraining (PyTorch)
I implemented a BERT model from scratch in PyTorch and pretrained it on Urdu data. It uses the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives, just like the original BERT.
## Features
- Trained on Urdu-1M-news-text
- Custom WordPiece tokenizer (see the sketch after this list)
- Multi-head attention & transformer encoder blocks
- NSP and MLM heads
- Uses PyTorch and HuggingFace Tokenizers
- Training tracked with Weights & Biases (WandB)
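
As a rough sketch of how such a WordPiece tokenizer can be trained with the HuggingFace Tokenizers library; the corpus file name and vocabulary size below are illustrative assumptions, not values taken from this repository:

```python
# Sketch: training a WordPiece tokenizer on raw Urdu text with the
# HuggingFace `tokenizers` library. File name and vocab size are assumptions.
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30_000,  # assumed; the repo does not state its vocabulary size
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["urdu_corpus.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("urdu-wordpiece.json")
```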
## Training Setup
| Setting | Value |
|---|---|
| Epochs | 20 |
| Batch Size | 64 |
| Sequence Length | 64 tokens |
| Embedding Size | 128 |
| Encoder Layers | 2 |
| Attention Heads | 2 |
| Max LR | 2.5e-5 |
| Warmup Steps | 1000 |
| Optimizer | Adam |
## Training Loss (WandB)
Below is an example of the loss curve during training:
It shows the MLM loss, NSP loss, and total loss decreasing over time.
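
A sketch of how the total loss is typically formed from the two objectives; the convention of marking unmasked positions with a -100 label is an assumption about the data pipeline, not something stated in this repository:

```python
import torch.nn.functional as F

def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    # mlm_logits: (batch, seq_len, vocab); mlm_labels: (batch, seq_len),
    # with -100 at the positions that were not masked (assumed convention).
    mlm_loss = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,              # only the masked positions contribute
    )
    nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)  # nsp_logits: (batch, 2)
    return mlm_loss, nsp_loss, mlm_loss + nsp_loss
```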
## Notes
- NSP uses randomly paired sentences as negative (not-next) samples
- Positional embeddings are fixed sinusoidal (sin/cos) encodings (see the sketch after this list)
- Custom learning rate scheduler with warm-up
- The model is built completely from scratch (no pretrained weights)
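
A short sketch of the standard fixed sinusoidal positional encodings referenced above, sized for the 64-token sequences and 128-dimensional embeddings from the training setup:

```python
import math
import torch

def sinusoidal_positions(seq_len: int = 64, dim: int = 128) -> torch.Tensor:
    # Standard sin/cos positional encodings, added to the token embeddings.
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)    # (seq_len, 1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))                    # (dim/2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions: cosine
    return pe
```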
