---
license: mit
language:
- ur
---

# Urdu-BERT Pretraining (PyTorch)

I implemented a BERT model from scratch in PyTorch and pretrained it on Urdu data. It is trained with the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives, just like the original BERT model.

---

## ✅ Features

- Trained on [Urdu-1M-news-text](https://huggingface.co/datasets/El-chapoo/Urdu-1M-news-text)
- Custom WordPiece tokenizer (sketched below)
- Multi-head attention & transformer encoder blocks
- MLM and NSP heads (sketched below)
- Uses PyTorch and HuggingFace Tokenizers
- Training tracked with Weights & Biases (WandB)

---

## ⚙️ Training Setup

| Setting         | Value     |
|-----------------|-----------|
| Epochs          | 20        |
| Batch Size      | 64        |
| Sequence Length | 64 tokens |
| Embedding Size  | 128       |
| Encoder Layers  | 2         |
| Attention Heads | 2         |
| Max LR          | 2.5e-5    |
| Warmup Steps    | 1000      |
| Optimizer       | Adam      |

---

## 📊 Training Loss (WandB)

Below is the loss curve logged during training:

![WandB Loss Graph](loss.png)

> This shows the MLM loss, NSP loss, and total loss decreasing over time.

## 📌 Notes

- NSP uses randomly paired sentences as negative samples (sketched below)
- Positional embeddings are created with fixed sin/cos functions (sketched below)
- Custom learning rate scheduler with warm-up (sketched below)
- The model is built completely from scratch (no pretrained weights)

## 📚 References

- [BERT from Scratch (Medium)](https://medium.com/data-and-beyond/complete-guide-to-building-bert-model-from-sratch-3e6562228891)
- [PyTorch BERT Tutorial](https://ai.plainenglish.io/bert-pytorch-implementation-prepare-dataset-part-1-efd259113e5a)
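
## 🧩 Example Sketches

The sketches below illustrate the main components described above. They are simplified, assumption-level examples, not the exact code used to train this model.

**WordPiece tokenizer.** A minimal sketch of training a WordPiece tokenizer with the HuggingFace `tokenizers` library. The corpus file name, vocabulary size, and normalizer choice are assumptions, not the exact settings used for this model.

```python
# Sketch: train a WordPiece tokenizer with HuggingFace Tokenizers.
# Corpus path, vocab size, and normalizer are assumptions.
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFKC()              # normalize Urdu Unicode forms
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # split on whitespace/punctuation

trainer = trainers.WordPieceTrainer(
    vocab_size=30_000,  # assumed value
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["urdu_corpus.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("urdu-wordpiece.json")
```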
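
**NSP pairs and MLM masking.** The notes above say NSP uses randomly paired sentences as negative samples; the standard BERT recipe additionally masks ~15% of tokens for MLM (80% replaced by `[MASK]`, 10% by a random token, 10% left unchanged). The sketch below follows that standard recipe; the token IDs, vocabulary size, and helper names are illustrative, and real code should also exclude special tokens from masking.

```python
# Sketch: NSP pair sampling and BERT-style MLM masking (15% / 80-10-10 rule).
# MASK_ID and VOCAB_SIZE are assumed placeholder values.
import random
import torch

MASK_ID, VOCAB_SIZE, MLM_PROB = 4, 30_000, 0.15

def make_nsp_pair(sentences, idx):
    """Return (sent_a, sent_b, is_next): 50% true next sentence, 50% random pair."""
    sent_a = sentences[idx]
    if random.random() < 0.5 and idx + 1 < len(sentences):
        return sent_a, sentences[idx + 1], 1       # true next sentence
    return sent_a, random.choice(sentences), 0     # random (negative) pair

def mask_tokens(input_ids: torch.Tensor):
    """Mask ~15% of tokens; labels are -100 (ignored by the loss) elsewhere."""
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(input_ids.shape, MLM_PROB)).bool()
    labels[~masked] = -100                         # loss only on masked positions
    replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replace] = MASK_ID                   # 80% -> [MASK]
    randomize = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replace
    input_ids[randomize] = torch.randint(VOCAB_SIZE, input_ids.shape)[randomize]  # 10% -> random token
    return input_ids, labels                       # remaining 10% left unchanged
```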
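
**Sinusoidal positional embeddings.** As noted above, positions are encoded with fixed sin/cos functions, following "Attention Is All You Need": `PE(pos, 2i) = sin(pos / 10000^(2i/d_model))` and `PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))`. The sequence length and embedding size below match the training table.

```python
# Sketch: fixed sinusoidal positional embeddings (seq_len and d_model from the table).
import math
import torch

def sinusoidal_positional_embedding(seq_len: int = 64, d_model: int = 128) -> torch.Tensor:
    position = torch.arange(seq_len).unsqueeze(1).float()          # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe                                      # added to token + segment embeddings
```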
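
**Encoder with MLM and NSP heads.** A minimal sketch of a BERT-style model with the two pretraining heads. For brevity it uses `torch.nn.TransformerEncoder` instead of the from-scratch attention blocks described above, reuses `sinusoidal_positional_embedding` from the previous sketch, and takes its sizes from the training table; treat it as an illustration of the architecture, not this repo's exact module layout.

```python
# Sketch: BERT-style encoder with MLM and NSP heads.
# Uses torch.nn.TransformerEncoder for brevity; hyperparameters follow the table above.
import torch
from torch import nn

class TinyBert(nn.Module):
    def __init__(self, vocab_size=30_000, d_model=128, n_layers=2, n_heads=2, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.seg_emb = nn.Embedding(2, d_model)                  # sentence A / B segments
        # fixed sin/cos positions, from the previous sketch
        self.register_buffer("pos_emb", sinusoidal_positional_embedding(max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)           # predicts masked tokens
        self.nsp_head = nn.Linear(d_model, 2)                    # is-next / not-next

    def forward(self, input_ids, segment_ids):
        x = self.tok_emb(input_ids) + self.seg_emb(segment_ids) + self.pos_emb[: input_ids.size(1)]
        h = self.encoder(x)
        return self.mlm_head(h), self.nsp_head(h[:, 0])          # NSP read from the [CLS] position
```

The total training loss is the sum of the MLM cross-entropy (computed only on masked positions, e.g. via `ignore_index=-100`) and the NSP cross-entropy, which corresponds to the MLM, NSP, and total loss curves logged to WandB.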
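
**Warm-up learning-rate scheduler.** A sketch of a schedule that ramps the learning rate linearly to its peak over the warm-up steps and then decays it. The warm-up length (1000 steps) and peak LR (2.5e-5) come from the table above; the inverse-square-root decay after warm-up is an assumption about the custom scheduler, not its confirmed shape.

```python
# Sketch: linear warm-up to the peak LR, then inverse-square-root decay (assumed shape).
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

def warmup_lr_lambda(step: int, warmup_steps: int = 1000) -> float:
    step = max(step, 1)
    if step < warmup_steps:
        return step / warmup_steps                 # linear warm-up
    return (warmup_steps / step) ** 0.5            # decay after warm-up (assumed)

model = TinyBert()                                  # model from the previous sketch
optimizer = Adam(model.parameters(), lr=2.5e-5)     # peak LR from the table
scheduler = LambdaLR(optimizer, lr_lambda=warmup_lr_lambda)

# In the training loop: call optimizer.step() and then scheduler.step() each batch.
```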