Urdu-BERT Pretraining (PyTorch)

I implemented a BERT model from scratch in PyTorch and pretrained it on Urdu data. It uses the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives, just like the original BERT.
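
For context, the sketch below shows how MLM masking and NSP pair construction are typically wired up for BERT-style pretraining; the special-token ids and helper names are illustrative placeholders, not the repo's actual API.

```python
import random
import torch

# Assumed special-token ids; the real ids come from the trained tokenizer.
PAD_ID, CLS_ID, SEP_ID, MASK_ID = 0, 2, 3, 4

def make_nsp_pair(sentences, i):
    """Return (sentence_a, sentence_b, is_next): 50% true next sentence, 50% random."""
    if random.random() < 0.5 and i + 1 < len(sentences):
        return sentences[i], sentences[i + 1], 1
    return sentences[i], random.choice(sentences), 0

def mask_tokens(token_ids, vocab_size, mask_prob=0.15):
    """BERT-style MLM masking: 80% [MASK], 10% random token, 10% unchanged."""
    ids = token_ids.clone()
    labels = torch.full_like(ids, -100)   # -100 positions are ignored by CrossEntropyLoss
    for pos in range(ids.size(0)):
        if ids[pos].item() in (PAD_ID, CLS_ID, SEP_ID):
            continue
        if random.random() < mask_prob:
            labels[pos] = ids[pos]
            r = random.random()
            if r < 0.8:
                ids[pos] = MASK_ID
            elif r < 0.9:
                ids[pos] = random.randint(5, vocab_size - 1)
            # else: keep the original token as the input
    return ids, labels
```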


✅ Features

  • Trained on Urdu-1M-news-text
  • Custom WordPiece tokenizer (training sketch after this list)
  • Multi-head attention & transformer encoder blocks
  • NSP and MLM heads
  • Uses PyTorch and HuggingFace Tokenizers
  • Training tracked with Weights & Biases (WandB)
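
A minimal sketch of how the custom WordPiece tokenizer can be trained with the HuggingFace Tokenizers library; the corpus file name, vocabulary size, and output path are placeholders.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# WordPiece model with BERT-style special tokens.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFKC()              # normalize Urdu Unicode forms
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=30_000,                                 # placeholder vocabulary size
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["urdu_news.txt"], trainer=trainer)  # placeholder corpus file
tokenizer.save("urdu-wordpiece.json")
```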

βš™οΈ Training Setup

| Setting | Value |
| --- | --- |
| Epochs | 20 |
| Batch Size | 64 |
| Sequence Length | 64 tokens |
| Embedding Size | 128 |
| Encoder Layers | 2 |
| Attention Heads | 2 |
| Max LR | 2.5e-5 |
| Warmup Steps | 1000 |
| Optimizer | Adam |
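
The Max LR and Warmup Steps above feed a custom warm-up schedule (see Notes). Below is a minimal sketch of one common formulation, linear warm-up followed by linear decay, attached to Adam via LambdaLR; `total_steps` is a placeholder and the repo's actual schedule may differ.

```python
import torch

model = torch.nn.Linear(128, 128)         # stand-in for the BERT model
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-5)

warmup_steps, total_steps = 1000, 50_000  # total_steps is a placeholder

def lr_lambda(step):
    # Scale factor on the max LR: ramp up linearly, then decay linearly to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop, call optimizer.step() and then scheduler.step() each batch.
```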

📊 Training Loss (WandB)

Below is an example loss curve from training:

[WandB loss graph]

The graph shows the MLM loss, NSP loss, and total loss decreasing over time.

📌 Notes

  • NSP uses randomly paired sentences as negative samples
  • Positional embeddings are created with sinusoidal (sin/cos) encodings (see the sketch after this list)
  • Custom learning rate scheduler with warm-up
  • Model is built completely from scratch (no pretrained weights)
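
The sin/cos positional embeddings mentioned in the notes follow the standard transformer formulation; here is a minimal sketch using the sequence length and embedding size from the training setup.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int = 64, d_model: int = 128) -> torch.Tensor:
    """Fixed positional encoding: sin on even dimensions, cos on odd dimensions."""
    position = torch.arange(seq_len).unsqueeze(1)                 # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                                                     # added to the token embeddings

pos_emb = sinusoidal_positional_encoding()                        # shape: (64, 128)
```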
