📄 Technical Documentation
A full technical report describing the training pipeline, data curation, and evaluations is available here:
👉 Click to view/download the PDF
Small Models, Big Impact: Fine-tuning Lightweight LLMs for Sinhala
Model Description
This model is a fine-tuned version of the Qwen-2.5-1.5B-Instruct large language model, tailored specifically for Sinhala, a low-resource language underrepresented in mainstream NLP ecosystems. The goal is to deliver Sinhala language modeling capabilities in a lightweight model that can run efficiently on constrained hardware such as consumer GPUs or edge devices.
Our development process followed a two-phase pipeline:
- LoRA-based fine-tuning on the earlier Qwen-1.5-1.8B model to prototype efficiently on CPU.
- Full parameter fine-tuning on the Qwen-2.5-1.5B-Instruct checkpoint to maximize language understanding and generation performance.
Key Results
- Perplexity reduced from 4.6 → 3.05 on Sinhala Wikipedia.
- Improved generation quality and contextual reasoning in Sinhala.
Intended Use
- Sinhala language modeling and generation.
- NLP tasks such as summarization, translation, reasoning, and classification.
- Designed for resource-constrained environments like mobile devices, offline applications, or local inference systems.
Training Data
- Training corpus: Cleaned Sinhala Wikipedia dump (as of mid-2024); an illustrative loading and cleaning sketch follows this list.
- Evaluation corpus: A filtered Sinhala subset of the Oscar dataset.
- Planned benchmarks: MMLU for reasoning, BLEU for translation, and F1 for classification tasks.
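For illustration only, the sketch below loads a Sinhala Wikipedia snapshot with the datasets library and applies a simple cleaning filter. The dataset ID and snapshot date are assumptions (the corpus actually used was a mid-2024 dump), and the filter thresholds are arbitrary placeholders, not the curation rules used for training.
from datasets import load_dataset

# Assumed dataset ID and snapshot; the actual training corpus used a different (mid-2024) dump.
wiki_si = load_dataset("wikimedia/wikipedia", "20231101.si", split="train")

def is_clean(example, min_chars=200, min_sinhala_ratio=0.5):
    # Drop very short pages and pages that are mostly non-Sinhala script (placeholder thresholds).
    text = example["text"]
    sinhala_chars = sum("\u0d80" <= ch <= "\u0dff" for ch in text)
    return len(text) >= min_chars and sinhala_chars / max(len(text), 1) >= min_sinhala_ratio

wiki_si = wiki_si.filter(is_clean)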
Training Details
Tokenizer
- Customized Hugging Face tokenizer adapted from Qwen's tokenizer to include Sinhala-specific characters and normalization patterns.
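As a rough illustration of this kind of adaptation, the sketch below extends the base Qwen tokenizer with additional tokens and resizes the embedding matrix to match. The specific tokens shown are hypothetical placeholders, not the actual vocabulary additions used for this model, and the base repo ID is an assumption.
from transformers import AutoTokenizer, AutoModelForCausalLM

# Start from the base Qwen tokenizer (assumed base repo ID).
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct", trust_remote_code=True)

# Hypothetical Sinhala tokens that the base vocabulary may split inefficiently.
new_tokens = ["සිංහල", "ලංකාව"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new token ids have embeddings to train.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct", trust_remote_code=True)
model.resize_token_embeddings(len(tokenizer))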
Phase 1: LoRA Fine-Tuning
- Platform: CPU (multi-core AMD Ryzen 9) with WSL
- Batch size: 4
- Training time: 7–8 days
- Optimizer: AdamW
- Learning rate: 2e-4 with linear scheduler and 1000 warmup steps
- Precision: FP32 (due to CPU limitations)
- LoRA configuration: r=8, alpha=16, dropout=0.1
- Target modules: q_proj, v_proj
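The LoRA setup listed above maps roughly onto the following PEFT configuration. This is a reconstruction from the listed hyperparameters, not the actual training script; the base repo ID and task_type are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Phase 1 base model (assumed repo ID for the Qwen-1.5-1.8B checkpoint).
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-1.8B", trust_remote_code=True)

# LoRA hyperparameters as listed above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable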
Phase 2: Full Fine-Tuning
- Platform: GPU (NVIDIA RTX 5090, 24GB VRAM)
- Batch size: 16 (gradient accumulation steps: 2)
- Epochs: 3
- Max sequence length: 2048
- Optimizer: AdamW (betas=(0.9, 0.98), eps=1e-8)
- Learning rate: 5e-5 with linear scheduler and 500 warmup steps
- Precision: bfloat16 (via mixed precision)
- Training duration: 14–18 hours
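For reference, a condensed sketch of how the Phase 2 settings above map onto Hugging Face TrainingArguments. Values not listed above (output directory, logging cadence) are placeholders, and the maximum sequence length of 2048 would be enforced at tokenization time rather than here.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2.5-1.5b-sinhala-full-ft",  # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-8,
    bf16=True,          # bfloat16 mixed precision
    logging_steps=50,   # placeholder logging cadence
)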
Environment
- Python 3.12
- Anaconda (virtual environment)
- WSL 2 on Windows 11
- AMD Ryzen 9 CPU, 64GB RAM, RTX 5090
Model Loading
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("iCIIT/general-purpose-model-ris-sinhala-qwen2.5-1.5b-cp")
# Use model for inference
input_text = "මෙය සිංහල භාෂාවෙන් ලියවූ වාක්යයකි"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Performance
- Final perplexity: 3.05 on Sinhala Wikipedia
- Relative improvement: ~33% over initial LoRA checkpoint (4.6)
- Qualitative outputs show improved grammaticality, coherence, and domain-specific fluency.
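Perplexity here is the exponential of the mean cross-entropy loss on held-out Sinhala text. A minimal sketch of that computation (simple per-document averaging rather than a token-weighted sliding window) is shown below; it is illustrative, not the exact evaluation script.
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("iCIIT/general-purpose-model-ris-sinhala-qwen2.5-1.5b-cp")
model.eval()

def perplexity(texts, max_length=2048):
    # Average the language-modeling loss over each text, then exponentiate.
    losses = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
        with torch.no_grad():
            # Passing labels=input_ids makes the model return its cross-entropy loss.
            out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

# Replace with the actual evaluation texts; a single sentence is used here for illustration.
print(perplexity(["මෙය සිංහල භාෂාවෙන් ලියවූ වාක්යයකි"]))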
Limitations
- Training on Colab/Kaggle was extremely slow; local GPU usage was critical.
- Model quantization and ONNX export are still in progress.
- Lack of diverse Sinhala evaluation benchmarks limits rigorous comparison.
- Code-mixed Sinhala-English performance not yet tested.
Future Work
- Evaluate with MMLU, BLEU, and F1 metrics on downstream tasks.
- Quantize model to INT8 or 4-bit using tools like BitsAndBytes or ONNX Runtime (see the sketch after this list).
- Export to ONNX and GGUF for fast inference.
- Extend to code-mixed Sinhala-English datasets.
- Distill into smaller variants (300M–500M) for even leaner deployment.
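As a starting point for the quantization item above, a 4-bit load through the standard transformers/bitsandbytes integration could look like the sketch below. This is an assumption, not a released or tested quantized artifact for this checkpoint, and it requires a CUDA GPU with the bitsandbytes package installed.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization with bfloat16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "iCIIT/general-purpose-model-ris-sinhala-qwen2.5-1.5b-cp",
    quantization_config=bnb_config,
    device_map="auto",
)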
Contributors
- Johan Sofalas
- Inuka Gajanayaka
- Gagani Kulathilaka
- Mithila Coomaraswamy
- Azra Safrullah