📄 Technical Documentation

A full technical report describing the training pipeline, data curation, and evaluations is available here:

👉 Click to view/download the PDF

Small Models, Big Impact: Fine-tuning Lightweight LLMs for Sinhala

Model Description

This model is a fine-tuned version of the Qwen-2.5-1.5B-Instruct large language model, tailored specifically for Sinhala, a low-resource language underrepresented in mainstream NLP ecosystems. The goal is to deliver Sinhala language modeling capabilities in a lightweight model that runs efficiently on constrained hardware such as consumer GPUs or edge devices.

Our development process followed a two-phase pipeline:

  1. LoRA-based fine-tuning on the earlier Qwen-1.5-1.8B model to prototype efficiently on CPU.
  2. Full parameter fine-tuning on the Qwen-2.5-1.5B-Instruct checkpoint to maximize language understanding and generation performance.

Key Results

  • Perplexity reduced from 4.6 → 3.05 on Sinhala Wikipedia.
  • Improved generation quality and contextual reasoning in Sinhala.

Intended Use

  • Sinhala language modeling and generation.
  • NLP tasks such as summarization, translation, reasoning, and classification.
  • Designed for resource-constrained environments like mobile devices, offline applications, or local inference systems.

Training Data

  • Training corpus: Cleaned Sinhala Wikipedia dump (as of mid-2024).
  • Evaluation corpus: A filtered Sinhala subset of the Oscar dataset.
  • Planned benchmarks: MMLU for reasoning, BLEU for translation, and F1 for classification tasks.
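
As a minimal sketch, the snippet below shows one way such a corpus could be loaded and lightly cleaned with the datasets library. The wikimedia/wikipedia snapshot name is a placeholder rather than the exact mid-2024 dump used for training, and the cleaning pass is illustrative only.

from datasets import load_dataset

# Sinhala Wikipedia articles; the snapshot name is a placeholder, not the exact dump used here
wiki_si = load_dataset("wikimedia/wikipedia", "20231101.si", split="train")

def clean(example):
    # Strip stray whitespace and drop blank lines inside each article
    lines = (line.strip() for line in example["text"].splitlines())
    return {"text": "\n".join(line for line in lines if line)}

wiki_si = wiki_si.map(clean)
wiki_si = wiki_si.filter(lambda ex: len(ex["text"]) > 200)  # discard very short stubs
print(wiki_si)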

Training Details

Tokenizer

  • Customized Hugging Face tokenizer adapted from Qwen's tokenizer to include Sinhala-specific characters and normalization patterns.
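
To illustrate the adaptation step, the sketch below shows one way extra Sinhala tokens could be registered on top of the Qwen tokenizer, with the model embeddings resized to match. The token list is a hypothetical subset; the actual characters and normalization patterns are described in the technical report.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# Hypothetical subset of Sinhala characters/signs to register as tokens
sinhala_tokens = ["ක", "ග", "ස", "හ", "්", "ා", "ි", "ු"]
num_added = tokenizer.add_tokens(sinhala_tokens)

if num_added > 0:
    # Grow the embedding matrix (and tied LM head) to cover the new token ids
    model.resize_token_embeddings(len(tokenizer))

print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")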

Phase 1: LoRA Fine-Tuning

  • Platform: CPU (multi-core AMD Ryzen 9) with WSL.
  • Batch size: 4
  • Training time: 7–8 days
  • Optimizer: AdamW
  • Learning rate: 2e-4 with linear scheduler and 1000 warmup steps.
  • Precision: FP32 (due to CPU limitations)
  • LoRA configuration:
    • r=8
    • alpha=16
    • dropout=0.1
    • Target modules: q_proj, v_proj
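
The configuration listed above maps onto the peft library roughly as follows. This is a sketch only, assuming the Qwen/Qwen1.5-1.8B base checkpoint; dataset wiring and the training loop are omitted.

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Phase 1 prototyping base model, loaded in FP32 for CPU training
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-1.8B", torch_dtype=torch.float32)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are updated during training

With this setup only the low-rank adapters on q_proj and v_proj are trainable, which is what keeps CPU prototyping feasible.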

Phase 2: Full Fine-Tuning

  • Platform: GPU (NVIDIA RTX 5090, 24GB VRAM)
  • Batch size: 16 (gradient accumulation steps: 2)
  • Epochs: 3
  • Max sequence length: 2048
  • Optimizer: AdamW (betas=(0.9, 0.98), eps=1e-8)
  • Learning rate: 5e-5 with linear scheduler and 500 warmup steps
  • Precision: bfloat16 (via mixed precision)
  • Training duration: 14–18 hours
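
The hyperparameters above translate into Hugging Face TrainingArguments roughly as sketched below, assuming the standard Trainer API. The output path and logging settings are placeholders; only the numeric values come from this card.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2.5-1.5b-sinhala-full-ft",  # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-8,
    bf16=True,                                  # mixed-precision bfloat16
    logging_steps=50,
    save_strategy="epoch",
)

The 2048-token maximum sequence length is enforced when tokenizing the training data rather than through TrainingArguments.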

Environment

  • Python 3.12
  • Anaconda (virtual environment)
  • WSL 2 on Windows 11
  • AMD Ryzen 9 CPU, 64GB RAM, RTX 5090

Model Loading

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model together with its tokenizer from the same repository
repo_id = "iCIIT/general-purpose-model-ris-sinhala-qwen2.5-1.5b-cp"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

# Generate a continuation for a Sinhala prompt
input_text = "මෙය සිංහල භාෂාවෙන් ලියවූ වාක්‍යයකි"  # "This is a sentence written in Sinhala."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Performance

  • Final perplexity: 3.05 on Sinhala Wikipedia
  • Relative improvement: ~33% over initial LoRA checkpoint (4.6)
  • Qualitative outputs show improved grammaticality, coherence, and domain-specific fluency.
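
A sketch of how perplexity on held-out Sinhala text can be computed with this checkpoint is shown below. It uses simple non-overlapping 2048-token windows; the exact protocol behind the reported 3.05 figure may differ.

import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "iCIIT/general-purpose-model-ris-sinhala-qwen2.5-1.5b-cp"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16).eval()

def perplexity(text, window=2048):
    # Average negative log-likelihood over non-overlapping windows, then exponentiate
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    total_nll, total_tokens = 0.0, 0
    for start in range(0, ids.size(0), window):
        chunk = ids[start:start + window].unsqueeze(0)
        if chunk.size(1) < 2:
            continue  # a single token yields no prediction targets
        with torch.no_grad():
            loss = model(chunk, labels=chunk).loss  # mean NLL per predicted token
        total_nll += loss.item() * (chunk.size(1) - 1)
        total_tokens += chunk.size(1) - 1
    return math.exp(total_nll / total_tokens)

sample_text = "..."  # replace with held-out Sinhala Wikipedia text
# print(perplexity(sample_text))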

Limitations

  • Training on Colab/Kaggle was extremely slow; local GPU usage was critical.
  • Model quantization and ONNX export still in progress.
  • Lack of diverse Sinhala evaluation benchmarks limits rigorous comparison.
  • Code-mixed Sinhala-English performance not yet tested.

Future Work

  • Evaluate with MMLU, BLEU, and F1 metrics on downstream tasks.
  • Quantize model to INT8 or 4-bit using tools like BitsAndBytes or ONNX Runtime.
  • Export to ONNX and GGUF for fast inference.
  • Extend to code-mixed Sinhala-English datasets.
  • Distill into smaller variants (300M–500M) for even leaner deployment.
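
As one example of the planned quantization path, the snippet below sketches 4-bit loading through bitsandbytes and transformers. This is illustrative only; the released checkpoint does not yet ship quantized weights.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

repo_id = "iCIIT/general-purpose-model-ris-sinhala-qwen2.5-1.5b-cp"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, quantization_config=bnb_config, device_map="auto")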

Contributors

  • Johan Sofalas
  • Inuka Gajanayaka
  • Gagani Kulathilaka
  • Mithila Coomaraswamy
  • Azra Safrullah
