📄 Technical Documentation
A full technical report describing the training pipeline, data curation, and evaluations is available here:
👉 Click to view/download the PDF
Small Models, Big Impact: Fine-tuning Lightweight LLMs for Sinhala
Model Description
This model is a fine-tuned version of the Qwen-2.5-1.5B-Instruct large language model, tailored specifically for Sinhala, a low-resource language underrepresented in mainstream NLP ecosystems. The goal is to deliver Sinhala language modeling capabilities in a lightweight model that can run efficiently on constrained hardware such as consumer GPUs or edge devices.
Our development process followed a two-phase pipeline:
- LoRA-based fine-tuning on the earlier Qwen-1.5-1.8B model to prototype efficiently on CPU.
- Full parameter fine-tuning on the Qwen-2.5-1.5B-Instruct checkpoint to maximize language understanding and generation performance.
Key Results
- Perplexity reduced from 4.6 → 3.05 on Sinhala Wikipedia.
- Improved generation quality and contextual reasoning in Sinhala.
Intended Use
- Sinhala language modeling and generation.
- NLP tasks such as summarization, translation, reasoning, and classification.
- Designed for resource-constrained environments like mobile devices, offline applications, or local inference systems.
Training Data
- Training corpus: Cleaned Sinhala Wikipedia dump (as of mid-2024); an illustrative loading and cleaning sketch follows this list.
- Evaluation corpus: A filtered Sinhala subset of the Oscar dataset.
- Planned benchmarks: MMLU for reasoning, BLEU for translation, and F1 for classification tasks.
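For illustration only, the sketch below loads a Sinhala Wikipedia snapshot with the datasets library and applies a simple cleaning filter. The dataset ID and snapshot date are assumptions (the corpus actually used was a mid-2024 dump), and the filter thresholds are arbitrary placeholders, not the curation rules used for training.
from datasets import load_dataset

# Assumed dataset ID and snapshot; the actual training corpus used a different (mid-2024) dump.
wiki_si = load_dataset("wikimedia/wikipedia", "20231101.si", split="train")

def is_clean(example, min_chars=200, min_sinhala_ratio=0.5):
    # Drop very short pages and pages that are mostly non-Sinhala script (placeholder thresholds).
    text = example["text"]
    sinhala_chars = sum("\u0d80" <= ch <= "\u0dff" for ch in text)
    return len(text) >= min_chars and sinhala_chars / max(len(text), 1) >= min_sinhala_ratio

wiki_si = wiki_si.filter(is_clean)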
Training Details
Tokenizer
- Customized Hugging Face tokenizer adapted from Qwen's tokenizer to include Sinhala-specific characters and normalization patterns.
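As a rough illustration of this kind of adaptation, the sketch below extends the base Qwen tokenizer with additional tokens and resizes the embedding matrix to match. The specific tokens shown are hypothetical placeholders, not the actual vocabulary additions used for this model, and the base repo ID is an assumption.
from transformers import AutoTokenizer, AutoModelForCausalLM

# Start from the base Qwen tokenizer (assumed base repo ID).
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct", trust_remote_code=True)

# Hypothetical Sinhala tokens that the base vocabulary may split inefficiently.
new_tokens = ["සිංහල", "ලංකාව"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new token ids have embeddings to train.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct", trust_remote_code=True)
model.resize_token_embeddings(len(tokenizer))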
Phase 1: LoRA Fine-Tuning
- Platform: CPU (multi-core AMD Ryzen 9) with WSL
- Batch size: 4
- Training time: 7–8 days
- Optimizer: AdamW
- Learning rate: 2e-4 with linear scheduler and 1000 warmup steps
- Precision: FP32 (due to CPU limitations)
- LoRA configuration: r=8, alpha=16, dropout=0.1
- Target modules: q_proj, v_proj
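The LoRA setup listed above maps roughly onto the following PEFT configuration. This is a reconstruction from the listed hyperparameters, not the actual training script; the base repo ID and task_type are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Phase 1 base model (assumed repo ID for the Qwen-1.5-1.8B checkpoint).
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-1.8B", trust_remote_code=True)

# LoRA hyperparameters as listed above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable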
Phase 2: Full Fine-Tuning
- Platform: GPU (NVIDIA RTX 5090, 24GB VRAM)
- Batch size: 16 (gradient accumulation steps: 2)
- Epochs: 3
- Max sequence length: 2048
- Optimizer: AdamW (betas=(0.9, 0.98), eps=1e-8)
- Learning rate: 5e-5 with linear scheduler and 500 warmup steps
- Precision: bfloat16 (via mixed precision)
- Training duration: 14–18 hours
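For reference, a condensed sketch of how the Phase 2 settings above map onto Hugging Face TrainingArguments. Values not listed above (output directory, logging cadence) are placeholders, and the maximum sequence length of 2048 would be enforced at tokenization time rather than here.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2.5-1.5b-sinhala-full-ft",  # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-8,
    bf16=True,          # bfloat16 mixed precision
    logging_steps=50,   # placeholder logging cadence
)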
Environment
- Python 3.12
- Anaconda (virtual environment)
- WSL 2 on Windows 11
- AMD Ryzen 9 CPU, 64GB RAM, RTX 5090
Model Loading
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("iCIIT/general-purpose-model-ris-sinhala-qwen2.5-1.5b-cp")
# Use model for inference
input_text = "මෙය සිංහල භාෂාවෙන් ලියවූ වාක්යයකි"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Performance
- Final perplexity: 3.05 on Sinhala Wikipedia
- Relative improvement: ~33% over initial LoRA checkpoint (4.6)
- Qualitative outputs show improved grammaticality, coherence, and domain-specific fluency.
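Perplexity here is the exponential of the mean cross-entropy loss on held-out Sinhala text. A minimal sketch of that computation (simple per-document averaging rather than a token-weighted sliding window) is shown below; it is illustrative, not the exact evaluation script.
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("iCIIT/general-purpose-model-ris-sinhala-qwen2.5-1.5b-cp")
model.eval()

def perplexity(texts, max_length=2048):
    # Average the language-modeling loss over each text, then exponentiate.
    losses = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
        with torch.no_grad():
            # Passing labels=input_ids makes the model return its cross-entropy loss.
            out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

# Replace with the actual evaluation texts; a single sentence is used here for illustration.
print(perplexity(["මෙය සිංහල භාෂාවෙන් ලියවූ වාක්යයකි"]))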
Limitations
- Training on Colab/Kaggle was extremely slow; local GPU usage was critical.
- Model quantization and ONNX export are still in progress.
- Lack of diverse Sinhala evaluation benchmarks limits rigorous comparison.
- Code-mixed Sinhala-English performance not yet tested.
Future Work
- Evaluate with MMLU, BLEU, and F1 metrics on downstream tasks.
- Quantize model to INT8 or 4-bit using tools like BitsAndBytes or ONNX Runtime (see the sketch after this list).
- Export to ONNX and GGUF for fast inference.
- Extend to code-mixed Sinhala-English datasets.
- Distill into smaller variants (300M–500M) for even leaner deployment.
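As a starting point for the quantization item above, a 4-bit load through the standard transformers/bitsandbytes integration could look like the sketch below. This is an assumption, not a released or tested quantized artifact for this checkpoint, and it requires a CUDA GPU with the bitsandbytes package installed.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization with bfloat16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "iCIIT/general-purpose-model-ris-sinhala-qwen2.5-1.5b-cp",
    quantization_config=bnb_config,
    device_map="auto",
)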
Contributors
- Johan Sofalas
- Inuka Gajanayaka
- Gagani Kulathilaka
- Mithila Coomaraswamy
- Azra Safrullah