🏷️ SABER: Saudi Semantic Embedding Model (v0.1)
🧩 Summary
SABER-v0.1 (Saudi Arabic BERT Embeddings for Retrieval) is a Saudi-dialect semantic embedding model, fine-tuned from SA-BERT with MultipleNegativesRankingLoss (MNRL) and Matryoshka Representation Learning on a large, high-quality Saudi triplet dataset spanning 21 real-life Saudi domains.
SABER turns a standard Masked Language Model (MLM) into a powerful semantic encoder that captures deep contextual meaning across Najdi, Hijazi, Gulf-influenced, and mixed Saudi dialects.
The model achieves state-of-the-art results on both long-paragraph STS evaluation and triplet margin separation, significantly outperforming strong baselines such as ATM2, GATE, LaBSE, mE5-base, MarBERT, and MiniLM.
🏗️ Architecture & Build Pipeline
SABER is built with a rigorous two-stage optimization pipeline. First, MARBERTv2 is adapted via Masked Language Modeling (MLM) on 500k Saudi sentences to create the domain-specialized SA-BERT. SA-BERT is then deeply optimized for semantics with MultipleNegativesRankingLoss (MNRL) and Matryoshka Representation Learning on curated triplets, producing the final embedding model.
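A minimal sketch of the first stage (domain-adaptive MLM over MARBERTv2); the corpus file name, sequence length, and training hyperparameters here are illustrative assumptions, not the exact recipe:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/MARBERTv2")

# Hypothetical local file holding the ~500k Saudi sentences, one per line.
ds = load_dataset("text", data_files={"train": "saudi_sentences.txt"})["train"]
ds = ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Standard 15% random token masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="sa-bert-v1", num_train_epochs=1,
                         per_device_train_batch_size=32, fp16=True)
Trainer(model=model, args=args, train_dataset=ds, data_collator=collator).train()
```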
SABER is designed for the following use cases (a minimal retrieval sketch follows this list):
- Semantic search
- Retrieval-Augmented Generation (RAG)
- Clustering
- Intent detection
- Semantic similarity
- Document & paragraph embedding
- Ranking and re-ranking systems
- Multi-domain Saudi-language applications
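As a minimal, hedged sketch of the semantic-search use case: the toy corpus and query below are drawn from the dataset samples shown later in this card, and the model ID is taken from the usage section.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Toy corpus of Saudi-dialect sentences.
corpus = [
    "أبي فندق قريب من المطار",                 # "I want a hotel near the airport"
    "ناوي أشتري تذكرة للرياض الأسبوع الجاي",    # "Planning to buy a ticket to Riyadh next week"
    "ودي أعرف عن جولات بحرية للأطفال في جدة",   # "I'd like to know about boat tours for kids in Jeddah"
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query = "عطوني أفضل فندق قريب من مطار جدة"      # "Give me the best hotel near Jeddah airport"
query_emb = model.encode(query, convert_to_tensor=True)

# Retrieve the top-2 most similar corpus sentences by cosine similarity.
for hit in util.semantic_search(query_emb, corpus_emb, top_k=2)[0]:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```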
This release is v0.1 — the first public version of SABER.
📌 Model Details
- Model Name: SABER (Saudi Semantic Embedding)
- Version: v0.1
- Base Model: SA-BERT-V1 (MARBERTv2 adapted to Saudi data via MLM)
- Language: Arabic (Saudi Dialects: Najdi, Hijazi, Gulf)
- Task: Sentence Embeddings, Semantic Similarity, Retrieval
- Training Objective: MNRL + Matryoshka Loss
- Embedding Dimension: 768
- License: CC BY-NC 4.0 (non-commercial; see Commercial Use below)
- Maintainer: Omartificial-Intelligence-Space
🧠 Motivation
Saudi dialect NLP remains an underdeveloped space. Most embeddings struggle with dialectal variation, idiomatic expressions, and multi-sentence reasoning. SABER was designed to fill this gap by:
- Training specifically on Saudi-dialect triplet data.
- Leveraging modern contrastive learning.
- Creating robust embeddings suitable for production and research.
This model is the result of extensive evaluation across STS, triplets, and domain-specific tests.
⚠️ Limitations
- Regional Scope: Performance may degrade on Levantine, Egyptian, or Maghrebi dialects.
- Scope: Embeddings focus on semantic similarity, not syntax or classification.
- Input Length: Long multi-document retrieval requires chunking.
📚 Training Data
SABER was trained on Omartificial-Intelligence-Space/SaudiDialect-Triplet-21, which contains:
- 2,964 triplets (Anchor, Positive, Negative)
- 21 domains, including:
- Travel, Food, Shopping, Work & Office, Education, Culture, Weather, Sports, Technology, Medical, Government, Social Events, Anthropology, etc.
- Mixed Saudi dialect sentences (Najdi + Hijazi + Gulf)
- Real-world conversational phrasing
- Carefully curated positive/negative pairs
The dataset includes natural variations in:
- Word choice
- Dialect morphology
- Sentence structure
- Discourse context
- Multi-sentence reasoning
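A quick way to inspect the data (the split name is an assumption, and note that the auto-generated Training Dataset section below reports text1/text2 pair columns):

```python
from datasets import load_dataset

# Load the Saudi triplet dataset from the Hugging Face Hub.
ds = load_dataset("Omartificial-Intelligence-Space/SaudiDialect-Triplet-21", split="train")
print(len(ds))  # expected: 2964
print(ds[0])    # inspect one example and its column layout
```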
🔧 Training Methodology
SABER was fine-tuned using:
MultipleNegativesRankingLoss (MNRL)
- Transforms the embedding space so similar pairs cluster tightly.
- Each batch uses in-batch negatives, dramatically improving separation (see the formula after this list).
Matryoshka Representation Learning
- Ensures embeddings remain meaningful across different vector truncation sizes.
Triplet Ranking Optimization
- Anchor–Positive similarity maximized.
- Anchor–Negative similarity minimized.
- Margin-based structure preserved.
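For reference, a standard formulation of the in-batch MNRL objective for a batch of $N$ (anchor, positive) pairs, with similarity function $s$ (typically cosine) and a scaling factor $\tau$ (the exact value used in training is not stated here), is:

$$\mathcal{L}_{\mathrm{MNRL}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\big(\tau\, s(a_i, p_i)\big)}{\sum_{j=1}^{N}\exp\big(\tau\, s(a_i, p_j)\big)}$$

Every other positive $p_j$ ($j \neq i$) in the batch acts as a negative for anchor $a_i$; explicit hard negatives, when present, simply extend the denominator.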
Optimizer & Hyperparameters
| Hyperparameter | Value |
|---|---|
| Batch Size | 16 |
| Epochs | 3 |
| Loss | MNLR + Matryoshka |
| Precision | FP16 |
| Negative Sampling | In-batch |
| Gradient Clip | Stable defaults |
| Warmup Ratio | 0.1 |
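Putting the pieces above together, here is a hedged sketch of the stage-2 fine-tuning using the sentence-transformers training API; the backbone path and the dataset's column layout are assumptions, not the exact training script:

```python
from datasets import load_dataset
from sentence_transformers import (SentenceTransformer, SentenceTransformerTrainer,
                                   SentenceTransformerTrainingArguments)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

# Stage-2 starting point: the MLM-adapted SA-BERT backbone (path is an assumption).
model = SentenceTransformer("path/to/SA-BERT-V1")
train_ds = load_dataset("Omartificial-Intelligence-Space/SaudiDialect-Triplet-21", split="train")

# MNRL with in-batch negatives, wrapped in MatryoshkaLoss as in the reported config.
loss = MatryoshkaLoss(model, MultipleNegativesRankingLoss(model),
                      matryoshka_dims=[768], matryoshka_weights=[1])

args = SentenceTransformerTrainingArguments(
    output_dir="saber-v0.1",
    num_train_epochs=3,
    per_device_train_batch_size=16,  # in-batch negatives come from these 16 examples
    warmup_ratio=0.1,
    fp16=True,
)
SentenceTransformerTrainer(model=model, args=args, train_dataset=train_ds, loss=loss).train()
```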
🧪 Evaluation
SABER was evaluated on two benchmarks:
A) STS Evaluation (Saudi Paragraph-Level Dataset)
Dataset: 1,000 Saudi-dialect samples with gold similarity scores on a 0–5 scale.
| Metric | Score |
|---|---|
| Pearson | 0.9189 |
| Spearman | 0.9045 |
| MAE | 1.69 |
| MSE | 3.82 |
These results surpass: ATM2, GATE, LaBSE, MarBERT, mE5-base, and MiniLM.
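For clarity, a hedged sketch of how such STS metrics can be computed; mapping cosine scores onto the 0–5 scale is an assumption of this sketch, not necessarily the original protocol:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def sts_metrics(gold, cos_sims):
    """gold: 0-5 similarity labels; cos_sims: model cosine scores in [-1, 1]."""
    gold = np.asarray(gold, dtype=float)
    pred = (np.asarray(cos_sims, dtype=float) + 1) / 2 * 5  # rescale to [0, 5]
    return {
        "pearson": pearsonr(gold, pred)[0],    # invariant to the linear rescaling
        "spearman": spearmanr(gold, pred)[0],  # rank-based, also invariant
        "mae": float(np.abs(gold - pred).mean()),
        "mse": float(((gold - pred) ** 2).mean()),
    }
```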
B) Triplet Evaluation
Triplets were derived from the STS data by treating pairs with score ≥ 3 as positives and pairs with score ≤ 1 as negatives.
| Metric | Score |
|---|---|
| Basic Accuracy | 0.9899 |
| Margin > 0.05 | 0.9845 |
| Margin > 0.10 | 0.9781 |
| Margin > 0.20 | 0.9609 |
SABER maintains excellent separation even at the strictest margin thresholds.
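A hedged sketch of how these triplet metrics can be computed; function and variable names are illustrative, not the exact evaluation script:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

def triplet_metrics(anchors, positives, negatives, margins=(0.05, 0.10, 0.20)):
    # With L2-normalized embeddings, the row-wise dot product equals cosine similarity.
    ea = model.encode(anchors, normalize_embeddings=True)
    ep = model.encode(positives, normalize_embeddings=True)
    en = model.encode(negatives, normalize_embeddings=True)
    diff = (ea * ep).sum(axis=1) - (ea * en).sum(axis=1)
    results = {"basic_accuracy": float((diff > 0).mean())}
    for m in margins:
        results[f"margin>{m:.2f}"] = float((diff > m).mean())
    return results
```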
🔍 Usage Example
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the model
model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Define sentences (Saudi dialect)
s1 = "ودي أسافر للرياض الأسبوع الجاي"        # "I'd like to travel to Riyadh next week"
s2 = "أفكر أروح الرياض قريب عشان مشوار مهم"  # "I'm thinking of going to Riyadh soon for an important errand"

# Encode both sentences
e1 = model.encode([s1])
e2 = model.encode([s2])

# Calculate cosine similarity
sim = cosine_similarity(e1, e2)[0][0]
print("Cosine Similarity:", sim)
```
Training Details
Training Dataset
- Dataset: csv
- Size: 2,964 training samples
- Columns: text1 and text2
- Approximate statistics based on the first 1000 samples:

| | text1 | text2 |
|---|---|---|
| type | string | string |
| details | min: 5 tokens, mean: 10.36 tokens, max: 22 tokens | min: 4 tokens, mean: 10.28 tokens, max: 19 tokens |

- Samples:

| text1 | text2 |
|---|---|
| هل فيه رحلات بحرية للأطفال في جدة؟ | ودي أعرف عن جولات بحرية للأطفال في جدة |
| ودي أحجز تذكرة طيران للرياض الأسبوع الجاي | ناوي أشتري تذكرة للرياض الأسبوع الجاي |
| عطوني أفضل فندق قريب من مطار جدة | أبي فندق قريب من المطار |

- Loss: MatryoshkaLoss with these parameters:

```json
{
  "loss": "MultipleNegativesRankingLoss",
  "matryoshka_dims": [768],
  "matryoshka_weights": [1],
  "n_dims_per_step": -1
}
```
📌 Commercial Use
Commercial use of this model is not permitted under the CC BY-NC 4.0 license.
For commercial licensing, partnerships, or enterprise use, please contact the maintainer (Omartificial-Intelligence-Space).
Citation
If you use this model in academic work, please cite:
```bibtex
@inproceedings{nacar-saber-2025,
  title  = "Saudi Arabic Embedding Model for Semantic Similarity and Retrieval",
  author = "Nacar, Omer",
  year   = "2025",
  url    = "https://huggingface.co/Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B",
}
```
Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title     = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author    = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month     = "11",
  year      = "2019",
  publisher = "Association for Computational Linguistics",
  url       = "https://arxiv.org/abs/1908.10084",
}
```

MatryoshkaLoss

```bibtex
@misc{kusupati2024matryoshka,
  title         = {Matryoshka Representation Learning},
  author        = {Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
  year          = {2024},
  eprint        = {2205.13147},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG}
}
```

MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
  title         = {Efficient Natural Language Response Suggestion for Smart Reply},
  author        = {Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
  year          = {2017},
  eprint        = {1705.00652},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
```