Turkish Sentence Encoder
A Turkish sentence embedding model trained with contrastive learning (InfoNCE loss) on Turkish paraphrase pairs.
Model Description
This model encodes Turkish sentences into 512-dimensional dense vectors that can be used for:
- Semantic similarity
- Semantic search / retrieval
- Clustering
- Paraphrase detection
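As a rough illustration of the retrieval use case, the sketch below ranks a small corpus against a query by cosine similarity. It is written in plain Python so it runs without any dependencies. The helper names `cosine` and `top_k` and the 3-dimensional toy vectors are illustrative assumptions, not part of this model's API; in practice the vectors would be the 512-dimensional embeddings produced by the model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, corpus_vecs, k=3):
    """Return (index, score) pairs for the k corpus vectors closest to the query."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for real sentence embeddings (assumption).
corpus = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]

results = top_k(query, corpus, k=2)
for index, score in results:
    print(f"corpus[{index}] score={score:.4f}")
```

The same ranking also covers paraphrase detection if a similarity threshold is applied to the top score; a suitable threshold would have to be tuned on held-out data.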
Usage
Using with custom code
import torch
from transformers import AutoModel, AutoTokenizer
# Load model
model = AutoModel.from_pretrained("Basar2004/turkish-sentence-encoder", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Basar2004/turkish-sentence-encoder")
# Encode sentences
sentences = ["Bugün hava çok güzel.", "Hava bugün oldukça hoş."]
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=64, return_tensors="pt")
# Disable gradient tracking for inference
with torch.no_grad():
    embeddings = model(**inputs)
# Compute similarity
from torch.nn.functional import cosine_similarity
similarity = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0))
print(f"Similarity: {similarity.item():.4f}")
Using with Sentence-Transformers (after installing custom wrapper)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Basar2004/turkish-sentence-encoder")
embeddings = model.encode(["Merhaba dünya!", "Selam dünya!"])
Evaluation Results
| Metric | Score |
|---|---|
| Spearman Correlation | 0.7315 |
| Pearson Correlation | 0.8593 |
| Paraphrase Accuracy | 0.9695 |
| MRR | 0.9172 |
| Recall@1 | 0.87 |
| Recall@5 | 0.97 |
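The model card does not say how the retrieval metrics were computed. For readers unfamiliar with them, the sketch below shows the usual definitions of MRR and Recall@k, assuming exactly one relevant item per query. The function names and the toy rankings are illustrative only and do not reproduce the scores in the table.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """1 / rank of the relevant item (1-based), or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings, relevant_ids):
    """Average reciprocal rank over all queries."""
    total = sum(reciprocal_rank(r, rel) for r, rel in zip(rankings, relevant_ids))
    return total / len(rankings)

def recall_at_k(rankings, relevant_ids, k):
    """Fraction of queries whose relevant item appears in the top k results."""
    hits = sum(1 for r, rel in zip(rankings, relevant_ids) if rel in r[:k])
    return hits / len(rankings)

# Toy data (assumption): three queries, each with "a" as the single relevant item.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
relevant = ["a", "a", "a"]

mrr = mean_reciprocal_rank(rankings, relevant)
r1 = recall_at_k(rankings, relevant, k=1)
r2 = recall_at_k(rankings, relevant, k=2)
print(f"MRR={mrr:.4f} Recall@1={r1:.4f} Recall@2={r2:.4f}")
```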
Training Details
- Training Data: 200K Turkish paraphrase pairs
- Loss Function: InfoNCE (contrastive loss)
- Temperature: 0.05
- Batch Size: 32
- Base Model: Custom Transformer encoder pretrained with MLM on Turkish text
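To make the loss concrete, here is a minimal sketch of InfoNCE at temperature 0.05. It assumes cosine similarity as the scoring function and in-batch negatives (every other pair's positive acts as a negative); the model card states only the loss name, the temperature, and the batch size, so both of those choices are assumptions rather than the actual training code. Under that assumption, a batch of 32 pairs gives each anchor 31 negatives.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (assumed setup).

    anchors[i] and positives[i] form a paraphrase pair; every
    positives[j] with j != i is treated as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the matching positive as the target class.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy 2-dimensional embeddings (assumption) for a batch of two pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned={aligned_loss:.6f} swapped={swapped_loss:.6f}")
```

The low temperature sharpens the softmax: a cosine gap of 1.0 becomes a logit gap of 20, so correctly aligned pairs yield a loss near zero while mismatched pairs are penalized heavily.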
Architecture
- Hidden Size: 512
- Layers: 12
- Attention Heads: 8
- Max Sequence Length: 64
- Vocab Size: 32,000 (Unigram tokenizer)
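The figures above can be collected into a small config object to derive a few quantities. This is a sketch, not the model's real configuration class: `EncoderConfig` is a hypothetical name, and the feed-forward width (4 x hidden size) and learned absolute position embeddings are assumptions the model card does not confirm. The parameter estimate counts weight matrices only and should be read as an order-of-magnitude figure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EncoderConfig:
    # Values taken from the model card.
    hidden_size: int = 512
    num_layers: int = 12
    num_heads: int = 8
    max_seq_len: int = 64
    vocab_size: int = 32_000
    # ASSUMPTION: feed-forward width of 4 x hidden_size (not stated in the card).
    ffn_size: int = 2048

    @property
    def head_dim(self):
        """Per-head dimension in standard multi-head attention."""
        return self.hidden_size // self.num_heads

    def approx_params(self):
        """Rough weight count, ignoring biases and layer norms."""
        token_emb = self.vocab_size * self.hidden_size
        # ASSUMPTION: learned absolute position embeddings.
        pos_emb = self.max_seq_len * self.hidden_size
        attention = 4 * self.hidden_size * self.hidden_size  # Q, K, V, output
        ffn = 2 * self.hidden_size * self.ffn_size           # up and down projections
        return token_emb + pos_emb + self.num_layers * (attention + ffn)

config = EncoderConfig()
print(f"head_dim={config.head_dim}")
print(f"approx_params={config.approx_params():,}")
```

Under these assumptions the estimate comes to roughly 54 million weights, of which about 16 million sit in the token embedding table.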
Limitations
- Optimized for Turkish language only
- Maximum sequence length is 64 tokens; with `truncation=True`, as in the usage example, longer inputs are truncated
- Best suited for sentence-level (not document-level) embeddings
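One common workaround for the 64-token limit, shown here only as a sketch, is to split a long token sequence into overlapping windows, embed each window separately, and average the resulting vectors. The helpers `chunk_token_ids` and `mean_pool` and the stride of 32 are hypothetical choices, and this approach has not been evaluated for this model. Real tokenizers may also add special tokens, which would reduce the usable window size.

```python
def chunk_token_ids(token_ids, max_len=64, stride=32):
    """Split a token-id sequence into overlapping windows of at most max_len tokens."""
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    if len(token_ids) <= max_len:
        return [list(token_ids)]
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(list(token_ids[start:start + max_len]))
        # Stop once a window has reached the end of the sequence.
        if start + max_len >= len(token_ids):
            break
    return chunks

def mean_pool(vectors):
    """Element-wise average of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]

# Toy input (assumption): 150 fake token ids standing in for a long document.
token_ids = list(range(150))
chunks = chunk_token_ids(token_ids, max_len=64, stride=32)
print([len(chunk) for chunk in chunks])

# Each chunk would be embedded by the model; toy vectors are averaged here.
document_vector = mean_pool([[1.0, 3.0], [3.0, 5.0]])
print(document_vector)
```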
License
Apache 2.0