TabiBERT
Model Summary
TabiBERT is a modernized encoder-only Transformer model (BERT-style) based on the ModernBERT-base architecture. TabiBERT is pre-trained on 86 billion tokens of diverse data spanning Turkish, English, code, and math, with a native context length of up to 8,192 tokens.
TabiBERT inherits ModernBERT’s architectural improvements, such as:
- Rotary Positional Embeddings (RoPE) for long-context support.
- Local-Global Alternating Attention for efficiency on long inputs.
- Unpadding and Flash Attention for efficient inference.
This makes TabiBERT particularly suitable for:
- Turkish NLP tasks (classification, QA, retrieval, NLI, etc.).
- Multilingual text understanding (Turkish-English).
- Code retrieval and representation learning.
- Mathematical and symbolic reasoning.
- Long-context understanding such as document classification, retrieval, and semantic search.
TabiBERT is built by Tabilab in collaboration with VNGRS.
Usage
You can use TabiBERT directly with the transformers library (v4.48.0+):
```bash
pip install -U "transformers>=4.48.0"
```
Since TabiBERT is a Masked Language Model (MLM), you can use the fill-mask pipeline or load it via AutoModelForMaskedLM.
⚠️ If your GPU supports it, we recommend using TabiBERT with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:
```bash
pip install flash-attn
```
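With flash-attn installed, the model should pick up the faster kernels automatically. You can also request them explicitly when loading; a minimal sketch is shown below (the bfloat16 dtype is an assumption, use whatever precision your GPU supports):

```python
import torch
from transformers import AutoModelForMaskedLM

# Explicitly request the Flash Attention 2 kernels when loading the model.
model = AutoModelForMaskedLM.from_pretrained(
    "boun-tabilab/TabiBERT",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
```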
Example usage with AutoModelForMaskedLM:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

text = "[MASK] Sistemi'ndeki en büyük gezegen Jüpiter'dir."
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Find the [MASK] position and pick the highest-scoring token for it.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_id = outputs.logits[0, masked_index].argmax(dim=-1)
print("Predicted token:", tokenizer.decode(predicted_id))
# Predicted token: Güneş
```
Example with pipeline:
```python
from transformers import pipeline

pipe = pipeline("fill-mask", model="boun-tabilab/TabiBERT")
print(pipe("[MASK], Türkiye Cumhuriyeti'nin başkentidir."))
```
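For the retrieval and semantic-search use cases listed above, you can also load TabiBERT as a plain encoder and pool its hidden states into sentence embeddings. The sketch below uses simple mean pooling over non-padding tokens; the pooling strategy and example sentences are illustrative assumptions, not an official recipe:

```python
from transformers import AutoTokenizer, AutoModel
import torch

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = [
    "Ankara Türkiye'nin başkentidir.",
    "İstanbul Türkiye'nin en kalabalık şehridir.",
]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {sim.item():.3f}")
```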
Pre-training Data
TabiBERT has been pre-trained on 86 billion tokens of diverse data, primarily:
- A large-scale Turkish corpus covering literature, news, social media, Wikipedia, and academic texts.
- English text, code with English commentary, and math problems in English, together making up the roughly 13% of tokens that are not Turkish.
Evaluation
Evaluations are in progress.
We are currently running finetuning and benchmark evaluations for TabiBERT across the following areas:
- Turkish NLU benchmarks (classification, NLI, sentiment, QA).
- Multilingual retrieval (Turkish ↔ English).
- Code retrieval tasks.
📊 Results will be published soon in this section.
Limitations
- TabiBERT was trained mainly on Turkish, with additional English, code, and math. Its performance on English may be limited relative to Turkish, and it may underperform on other languages.
- As with any large-scale model, it may inherit biases from training data.
- While the model supports inputs of up to 8,192 tokens, inference on very long sequences may be slower.
- The model is still under evaluation; we recommend validating results before deploying it in critical applications.
Training
- Architecture: Encoder-only, Pre-Norm Transformer with GeGLU activations.
- Sequence Length: Pre-trained up to 1,024 tokens, then extended to 8,192 tokens.
- Data: 86 billion tokens from a combined corpus (Turkish, plus English, code with English commentary, and math in English; ~13% non-Turkish).
- Optimizer: StableAdamW with trapezoidal LR scheduling and 1-sqrt decay (see the sketch after this list).
- Hardware: Trained on 8x H100 GPUs.
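As a point of reference, the trapezoidal schedule (linear warmup, constant plateau, then decay) with a 1-sqrt decay phase can be sketched as a small function. The step counts and peak learning rate below are placeholders for illustration, not the actual training hyperparameters:

```python
import math

def trapezoidal_lr(step, peak_lr, warmup_steps, decay_steps, total_steps):
    """Trapezoidal (warmup-stable-decay) schedule with a 1-sqrt decay phase."""
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    if step < total_steps - decay_steps:
        # Constant plateau at peak_lr.
        return peak_lr
    # 1-sqrt decay: the LR falls from peak_lr to 0 over the final decay window.
    progress = (step - (total_steps - decay_steps)) / decay_steps
    return peak_lr * (1.0 - math.sqrt(progress))

# Placeholder values, only to show the shape of the schedule.
for s in (0, 1_000, 50_000, 95_000, 100_000):
    print(s, round(trapezoidal_lr(s, peak_lr=8e-4, warmup_steps=2_000,
                                  decay_steps=10_000, total_steps=100_000), 6))
```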
License
Released under the Apache 2.0 license.
Citation
Citation is in progress.