TabiBERT
Model Summary
TabiBERT is a modernized encoder-only Transformer model (BERT-style) based on the ModernBERT-base architecture. TabiBERT is pre-trained on 86 billion tokens of diverse data spanning Turkish, English, code, and math, with a native context length of up to 8,192 tokens.
TabiBERT inherits ModernBERT’s architectural improvements, such as:
- Rotary Positional Embeddings (RoPE) for long-context support.
- Local-Global Alternating Attention for efficiency on long inputs.
- Unpadding and Flash Attention for efficient inference.
This makes TabiBERT particularly suitable for:
- Turkish NLP tasks (classification, QA, retrieval, NLI, etc.).
- Multilingual text understanding (Turkish-English).
- Code retrieval and representation learning.
- Mathematical and symbolic reasoning.
- Long-context understanding such as document classification, retrieval, and semantic search.
TabiBERT is built by Tabilab in collaboration with VNGRS.
Usage
You can use TabiBERT directly with the transformers library (v4.48.0+):
```bash
pip install -U "transformers>=4.48.0"
```
Since TabiBERT is a Masked Language Model (MLM), you can use the fill-mask pipeline or load it via AutoModelForMaskedLM.
⚠️ If your GPU supports it, we recommend using TabiBERT with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:
```bash
pip install flash-attn
```
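With flash-attn installed, the model should pick up the faster kernels automatically. You can also request them explicitly when loading; a minimal sketch is shown below (the bfloat16 dtype is an assumption, use whatever precision your GPU supports):

```python
import torch
from transformers import AutoModelForMaskedLM

# Explicitly request the Flash Attention 2 kernels when loading the model.
model = AutoModelForMaskedLM.from_pretrained(
    "boun-tabilab/TabiBERT",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
```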
Example usage with AutoModelForMaskedLM:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

text = "[MASK] Sistemi'ndeki en büyük gezegen Jüpiter'dir."
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Find the [MASK] position and pick the highest-scoring token for it.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_id = outputs.logits[0, masked_index].argmax(dim=-1)
print("Predicted token:", tokenizer.decode(predicted_id))
# Predicted token: Güneş
```
Example with pipeline:
```python
from transformers import pipeline

pipe = pipeline("fill-mask", model="boun-tabilab/TabiBERT")
print(pipe("[MASK], Türkiye Cumhuriyeti'nin başkentidir."))
```
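For the retrieval and semantic-search use cases listed above, you can also load TabiBERT as a plain encoder and pool its hidden states into sentence embeddings. The sketch below uses simple mean pooling over non-padding tokens; the pooling strategy and example sentences are illustrative assumptions, not an official recipe:

```python
from transformers import AutoTokenizer, AutoModel
import torch

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = [
    "Ankara Türkiye'nin başkentidir.",
    "İstanbul Türkiye'nin en kalabalık şehridir.",
]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {sim.item():.3f}")
```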
Pre-training Data
TabiBERT has been pre-trained on 86 billion tokens of diverse data, primarily:
- A large-scale Turkish corpus covering literature, news, social media, Wikipedia, and academic texts.
- English text, code with English commentary, and math problems in English, together making up the roughly 13% of tokens that are not Turkish.
Evaluation
Evaluations are in progress.
We are currently running finetuning and benchmark evaluations for TabiBERT across the following areas:
- Turkish NLU benchmarks (classification, NLI, sentiment, QA).
- Multilingual retrieval (Turkish ↔ English).
- Code retrieval tasks.
📊 Results will be published soon in this section.
Limitations
- TabiBERT was trained mainly on Turkish, with additional English, code, and math. Its performance on English may be limited relative to Turkish, and it may underperform on other languages.
- As with any large-scale model, it may inherit biases from training data.
- While the model supports inputs of up to 8,192 tokens, inference on very long sequences may be slower.
- The model is still under evaluation; we recommend validating results before deploying it in critical applications.
Training
- Architecture: Encoder-only, Pre-Norm Transformer with GeGLU activations.
- Sequence Length: Pre-trained up to 1,024 tokens, then extended to 8,192 tokens.
- Data: 86 billion tokens from a combined corpus (Turkish, plus English, code with English commentary, and math in English; ~13% non-Turkish).
- Optimizer: StableAdamW with trapezoidal LR scheduling and 1-sqrt decay (see the sketch after this list).
- Hardware: Trained on 8x H100 GPUs.
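As a point of reference, the trapezoidal schedule (linear warmup, constant plateau, then decay) with a 1-sqrt decay phase can be sketched as a small function. The step counts and peak learning rate below are placeholders for illustration, not the actual training hyperparameters:

```python
import math

def trapezoidal_lr(step, peak_lr, warmup_steps, decay_steps, total_steps):
    """Trapezoidal (warmup-stable-decay) schedule with a 1-sqrt decay phase."""
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    if step < total_steps - decay_steps:
        # Constant plateau at peak_lr.
        return peak_lr
    # 1-sqrt decay: the LR falls from peak_lr to 0 over the final decay window.
    progress = (step - (total_steps - decay_steps)) / decay_steps
    return peak_lr * (1.0 - math.sqrt(progress))

# Placeholder values, only to show the shape of the schedule.
for s in (0, 1_000, 50_000, 95_000, 100_000):
    print(s, round(trapezoidal_lr(s, peak_lr=8e-4, warmup_steps=2_000,
                                  decay_steps=10_000, total_steps=100_000), 6))
```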
License
Released under the Apache 2.0 license.
Citation
Citation is in progress.