TabiBERT

Table of Contents

  1. Model Summary
  2. Usage
  3. Pre-training Data
  4. Evaluation
  5. Limitations
  6. Training
  7. License
  8. Citation

Model Summary

TabiBERT is a modernized encoder-only Transformer model (BERT-style) based on the ModernBERT-base architecture. TabiBERT is pre-trained on 86 billion tokens of a diverse dataset spanning Turkish, English, code, and math, with a native context length of up to 8,192 tokens.

TabiBERT inherits ModernBERT’s architectural improvements, such as:

  • Rotary Positional Embeddings (RoPE) for long-context support.
  • Local-Global Alternating Attention for efficiency on long inputs.
  • Unpadding and Flash Attention for efficient inference.

This makes TabiBERT particularly suitable for:

  • Turkish NLP tasks (classification, QA, retrieval, NLI, etc.).
  • Multilingual text understanding (Turkish-English).
  • Code retrieval and representation learning.
  • Mathematical and symbolic reasoning.
  • Long-context understanding such as document classification, retrieval, and semantic search.

TabiBERT is built by Tabilab in collaboration with VNGRS.


Usage

You can use TabiBERT directly with the transformers library (v4.48.0+):

pip install -U "transformers>=4.48.0"

Since TabiBERT is a Masked Language Model (MLM), you can use the fill-mask pipeline or load it via AutoModelForMaskedLM.

⚠️ If your GPU supports it, we recommend running TabiBERT with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:

pip install flash-attn
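With flash-attn installed, you can also request it explicitly when loading the model. A minimal sketch (the explicit attn_implementation argument and the half-precision dtype are our choices, not a requirement):

import torch
from transformers import AutoModelForMaskedLM

# Explicitly request Flash Attention 2 (requires a supported CUDA GPU and the
# flash-attn package installed above); recent transformers releases also pick
# it up automatically when available.
model = AutoModelForMaskedLM.from_pretrained(
    "boun-tabilab/TabiBERT",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
).to("cuda")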

Example usage with AutoModelForMaskedLM:

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# "The largest planet in the [MASK] System is Jupiter."
text = "[MASK] Sistemi'ndeki en büyük gezegen Jüpiter'dir."
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the highest-scoring token.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_id = outputs.logits[0, masked_index].argmax(dim=-1).item()
print("Predicted token:", tokenizer.decode(predicted_id))
# Predicted token:  Güneş

Example with pipeline:

from transformers import pipeline

pipe = pipeline("fill-mask", model="boun-tabilab/TabiBERT")

# "[MASK] is the capital of the Republic of Türkiye."
print(pipe("[MASK], Türkiye Cumhuriyeti'nin başkentidir."))
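For the retrieval and semantic-search use cases listed in the Model Summary, the encoder can also serve as a sentence-embedding backbone. A minimal sketch using mean pooling over the last hidden state (the pooling strategy and example sentences are ours, not an officially recommended recipe):

import torch
from transformers import AutoTokenizer, AutoModel

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)  # encoder without the MLM head

sentences = [
    "Ankara, Türkiye'nin başkentidir.",
    "İstanbul, Türkiye'nin en kalabalık şehridir.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {sim.item():.3f}")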

Pre-training Data

TabiBERT has been pre-trained on 86 billion tokens of diverse data, primarily:

  • A large-scale Turkish corpus covering literature, news, social media, Wikipedia, and academic texts.
  • English text, code with English commentary, and math problems in English, which together make up about 13% of the tokens.

Evaluation

Evaluations are in progress.

We are currently running finetuning and benchmark evaluations for TabiBERT across the following areas:

  • Turkish NLU benchmarks (classification, NLI, sentiment, QA).
  • Multilingual retrieval (Turkish ↔ English).
  • Code retrieval tasks.

📊 Results will be published soon in this section.
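For reference, a fine-tuning run for one of these benchmarks would look roughly like the sketch below; the dataset path, label count, and hyperparameters are placeholders rather than the settings used in our evaluations:

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "boun-tabilab/TabiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=2 is a placeholder for a binary task such as sentiment.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Placeholder dataset path; substitute the benchmark you are fine-tuning on.
dataset = load_dataset("path/to/turkish-text-classification")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="tabibert-finetuned",
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    processing_class=tokenizer,  # batches are padded by the default data collator
)
trainer.train()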

Limitations

  • TabiBERT was trained mainly on Turkish, with additional English, code, and math. Its performance on English may be limited relative to Turkish, and it may underperform on other languages.
  • As with any large-scale model, it may inherit biases present in its training data.
  • While the model can handle sequences of up to 8,192 tokens, inference on very long sequences may be slower.
  • The model is still under evaluation; we recommend validating results before deploying it in critical applications.

Training

  • Architecture: Encoder-only, Pre-Norm Transformer with GeGLU activations.
  • Sequence Length: Pre-trained up to 1,024 tokens, then extended to 8,192 tokens.
  • Data: 86 billion tokens from a combined corpus (primarily Turkish, plus English text, code with English commentary, and math in English; ~13% non-Turkish).
  • Optimizer: StableAdamW with a trapezoidal LR schedule and 1-sqrt decay (a sketch of this schedule follows the list).
  • Hardware: Trained on 8x H100 GPUs.
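A minimal sketch of how a trapezoidal (warmup-stable-decay) schedule with 1-sqrt decay is typically defined; all step counts and the peak learning rate below are illustrative, not TabiBERT's actual hyperparameters:

def trapezoidal_lr(step, peak_lr, warmup_steps, total_steps, decay_steps):
    """Linear warmup, constant plateau, then 1-sqrt decay to zero."""
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    if step < total_steps - decay_steps:
        # Constant plateau at peak_lr.
        return peak_lr
    # 1-sqrt decay: lr = peak_lr * (1 - sqrt(fraction of the decay phase done)).
    progress = (step - (total_steps - decay_steps)) / decay_steps
    return peak_lr * (1.0 - progress ** 0.5)

# Illustrative values only.
for step in (500, 50_000, 95_000, 100_000):
    print(step, trapezoidal_lr(step, peak_lr=8e-4, warmup_steps=1_000,
                               total_steps=100_000, decay_steps=10_000))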

License

Released under the Apache 2.0 license.

Citation

Citation is in progress.
